On Fri, May 09, 2008 at 11:58:03AM -0700, Casey Durfee wrote:
> On Fri, May 9, 2008 at 11:14 AM, Bess Sadler wrote:
>
> > Casey, you say you're getting indexing times of 1000 records /
> > second? That's amazing! I really have to take a closer look at
> > MarcThing. Could pymarc really be that much faster than marc4j? Or
> > are we comparing apples to oranges since we haven't normalized for
> > the kinds of mapping we're doing and the hardware it's running on?
>
> Well, you can't take a closer look at it yet, since I haven't gotten off my
> lazy butt and released it. We're still using an older version of the
> project in production on LT. I'm going to cut us over to the latest version
> this weekend. At this point, being able to say we eat our own dogfood is
> the only barrier to release.

Looking forward to the release. I'd be interested to see how it compares
to the pymarc-indexer branch in FBO/Helios [1].

> The last indexer I wrote (the one used by fac-back-opac) used marc4j and was
> around 100-150 a second. Some of the boost was due to better designed code
> on my end, but I can't take too much credit. Pymarc is much, much faster.
> I never bothered to figure out why. (That wasn't why I switched, though --
> there are some problems with parsing ANSEL with marc4j (*) which I decided
> I'd rather be mauled by bears than try and fix -- the performance boost was
> just a pleasant surprise). Of course one could use pymarc from java with
> Jython.

On the small set of documents I'm now indexing (3327) I get 141 rec/sec.
This is on my test server, an AMD64 whose processor speed I can't recall.
That rate includes pymarc processing (~65%) and the loading of the CSV
file into SOLR (~35%). Surely there's some room there for optimization,
but it's fast enough for my current purposes.

Also, I'm in the camp that would be happy with a ~10,000 record test set.
There will always be some edge cases that we'll only solve as they're
encountered.
I need rapid iteration!

Gabriel

[1] http://fruct.us/trac/fbo/browser/branches/pymarc-indexer/indexer
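
As a rough check on the figures quoted above (3327 records at 141 rec/sec, split roughly 65% pymarc processing and 35% CSV load into Solr), the sketch below works out how the wall-clock time divides up. The function name `indexing_breakdown` is hypothetical, the percentages are the approximate ones from the message, and the "ceiling" figure assumes the Solr load time stays fixed while only the parsing step gets faster, which may not hold in practice.

```python
def indexing_breakdown(records, rate_per_sec, parse_share):
    """Split total indexing time into parse and load portions.

    records      -- number of records indexed
    rate_per_sec -- observed end-to-end rate (records per second)
    parse_share  -- approximate fraction of time spent parsing (0..1)
    """
    total = records / rate_per_sec   # end-to-end wall-clock seconds
    parse = total * parse_share      # seconds spent in MARC processing
    load = total - parse             # seconds spent loading the CSV
    return total, parse, load


# Figures taken from the message; the 0.65 share is approximate.
total, parse, load = indexing_breakdown(3327, 141, 0.65)
print(f"total {total:.1f}s, parse {parse:.1f}s, load {load:.1f}s")

# Assumption: load time is unchanged. Then even an instantaneous
# parser could not push the end-to-end rate past this ceiling.
ceiling = 3327 / load
print(f"rate ceiling with free parsing: {ceiling:.0f} rec/sec")
```

Under these assumptions the whole run takes well under half a minute, and the load step alone bounds the achievable rate, so parsing is the first place to look for any speedup.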