On Fri, May 09, 2008 at 11:58:03AM -0700, Casey Durfee wrote:
> On Fri, May 9, 2008 at 11:14 AM, Bess Sadler wrote:
>
> >
> > Casey, you say you're getting indexing times of 1000 records /
> > second? That's amazing! I really have to take a closer look at
> > MarcThing. Could pymarc really be that much faster than marc4j? Or
> > are we comparing apples to oranges since we haven't normalized for
> > the kinds of mapping we're doing and the hardware it's running on?
> >
>
> Well, you can't take a closer look at it yet, since I haven't gotten off my
> lazy butt and released it. We're still using an older version of the
> project in production on LT. I'm going to cut us over to the latest version
> this weekend. At this point, being able to say we eat our own dogfood is
> the only barrier to release.

Looking forward to the release. I'd be interested to see how it
compares to the pymarc-indexer branch in FBO/Helios [1].

> The last indexer I wrote (the one used by fac-back-opac) used marc4j and was
> around 100-150 a second. Some of the boost was due to better designed code
> on my end, but I can't take too much credit. Pymarc is much, much faster.
> I never bothered to figure out why. (That wasn't why I switched, though --
> there are some problems with parsing ANSEL with marc4j (*) which I decided
> I'd rather be mauled by bears than try and fix -- the performance boost was
> just a pleasant surprise). Of course one could use pymarc from Java with
> Jython.

On the small set of documents I'm now indexing (3327) I get 141
rec/sec. This is on my test server, an AMD64 whose processor speed I
can't recall. That rate includes pymarc processing (~65%) and the
loading of the CSV file into Solr (~35%). Surely there's some room
there for optimization, but it's fast enough for my current purposes.
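
For a rough sense of what those figures imply, here is a minimal sketch
of the arithmetic. It assumes the ~65%/~35% split refers to wall-clock
time for the whole run; the function name `phase_breakdown` and the phase
labels are made up for illustration and are not part of any indexer.

```python
def phase_breakdown(records, rate_per_sec, shares):
    """Split a run's total wall-clock time across phases.

    `shares` maps a phase label to its assumed fraction of the total time.
    Returns (total_seconds, {label: seconds}).
    """
    total_seconds = records / rate_per_sec
    per_phase = {name: total_seconds * share for name, share in shares.items()}
    return total_seconds, per_phase


# Figures from the message above; the labels are hypothetical.
RECORDS = 3327
total, phases = phase_breakdown(
    RECORDS, 141, {"pymarc": 0.65, "solr_csv_load": 0.35}
)

print(f"total: {total:.1f} s")
for name, seconds in phases.items():
    # Throughput each phase would show if it were the only work done.
    print(f"{name}: {seconds:.1f} s, {RECORDS / seconds:.0f} rec/sec on its own")
```

Under that assumption the whole run takes a little under 24 seconds,
with about 15 of them in pymarc and about 8 in the Solr CSV load.
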
Also, I'm in the camp that would be happy with a ~10,000 record test
set. There will always be some edge cases that we'll only solve as
they're encountered. I need rapid iteration!

Gabriel

[1] http://fruct.us/trac/fbo/browser/branches/pymarc-indexer/indexer