On Fri, May 9, 2008 at 11:14 AM, Bess Sadler <[log in to unmask]> wrote:

>
> Casey, you say you're getting indexing times of 1000 records /
> second? That's amazing! I really have to take a closer look at
> MarcThing. Could pymarc really be that much faster than marc4j? Or
> are we comparing apples to oranges since we haven't normalized for
> the kinds of mapping we're doing and the hardware it's running on?
>

Well, you can't take a closer look at it yet, since I haven't gotten off my
lazy butt and released it.  We're still using an older version of the
project in production on LT.  I'm going to cut us over to the latest version
this weekend.  At this point, being able to say we eat our own dogfood is
the only barrier to release.

The last indexer I wrote (the one used by fac-back-opac) used marc4j and was
around 100-150 records a second.  Some of the boost was due to better-designed
code on my end, but I can't take too much credit.  Pymarc is much, much
faster.  I never bothered to figure out why.  (That wasn't why I switched,
though -- there are some problems with parsing ANSEL with marc4j (*) which I
decided I'd rather be mauled by bears than try to fix -- the performance
boost was just a pleasant surprise.)  Of course, one could use pymarc from
Java with Jython.
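
For anyone curious what a bare pymarc read loop looks like, here's a minimal
sketch (the file path and the timing harness are made up for illustration;
this isn't MARCThing's actual indexing code):

    # Minimal pymarc read-loop timing -- a sketch, not MARCThing itself.
    # Assumes a file of binary MARC21 records at "records.mrc" (made-up path).
    import time

    from pymarc import MARCReader

    start = time.time()
    count = 0
    with open('records.mrc', 'rb') as fh:
        for record in MARCReader(fh):
            if record is None:
                continue  # skip records pymarc couldn't parse
            count += 1
    elapsed = time.time() - start
    print('%d records in %.2f seconds (%.0f/sec)' % (count, elapsed, count / elapsed))

A real indexer would do its field mapping inside that loop before handing
documents to Solr, so the raw parse rate is only an upper bound.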

Undoubtedly we're comparing apples to oranges here.  1000/sec. is about what
I can get on my MacBook Pro on some random MARC records I have lying around,
with plenty of hand-waving involved.  MARCThing does do a fair amount of
munging for expanding codes, guessing physical format and what-have-you (but
nothing with dates, which is sorely needed), but I think it would be a bad
idea to read too much into some anecdotal numbers.
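
As a rough illustration of what that format guessing involves, here's a toy
example keyed off the MARC leader (the mapping below is my own invention,
not MARCThing's actual rules, which are considerably hairier):

    # A sketch of leader-based format guessing.  Position 6 of the leader
    # is the type of record; position 7 is the bibliographic level.
    def guess_physical_format(record):
        type_of_record = record.leader[6]
        bib_level = record.leader[7]
        if type_of_record == 'a' and bib_level == 's':
            return 'Journal/Serial'
        if type_of_record == 'a':
            return 'Book'
        if type_of_record == 'j':
            return 'Music recording'
        if type_of_record == 'g':
            return 'Video'
        if type_of_record == 'm':
            return 'Computer file'
        return 'Unknown'

In practice you also want to look at the 008 and 007 fields, since the
leader alone misclassifies plenty of material.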

--Casey

(*) In marc4j's defense, this is actually due to a bug in the Horizon ILS.