LISTSERV 16.5 - CODE4LIB Archives

As we move towards experimenting with a Solr-based OPAC, I'm hoping to
persuade everyone involved that MODS is sufficient to drive the search
interface. Let MARC abide in the ILS, and become a mere spirit of malice
that gnaws itself in the shadows, but cannot again grow or take shape.

Peter

-----Original Message-----
From: Code for Libraries [mailto:[log in to unmask]] On Behalf Of
Andrew Nagy
Sent: Wednesday, November 29, 2006 8:14 AM
To: [log in to unmask]
Subject: Re: [CODE4LIB] code4lib lucene pre-conference

Clay Redding wrote:

> Hi Andrew (or anyone else that cares to answer),
>
> I've missed out on hearing about incompatibilities between MARCXML and
> NXDBs.  Can you explain?  Is this just eXist and Sleepycat, or are
> there others?  I seem to recall putting a few records in X-Hive with
> no problems, but I didn't put it through any paces.

Yes, I have only done my testing with eXist and Sleepycat, but I also
have an implementation of MarkLogic that I would like to test out.  I
imagine, though, that all NXDBs will have the same problem.  This is the
heart of my proposed talk.  It has to do with the layout of MARCXML.
Adding a few records to any NXDB will work like a charm; do your testing
with 250,000+ records and then you will begin to see the true spirit of
your NXDB.

> Also, if there was a cure to the problems with MARCXML (I'm sure we
> can all think of some), what would you suggest to help alleviate the
> problems?

Sure, I know of a cure!  I have come up with a modified MARCXML schema,
but as I investigate Solr further, I think the Solr schema is also
a cure.

The problem with MARCXML is that all of the field elements have the
same name and use attributes to differentiate them (excuse me
while I barf).  This makes indexing at the XML level very difficult,
especially for NXDBs.  The main developers of both packages (eXist,
Berkeley DB XML) agreed with me on this front.  My schema just puts
each MARC field into its own element.  Instead of <datafield
tag="245">, I created an element called <T245>, and instead of keeping
the subfields in multiple tags, I just put all of the subfields into one
element.  No one needs to search (from my perspective) the subtitle
("b") separately from the main ("a") title, so I just made a really
simple XML document that is 1/4 the size.  By doing this I was able to
take a search of MARCXML records that ran for 45 minutes and get results
in 1 second.  The main boost was not the reduction in file size, but the
way the indexing works.
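
To make the idea concrete, here is a rough Python sketch of the kind of
flattening described above.  It is not the actual schema or conversion
code; the function name flatten_record, the sample record, the choice to
join subfields with a single space, and the handling of control fields
are all assumptions made for illustration.  Only the <T245>-style naming
and the merging of subfields come from the description.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample record in standard MARCXML (MARC21 slim namespace).
SAMPLE = """<record xmlns="http://www.loc.gov/MARC21/slim">
  <leader>00000nam a2200000 a 4500</leader>
  <controlfield tag="001">12345</controlfield>
  <datafield tag="100" ind1="1" ind2=" ">
    <subfield code="a">Melville, Herman.</subfield>
  </datafield>
  <datafield tag="245" ind1="1" ind2="0">
    <subfield code="a">Moby Dick :</subfield>
    <subfield code="b">or, The whale /</subfield>
  </datafield>
</record>"""


def flatten_record(marcxml):
    """Return a <record> whose children are named after their MARC tag.

    Assumption: subfields are joined with one space, and indicators and
    the leader are dropped.  A real conversion may need to keep them.
    """
    src = ET.fromstring(marcxml)
    out = ET.Element("record")
    for field in src:
        kind = field.tag.rsplit("}", 1)[-1]  # strip the XML namespace
        if kind == "controlfield":
            text = (field.text or "").strip()
        elif kind == "datafield":
            parts = [(sf.text or "").strip() for sf in field]
            text = " ".join(p for p in parts if p)
        else:
            continue  # skip the leader and anything unexpected
        # <datafield tag="245"> becomes <T245>, and so on.
        ET.SubElement(out, "T" + field.get("tag")).text = text
    return out


if __name__ == "__main__":
    print(ET.tostring(flatten_record(SAMPLE), encoding="unicode"))
```

With documents in this shape, an index can be keyed on the element name
(T245) alone, instead of on an element name plus an attribute value.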

Give it a shot, I promise better results!

Andrew