As we move towards experimenting with a Solr-based OPAC, I'm hoping to
persuade everyone involved that MODS is sufficient to drive the search
interface. Let MARC abide in the ILS, and become a mere spirit of malice
that gnaws itself in the shadows, but cannot again grow or take shape.
Peter
-----Original Message-----
From: Code for Libraries [mailto:[log in to unmask]] On Behalf Of
Andrew Nagy
Sent: Wednesday, November 29, 2006 8:14 AM
To: [log in to unmask]
Subject: Re: [CODE4LIB] code4lib lucene pre-conference
Clay Redding wrote:
> Hi Andrew (or anyone else that cares to answer),
>
> I've missed out on hearing about incompatibilities between MARCXML and
> NXDBs. Can you explain? Is this just eXist and Sleepycat, or are
> there others? I seem to recall putting a few records in X-Hive with
> no problems, but I didn't put it through any paces.
Yes, I have only done my testing with eXist and Sleepycat, but I also
have an implementation of MarkLogic that I would like to test out. I
imagine, though, that all NXDBs will have the same problem. This is the
heart of my proposed talk. It has to do with the layout of MARCXML.
Adding a few records to any NXDB will work like a charm; do your testing
with 250,000+ records and then you will begin to see the true spirit of
your NXDB.
> Also, if there was a cure to the problems with MARCXML (I'm sure we
> can all think of some), what would you suggest to help alleviate the
> problems?
Sure, I know of a cure! I have come up with a modified MARCXML schema,
but as I am investigating Solr further, I think the Solr schema is also
a cure.
The problem with MARCXML is that all of the elements have the
same name and use attributes to differentiate them (excuse me
while I barf). This makes indexing at the XML level very difficult,
especially for NXDBs. The main developers of both packages (eXist,
Berkeley) agreed with me on this point. My schema just puts
each MARC field into its own element. Instead of <datafield
tag="245">, I created a field called <T245>, and instead of keeping the
subfields in multiple tags, I just put all of the subfields into one
element. No one needs to search (from my perspective) the subtitle
("b") separately from the main ("a") title, so I just made a really
simple XML document that is 1/4 the size. By doing this I was able to
take a 45-minute search of MARCXML records and reduce it to results
in 1 second. The main boost was not the reduction in file size, but the
way the indexing works.
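For anyone who wants to try the idea, here is a minimal sketch in Python
of that flattening step. It is not the actual schema or conversion code
described above, which is not included in this message. Apart from
<T245>, everything here is an assumption made for illustration: the
flatten_record name, the <record> wrapper, joining subfield values with
a single space, and dropping the leader and indicators.

```python
import xml.etree.ElementTree as ET


def flatten_record(record):
    """Turn one MARCXML <record> into a flat record with one <Tnnn> child per field.

    Hypothetical helper: the element naming follows the <T245> example,
    everything else is an assumption of this sketch.
    """
    flat = ET.Element("record")
    for field in record:
        local = field.tag.split("}")[-1]  # strip the XML namespace, if present
        if local == "controlfield":
            text = (field.text or "").strip()
        elif local == "datafield":
            # Collapse every subfield value into one string (assumed separator: a space).
            parts = [(sf.text or "").strip() for sf in field]
            text = " ".join(p for p in parts if p)
        else:
            continue  # the leader and anything else is skipped in this sketch
        ET.SubElement(flat, "T" + field.get("tag", "")).text = text
    return flat


# Made-up sample record, only for demonstration.
SAMPLE = """<record xmlns="http://www.loc.gov/MARC21/slim">
  <leader>00000nam a2200000 a 4500</leader>
  <controlfield tag="001">12345</controlfield>
  <datafield tag="245" ind1="1" ind2="0">
    <subfield code="a">Moby Dick :</subfield>
    <subfield code="b">or, The whale /</subfield>
    <subfield code="c">Herman Melville.</subfield>
  </datafield>
</record>"""

flat = flatten_record(ET.fromstring(SAMPLE))
print(ET.tostring(flat, encoding="unicode"))
```

With the flat form, an index can be keyed on the element name (T245)
alone, rather than on a generic element name plus an attribute value.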
Give it a shot, I promise better results!
Andrew