LISTSERV 16.5 - CODE4LIB Archives

I think this is a data structure problem... MARC is well structured
for compact transmission (or was at one point) but not so much for
data (re)use (in my opinion).

One solution, as Erik has suggested, is to parse the data and build
intelligible indices.  Another, as Andrew suggests (and which I think
Endeca does too at least as a preliminary step), is to map to a more
reasonably arranged data structure (in XML) and index that.
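
A minimal sketch of the first option (parse the data and build your own index), assuming records arrive as MARCXML in the standard http://www.loc.gov/MARC21/slim namespace. The function name build_index, the (tag, word) key layout, and the sample record are hypothetical and chosen only for illustration; this is not Erik's actual approach.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

NS = {"marc": "http://www.loc.gov/MARC21/slim"}

INDEX_SAMPLE = """<record xmlns="http://www.loc.gov/MARC21/slim">
  <datafield tag="245" ind1="1" ind2="0">
    <subfield code="a">Moby Dick :</subfield>
    <subfield code="b">or, The whale</subfield>
  </datafield>
</record>"""


def build_index(records):
    """Build an inverted index: (MARC tag, lowercased word) -> set of record ids.

    `records` is an iterable of (record_id, record_element) pairs.
    (Hypothetical helper; the key layout is an assumption.)
    """
    index = defaultdict(set)
    for rec_id, record in records:
        for field in record.findall("marc:datafield", NS):
            tag = field.get("tag")
            for sub in field.findall("marc:subfield", NS):
                for raw in (sub.text or "").lower().split():
                    word = raw.strip(".,:;/")  # drop ISBD-style punctuation
                    if word:
                        index[(tag, word)].add(rec_id)
    return index


index = build_index([("r1", ET.fromstring(INDEX_SAMPLE))])
print(sorted(index))
```

The point is that the lookup key is built from the attribute value (the tag), not from the element name, so the index no longer depends on how the XML happens to be laid out.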

Fwiw Andrew, I'd suggest you are not seeing the "true spirit of your
NXDB."  Try to put MARC into a RDBMS and you are going to run into the
same problem.  You have to index intelligently or reorganize the data
(which is the default when you put XML into a RDBMS anyway).  Perhaps
a criticism of NXDBs could be that they make it sound like they can
handle anything you throw at them without regard for what that is...
"If it is XML, we can handle it."

Data can have a structure that makes it more accessible or less.  The
promise of XML as a storage format (rather than as a transmission
format, its other purpose) is that you can work with data in its
native format (no deconstruction necessary).  However, there is
nothing about XML or NXDBs that makes one use a well structured data
format.

Kevin

ps: I'm still reeling at the idea of Elsevier making Lucene indices
available... wow, neat idea.

On 11/29/06, Andrew Nagy wrote:
> Clay Redding wrote:
>
> > Hi Andrew (or anyone else that cares to answer),
> >
> > I've missed out on hearing about incompatibilities between MARCXML and
> > NXDBs.   Can you explain?  Is this just eXist and Sleepycat, or are
> > there others?  I seem to recall putting a few records in X-Hive with no
> > problems, but I didn't put it through any paces.
>
> Yes, I have only done my testing with eXist and Sleepycat, but I also
> have an implementation of MarkLogic that I would like to test out.  I
> imagine though that all NXDBs will have the same problem.  This is the
> heart of my proposed talk.  It has to do with the layout of marcxml.
> Adding a few records to any NXDB will work like a charm; do your testing
> with 250,000+ records and then you will begin to see the true spirit of
> your NXDB.
>
> > Also, if there was a cure to the problems with MARCXML (I'm sure we can
> > all think of some), what would you suggest to help alleviate the
> > problems?
>
> Sure, I know of a cure!  I have come up with a modified marcxml schema,
> but as I am investigating SOLR further, I think the solr schema is also
> a cure.
>
> The problem with MARCXML is the fact that all of the elements have the
> same name and then use the attributes to differentiate them (excuse me
> while I barf); this makes indexing at the XML level very difficult,
> especially for NXDBs.  I got agreement from the main developers
> of both packages (eXist, Berkeley) on this front.  My schema just puts
> each of the MARC fields into its own element.  Instead of <datafield
> tag="245">, I created a field called <T245>, and instead of all of the
> subfields in multiple tags, I just put all of the subfields into one
> element.  No one needs to search (from my perspective) the subtitle
> ("b") separately from the main ("a") title, so I just made a really
> simple xml document that is 1/4 the size.  By doing this I was able to
> take a 45 minute search of marcxml records and reduce it down to results
> in 1 second.  The main boost was not the reduction in file size, but the
> way the indexing works.
>
> Give it a shot, I promise better results!
>
> Andrew
>
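
The flattening Andrew describes above could be sketched roughly as follows. This is an illustration under stated assumptions, not his actual schema or conversion code: only the <T245> naming comes from his message, while the function name flatten_record, the choice to join subfields with a single space, and the sample record are hypothetical. Control fields and the leader are ignored here.

```python
import xml.etree.ElementTree as ET

MARC_NS = "http://www.loc.gov/MARC21/slim"

FLATTEN_SAMPLE = """<record xmlns="http://www.loc.gov/MARC21/slim">
  <datafield tag="245" ind1="1" ind2="0">
    <subfield code="a">Moby Dick :</subfield>
    <subfield code="b">or, The whale</subfield>
  </datafield>
  <datafield tag="100" ind1="1" ind2=" ">
    <subfield code="a">Melville, Herman</subfield>
  </datafield>
</record>"""


def flatten_record(record):
    """Return a new <record> holding one <Tnnn> element per MARC datafield.

    All subfields of a datafield are merged into a single text value
    (an assumption; joined with one space).
    """
    flat = ET.Element("record")
    for field in record.findall(f"{{{MARC_NS}}}datafield"):
        text = " ".join(
            (sub.text or "").strip()
            for sub in field.findall(f"{{{MARC_NS}}}subfield")
        )
        # The element name now carries the tag, e.g. <T245>.
        ET.SubElement(flat, "T" + field.get("tag")).text = text
    return flat


flat = flatten_record(ET.fromstring(FLATTEN_SAMPLE))
print(ET.tostring(flat, encoding="unicode"))
```

With distinct element names, a title search can address T245 directly by name rather than filtering every datafield element on an attribute value, which is the indexing difference the message points to.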