I think this is a data structure problem... MARC is well structured for compact transmission (or was at one point) but not so much for data (re)use (in my opinion). One solution, as Erik has suggested, is to parse the data and build intelligible indices. Another, as Andrew suggests (and which I think Endeca does too, at least as a preliminary step), is to map to a more reasonably arranged data structure (in XML) and index that.

Fwiw Andrew, I'd suggest you are not seeing the "true spirit of your NXDB." Try to put MARC into an RDBMS and you are going to run into the same problem. You have to index intelligently or reorganize the data (which is the default when you put XML into an RDBMS anyway).

Perhaps a criticism of NXDBs could be that they make it sound like they can handle anything you throw at them without regard for what that is... "If it is XML, we can handle it." Data can have a structure that makes it more accessible or less. The promise of XML as a storage format (rather than a transmission format, which is its other purpose) is that you can work with data in its native format, no deconstruction necessary. However, there is nothing about XML or NXDBs that forces one to use a well-structured data format.

Kevin

ps: I'm still reeling at the idea of Elsevier making Lucene indices available... wow, neat idea.

On 11/29/06, Andrew Nagy <[log in to unmask]> wrote:
> Clay Redding wrote:
>
> > Hi Andrew (or anyone else that cares to answer),
> >
> > I've missed out on hearing about incompatibilities between MARCXML and
> > NXDBs. Can you explain? Is this just eXist and Sleepycat, or are
> > there others? I seem to recall putting a few records in X-Hive with no
> > problems, but I didn't put it through any paces.
>
> Yes, I have only done my testing with eXist and Sleepycat, but I also
> have an implementation of MarkLogic that I would like to test out. I
> imagine, though, that all NXDBs will have the same problem. This is the
> heart of my proposed talk.
> It has to do with the layout of MARCXML. Adding a few records to any
> NXDB will work like a charm; do your testing with 250,000+ records and
> then you will begin to see the true spirit of your NXDB.
>
> > Also, if there were a cure to the problems with MARCXML (I'm sure we
> > can all think of some), what would you suggest to help alleviate the
> > problems?
>
> Sure, I know of a cure! I have come up with a modified MARCXML schema,
> but as I am investigating Solr further, I think the Solr schema is also
> a cure.
>
> The problem with MARCXML is the fact that all of the elements have the
> same name and then use attributes to differentiate them (excuse me
> while I barf). This makes indexing at the XML level very difficult,
> especially for NXDBs. The main developers of both packages (eXist,
> Berkeley) concurred with me on this front. My schema gives each MARC
> field its own element: instead of <datafield tag="245">, I created a
> field called <T245>, and instead of keeping the subfields in multiple
> tags, I just put all of the subfields into one element. No one needs to
> search (from my perspective) the subtitle ("b") separately from the
> main ("a") title, so I just made a really simple XML document that is
> 1/4 the size. By doing this I was able to take a 45-minute search of
> MARCXML records and reduce it down to results in 1 second. The main
> boost was not the reduction in file size, but the way the indexing
> works.
>
> Give it a shot, I promise better results!
>
> Andrew
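For anyone curious, the flattening Andrew describes can be sketched roughly as follows. This is a minimal illustration, not his actual schema or code: the element naming (<T245>) follows his description, while the helper name, sample record, and subfield-joining rule are assumptions.

```python
# Sketch: collapse each MARCXML <datafield> and its <subfield> children
# into a single element named after the field tag, e.g. <T245>, with the
# subfield texts concatenated. Illustrative only, not Andrew's code.
import xml.etree.ElementTree as ET

MARC_NS = "http://www.loc.gov/MARC21/slim"

def flatten_record(record):
    """Return a flat <record> where each datafield becomes one <Tnnn>
    element holding the concatenated subfield text."""
    flat = ET.Element("record")
    for df in record.findall(f"{{{MARC_NS}}}datafield"):
        el = ET.SubElement(flat, "T" + df.get("tag"))
        el.text = " ".join(
            sf.text for sf in df.findall(f"{{{MARC_NS}}}subfield") if sf.text
        )
    return flat

marcxml = """<record xmlns="http://www.loc.gov/MARC21/slim">
  <datafield tag="245" ind1="1" ind2="0">
    <subfield code="a">A title :</subfield>
    <subfield code="b">a subtitle.</subfield>
  </datafield>
</record>"""

flat = flatten_record(ET.fromstring(marcxml))
print(ET.tostring(flat, encoding="unicode"))
# → <record><T245>A title : a subtitle.</T245></record>
```

Because every field now has a distinct element name, an NXDB can build a per-element index (e.g. on T245) instead of having to index every <datafield> and filter on attribute values at query time.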