LISTSERV 16.5 - CODE4LIB Archives

On Nov 28, 2006, at 3:28 PM, Andrew Nagy wrote:
> The major problem
> with it all is the ugly mess that is marcxml

This brings up an interesting point about just dropping our source
XML data into an XML-savvy database and using XQuery on it.

Maybe y'all have much cleaner data than I've seen, but my experience
with Rossetti Archive has had many XML data hurdles.  When I came on
board, Tamino was being used for the "search engine", with XPath
queries all over the place.  The raw data is not consistent, and a
single word query expanded into an enormous XPath query to look at
many elements and attributes, not to mention it was SLOW.  Analyzing
the user interface and the real-world searching needs, I wrote Java
code that normalized the data for searching purposes into a much
coarser-grained set of fields, indexing it into Lucene, and voila:
http://www.rossettiarchive.org/rose
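
As a rough illustration of that kind of normalization (not the actual
Rossetti Archive code, which was Java feeding Lucene), here is a minimal
Python sketch.  The record shapes, the field names "title" and "text",
and the helpers extract_fields, build_index and search are all
assumptions made up for the example, and a plain dictionary stands in
for the Lucene index.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical raw records: the title lives in an element in one document
# and in an attribute in the other, which is what makes raw XPath painful.
RAW_RECORDS = {
    "doc1": '<work><title>The Blessed Damozel</title>'
            '<note type="commentary">Early poem</note></work>',
    "doc2": '<work name="Proserpine"><desc><p>Oil painting</p></desc></work>',
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def extract_fields(xml_text):
    """Flatten one raw record into a small set of coarse search fields."""
    root = ET.fromstring(xml_text)
    # Gather titles from every place the raw data is known to hide them.
    titles = [el.text for el in root.iter("title") if el.text]
    if root.get("name"):
        titles.append(root.get("name"))
    # Catch-all field: every non-blank text node, plus the titles.
    body = [chunk for chunk in root.itertext() if chunk.strip()]
    return {"title": " ".join(titles), "text": " ".join(titles + body)}


def build_index(records):
    """Build an inverted index keyed by (field, token)."""
    index = defaultdict(set)
    for doc_id, xml_text in records.items():
        for field, value in extract_fields(xml_text).items():
            for token in tokenize(value):
                index[(field, token)].add(doc_id)
    return index


def search(index, word, field="text"):
    """A single-word query becomes a single dictionary lookup."""
    return sorted(index.get((field, word.lower()), set()))


if __name__ == "__main__":
    idx = build_index(RAW_RECORDS)
    print(search(idx, "proserpine"))
    print(search(idx, "damozel", field="title"))
```

The messy knowledge about where things live is paid for once, at index
time, so the query side never has to enumerate elements and attributes.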

The point is that even with super fast full-text searching with
XQuery, most of our archives are probably going to require hideous
expressions to query them using their raw structure, especially if
we have to account for data cleanup too (such as date formatting issues,
which we also have in RA raw data).
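
To make the date cleanup point concrete, here is a small Python sketch
of folding inconsistent date strings into one sortable year at index
time.  The sample formats are invented for illustration (the message
does not say what the RA date problems actually look like), and
normalize_year is a hypothetical helper.

```python
import re

# Assumption: a usable date contains a four-digit year from 1000 to 2099.
YEAR = re.compile(r"\b(1[0-9]{3}|20[0-9]{2})\b")


def normalize_year(raw):
    """Return the first plausible four-digit year in raw, or None."""
    match = YEAR.search(raw)
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    for raw in ["1850", "c. 1850", "1850-05-01", "May 1, 1850", "undated"]:
        print(repr(raw), "->", normalize_year(raw))
```

Doing this once while indexing keeps the query simple; the alternative
is teaching every query expression about every date variant.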

I realize I'm sounding anti-XQuery, which is sorta true, but only
because in the real world in which I work it works better to have
some custom digesting of the raw data than to just toss it in and
work with standards.  Indexing is lossy - it's about keying things
the way they need to be looked up.  If your data is clean, you're in
better shape than me.  And if XQuery on your raw data does what you
need, by all means I recommend it.

        Erik