On Wed, 15 Dec 2004 14:28:51 -0500, Clay Redding <[log in to unmask]> wrote:
> Hi Eric,
>
> Not necessarily.  If you're up for trying PostgreSQL, their XML
> functionality works *really* well.
>
> http://www.throwingbeans.org/tech/postgresql_and_xml.html
[...]
> Eric Lease Morgan wrote:
>
> > Again, thank you for the prompt reply.
> >
> > Actually, the implementation you suggest was the path I was
> > considering. Unfortunately, this means leaving some sort of text lying
> > around on my file system for importing. Similarly, it poses the problem
> > of editing; in order to edit data in such an implementation I will need
> > to export the big chunk, edit, and re-import. That is sort of klunky,
> > but still it is what I was considering.

Let me also mention eXist[1] as an XML database that supports XPath
and XQuery.  Throwing Beans mentions some of eXist's advantages in a
posting from September [2].

In my mind, searching large swaths of full-text is rather different
than searching structured metadata, especially on controlled
vocabulary index points.  One idea that comes to mind is using either
the Throwing Beans/PostgreSQL or eXist solutions for storage and
XPath-based access, and creating full-text indices with Lucene[3].  (I
mention Lucene because I recall reading a few months back that eXist
and Sleepycat were optimized for XPath queries and so might not
perform adequately in full-text searches.)  In developing a Lucene
index of a corpus of TEI texts, I suspect that for each indexed chunk
(chapter, section, paragraph) you could store the XPath expression
pointing to that chunk as a field returned with each hit.  Do
full-text searches against Lucene, but pull actual text to be
transformed from eXist, using XPath expressions returned from Lucene.
Browsable representations of document structures would, I think, also
be easier to get from eXist or PgSQL/Throwing Beans than from Lucene.
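To make the chunk-plus-XPath idea concrete, here is an untested sketch in Python (standard library only, since Lucene itself is a Java library) that walks a TEI-like document and emits, for each chapter/section/paragraph, its text together with a positional XPath locating it.  Those (xpath, text) pairs are what you would feed to Lucene as stored fields; the XPath comes back with each hit and gets handed to eXist to pull the chunk for transformation.  The element names ("div", "p") are assumptions about the markup:

```python
import xml.etree.ElementTree as ET

def chunks_with_xpaths(root):
    """Yield (xpath, text) for every div and p element, using
    positional predicates so each path is unambiguous."""
    def walk(elem, path):
        counts = {}  # per-parent tally so siblings get [1], [2], ...
        for child in elem:
            counts[child.tag] = counts.get(child.tag, 0) + 1
            child_path = "%s/%s[%d]" % (path, child.tag, counts[child.tag])
            if child.tag in ("div", "p"):
                # Collapse whitespace in the chunk's full text content.
                yield (child_path, " ".join(" ".join(child.itertext()).split()))
            for pair in walk(child, child_path):
                yield pair
    return walk(root, "/" + root.tag)

# Toy TEI-ish document for illustration.
tei = ET.fromstring(
    "<TEI><text><body>"
    "<div><p>First paragraph.</p><p>Second paragraph.</p></div>"
    "</body></text></TEI>")

for xpath, text in chunks_with_xpaths(tei):
    print(xpath, "->", text)
```

A real index would also want a document id alongside the XPath, so hits can be resolved against the right file in eXist.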

You could probably just store the documents in eXist, and spit them
out temporarily to let Lucene index them.  I am not sure that Lucene
would permit you to selectively re-index corrected documents, though.
You might have to spit out and re-index all the docs after posting a
corrected version back to eXist.
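Whatever the engine permits, the bookkeeping for selective re-indexing can be sketched engine-neutrally: if each chunk is keyed by document id plus XPath, re-indexing a corrected document means dropping its old entries and adding fresh ones, leaving the rest of the corpus alone.  A toy Python dict stands in for the index here; the ids and paths are made up:

```python
index = {}  # (doc_id, xpath) -> chunk text

def index_document(doc_id, chunks):
    """Re-index one document; chunks is an iterable of (xpath, text) pairs."""
    # Drop any stale entries for this document first.
    for key in [k for k in index if k[0] == doc_id]:
        del index[key]
    for xpath, text in chunks:
        index[(doc_id, xpath)] = text

# Index a document, then re-index a corrected version of it.
index_document("hamlet", [("/TEI/text[1]/body[1]/div[1]/p[1]",
                           "To be, or nto to be")])
index_document("hamlet", [("/TEI/text[1]/body[1]/div[1]/p[1]",
                           "To be, or not to be")])
```

After the second call only the corrected chunk remains; a delete-then-add pattern like this is how per-document updates are commonly handled.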

I suspect that a system like this could also be made to work for
collections of EAD finding aids, though perhaps they have less need of
heavy-duty full-text searching than literary or philosophical texts.

Just a few thoroughly untried ideas!
Chuck

[1] <http://exist.sourceforge.net/>
[2] <http://www.throwingbeans.org/tech/xml_databases_with_exist_and_coldfusion.html#000048>
[3] <http://jakarta.apache.org/lucene/docs/index.html>