LISTSERV 16.5 - CODE4LIB Archives

On 11/28/06, Erik Hatcher wrote:

> Is there a standard for specifying how textual analysis works as
> well, so that tokenization can be standardized across these XQuery
> engines as well?

Not that I know of.  What I've seen so far is that tokenization is
implementation specific.  Perhaps it is something that could be made
configurable, so that implementations can be set up the same way and
then queried consistently.  Any indexing engine worth its salt should
be configurable, I'd think.  There is nothing I'm aware of in the
fulltext work, though, that defines how things are indexed.
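
To make the "implementation specific" point concrete, here is a rough
sketch in plain Java.  It is not Lucene's API or any XQuery engine's
analyzer; the class and method names are made up for illustration.  It
only shows how two reasonable tokenizers can disagree on the same text,
which is why the same query can match differently across engines.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical demo class; not part of Lucene or any XQuery engine.
class TokenizerDemo {

    // Analyzer A: split on whitespace only; case and punctuation are kept.
    static List<String> whitespaceTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    // Analyzer B: lowercase, then split on any run of characters
    // that are neither letters nor decimal digits.
    static List<String> normalizedTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "XQuery Full-Text, take two";
        System.out.println(whitespaceTokens(text)); // [XQuery, Full-Text,, take, two]
        System.out.println(normalizedTokens(text)); // [xquery, full, text, take, two]
    }
}
```

A search for "text" would hit under the second tokenizer and miss under
the first, so the same data and query give different results.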

> That's an easy bet... of course Lucene will be part of it.  It's
> already implemented as extensions to XQuery engines (Nux, I know of,
> and surely others).

As you can tell, I'm not really a gambler :-)

Our native XML database vendor has committed to the fulltext spec
(once it becomes a spec) and since they are using Lucene already I'd
say I don't have anything to worry about.

Interestingly, as a side note, a quick search turned up an eXist
presentation from Prague06 saying that eXist's text analysis classes
would be replaced by a "modular analyzer provided by Apache's Lucene."
Neat.

All this talk is just me looking forward (with optimism).  It is
possible to use fulltext with XQuery now either through an
intermediary layer like we currently have (Lucene search is done and
the results passed to XQuery and our native XML database for retrieval
and munging) or by creating fulltext extensions (like eXist db and our
native XML database vendor have done).
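
For what it's worth, the intermediary-layer approach boils down to
something like the sketch below.  This is only an illustration under
assumptions: the in-memory map stands in for the Lucene search step,
the document URIs are invented, and the generated query assumes the
database can resolve them with the standard doc() function.  None of
the names come from our actual code.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical sketch of "search first, then hand the hits to XQuery".
class SearchThenXQuery {

    // Stand-in for the full-text search step: look up a term and
    // return the URIs of matching documents (empty list if none).
    static List<String> search(Map<String, List<String>> index, String term) {
        return index.getOrDefault(term.toLowerCase(Locale.ROOT), List.of());
    }

    // Build an XQuery expression that retrieves the matching documents.
    // In XQuery string literals, '&' must be written as an entity and
    // a double quote is escaped by doubling it.
    static String buildXQuery(List<String> uris) {
        StringJoiner seq = new StringJoiner(", ", "(", ")");
        for (String uri : uris) {
            String escaped = uri.replace("&", "&amp;").replace("\"", "\"\"");
            seq.add("\"" + escaped + "\"");
        }
        return "for $id in " + seq + " return doc($id)";
    }

    public static void main(String[] args) {
        // Assumed sample data; a real index would be built by the search engine.
        Map<String, List<String>> index =
                Map.of("lucene", List.of("a.xml", "b.xml"));
        System.out.println(buildXQuery(search(index, "Lucene")));
    }
}
```

The XQuery side never sees the search terms, only the list of hits,
which is what makes this quick to build but also why tokenization and
ranking live entirely outside the query language.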

Personally, I wish we had taken the extension route, but it was just
quicker for me to do something in Java and chain the search and XQuery
servlets than to add the extra extension layer through our XQuery
processor.  Quicker isn't always better/cleaner/nicer, though...

Kevin