Another variation with Lucene is to put a relational database underneath
Cocoon and index a view of the content that pulls the XML out of the
blob along with any other data in the database tables that fits. I think
this would let you use Cocoon's scheduler to keep the index up to date,
use database pooling and caching for throughput, and insert other kinds
of content into the pipeline where it makes sense, e.g. comments from a
website.
It used to be 30 to 40% slower to deliver images from MySQL as a blob than
directly from disk, which might argue for pooling and caching whatever
blob-like field holds something like EAD content. I haven't seen figures
on this in a long time, though, and network latency probably swamps all
other factors anyway.
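
To make the "index a view" idea a bit more concrete, here is a rough sketch
in plain Java of flattening one database row (an XML blob plus some
comments from another table) into named text fields that an indexer could
consume. Everything in it is an assumption for illustration: the class
name IndexViewSketch, the toFields helper, the field names "id", "content"
and "comments", and the sample record are all made up, and it does not
call any Lucene or Cocoon API. In a real setup the row would come from a
JDBC query against the view and the resulting fields would be handed to
Lucene.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical sketch: turns one row of a database view into indexable fields.
class IndexViewSketch {

    // Walks the DOM and appends every non-empty text node, separated by spaces,
    // so that text from adjacent elements does not run together.
    static void collectText(Node node, StringBuilder out) {
        short type = node.getNodeType();
        if (type == Node.TEXT_NODE || type == Node.CDATA_SECTION_NODE) {
            String text = node.getNodeValue().trim();
            if (!text.isEmpty()) {
                if (out.length() > 0) {
                    out.append(' ');
                }
                out.append(text);
            }
            return;
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            collectText(children.item(i), out);
        }
    }

    // Parses the XML blob and returns its text content as one string.
    static String textOf(byte[] xmlBlob) throws Exception {
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new ByteArrayInputStream(xmlBlob));
        StringBuilder out = new StringBuilder();
        collectText(doc.getDocumentElement(), out);
        return out.toString();
    }

    // Combines the blob text with data from other tables (here: comments)
    // into a field map. The field names are assumptions, not a fixed schema.
    static Map<String, String> toFields(String id, byte[] xmlBlob,
                                        List<String> comments) throws Exception {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("id", id);
        fields.put("content", textOf(xmlBlob));
        fields.put("comments", String.join(" ", comments));
        return fields;
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample row standing in for a result from the database view.
        byte[] blob = "<ead><title>Smith papers</title><scope>Letters</scope></ead>"
                .getBytes(StandardCharsets.UTF_8);
        Map<String, String> fields =
                toFields("ead-001", blob, Arrays.asList("Very useful finding aid"));
        fields.forEach((name, value) -> System.out.println(name + ": " + value));
    }
}
```

Keeping the flattening step separate from the indexer means the same field
map could be rebuilt by a scheduled job whenever a row changes, and other
content could be added as extra fields without touching the XML itself.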
art