One other variation with Lucene is to put a relational database underneath Cocoon and index a view of the content that pulls out the XML in the blob along with any other data in the database tables that fits. I think this would let you use Cocoon's scheduler to keep the index up to date, use database pooling and caching for throughput, and insert other kinds of content into the pipeline if it made sense, e.g. comments from a website. It used to be 30 to 40% slower to deliver images from MySQL as a blob than directly from disk, which might argue for pooling and caching for whatever blob-like field holds something like EAD content, though I haven't seen figures on this in a long time and network latency probably obliterates all other factors anyway.

art
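
To make the "index a view" idea a bit more concrete, here is a rough sketch of the database side only. It is an illustration under assumptions, not a Cocoon or Lucene configuration: SQLite stands in for MySQL so the example runs anywhere, and the table, column, and view names (`documents`, `metadata`, `index_view`) and the function names are all hypothetical. The rows it produces are the sort of thing a scheduled job could hand to the indexer.

```python
import sqlite3
import xml.etree.ElementTree as ET


def build_index_rows(conn):
    """Yield (doc_id, title, text) tuples from the hypothetical index_view.

    The text is the concatenated character data of the XML stored in the blob,
    i.e. the part a full-text indexer would actually want to see.
    """
    query = "SELECT id, title, xml_blob FROM index_view"
    for doc_id, title, blob in conn.execute(query):
        root = ET.fromstring(blob)  # accepts the bytes returned for a BLOB
        text = " ".join(t.strip() for t in root.itertext() if t.strip())
        yield doc_id, title, text


def demo():
    """Build a throwaway in-memory database and return the rows to index."""
    conn = sqlite3.connect(":memory:")  # SQLite is an assumption, not MySQL
    conn.executescript(
        """
        CREATE TABLE documents (id INTEGER PRIMARY KEY, xml_blob BLOB);
        CREATE TABLE metadata (doc_id INTEGER, title TEXT);
        -- The view joins the XML blob with whatever relational data fits.
        CREATE VIEW index_view AS
            SELECT d.id AS id, m.title AS title, d.xml_blob AS xml_blob
            FROM documents d JOIN metadata m ON m.doc_id = d.id;
        """
    )
    sample = b"<ead><unittitle>Smith papers</unittitle><p>Letters, 1900-1910</p></ead>"
    conn.execute("INSERT INTO documents VALUES (1, ?)", (sample,))
    conn.execute("INSERT INTO metadata VALUES (1, 'Smith family papers')")
    rows = list(build_index_rows(conn))
    conn.close()
    return rows


if __name__ == "__main__":
    for row in demo():
        print(row)
```

Other content, such as website comments, could be brought in by widening the view with another join rather than changing the extraction code, which is the main attraction of indexing a view instead of the raw tables.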