LISTSERV 16.5 - CODE4LIB Archives

On Nov 29, 2006, at 10:27 AM, Art Rhyno wrote:
> I am so behind in e-mail that I might be treading on ground that is worn
> out on this, but I would add to Eric's list that I don't care about the
> indexer if:

Here's how Lucene/Solr fares on these points:

> * the indexer has an open and configurable relevancy weighting algorithm

Relevancy in Lucene can be adjusted in a number of ways, including
boosts and a customized Similarity.
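
To make the idea of boosts concrete, here is a rough Python sketch of
how a per-field boost shifts ranking.  It is a toy TF-IDF scorer and
not Lucene's actual formula or API; the FIELD_BOOSTS values, the score()
helper, and the sample documents are all made up for illustration.

```python
import math
from collections import Counter

# Hypothetical boosts: a hit in the title counts three times a hit in the body.
FIELD_BOOSTS = {"title": 3.0, "body": 1.0}


def tokenize(text):
    """Lowercase and split on whitespace (deliberately naive)."""
    return text.lower().split()


def score(query, doc, docs, boosts=FIELD_BOOSTS):
    """Toy boosted TF-IDF score; an illustration, not Lucene's Similarity."""
    total = 0.0
    for term in tokenize(query):
        # Document frequency: number of docs containing the term in any field.
        df = sum(
            1 for d in docs
            if any(term in tokenize(d.get(f, "")) for f in boosts)
        )
        if df == 0:
            continue
        idf = 1.0 + math.log(len(docs) / df)
        for field, boost in boosts.items():
            tf = Counter(tokenize(doc.get(field, "")))[term]
            total += boost * math.sqrt(tf) * idf
    return total


docs = [
    {"title": "Lucene in Action", "body": "a book about search"},
    {"title": "Search basics", "body": "lucene is mentioned once here"},
]
ranked = sorted(docs, key=lambda d: score("lucene", d, docs), reverse=True)
```

Swapping the two boost values flips the ranking, which is the whole
point: the weighting is a knob you can turn, not something baked in.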

> * the indexer allows control of how the data is normalized

Is this the indexer's job?  I say not.  Sure, we'd all love to have
everything including the kitchen sink hidden behind some drag-and-drop
interface, but it really isn't Solr's job to clean up data.  I'm not
quite sure what you mean by "normalized" though, so maybe I'm off
base?

> * the indexer uses pluggable parsers

Solr doesn't know MARC from Adam.

Again, I'd argue it isn't Solr's job to parse MARCXML.  It's a
full-text search engine, and overloading it to be more than that is
asking for trouble later when you do want to swap things out.

Maybe you mean tokenization rather than parsing though, in which case
Solr and Lucene certainly have great configurability.
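
If tokenization is what you mean, the general shape is a tokenizer
followed by a chain of filters, chosen per field.  Here is a minimal
Python sketch of that pattern.  It only mimics the idea; the function
names and the stop word list are invented for this example and are not
Solr or Lucene classes.

```python
import re

# Illustrative stop word list, not the one any real analyzer ships with.
STOP_WORDS = {"a", "an", "the", "of", "and"}


def whitespace_tokenizer(text):
    """Split on whitespace only, keeping punctuation inside tokens."""
    return text.split()


def word_tokenizer(text):
    """Split into runs of word characters, dropping punctuation."""
    return re.findall(r"\w+", text)


def lowercase_filter(tokens):
    return [t.lower() for t in tokens]


def stop_filter(tokens):
    return [t for t in tokens if t not in STOP_WORDS]


def make_analyzer(tokenizer, *filters):
    """Compose one tokenizer and any number of token filters."""
    def analyze(text):
        tokens = tokenizer(text)
        for token_filter in filters:
            tokens = token_filter(tokens)
        return tokens
    return analyze


# Different fields can get different chains.
title_analyzer = make_analyzer(word_tokenizer, lowercase_filter, stop_filter)
exact_analyzer = make_analyzer(whitespace_tokenizer)
```

So a title field might be lowercased and stripped of stop words, while
a call number field is left nearly untouched, and swapping a step in or
out is a configuration change rather than a code change.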

> * the indexer supports very fast retrieval

:)   But of course!

> then, on the preferred side:
>
> * the indexer allows the index process to effectively leverage commodity
> hardware

The beefier the better.

> * the indexer creates an index that can be combined with others

Solr may eventually federate with other Solr instances - that is on
the TODO list.  And there was recently a message from someone adding
an SRU/SRW interface to it.

> One of our most common comments when we do
> surveys of our user community is "don't show me what you can't deliver
> NOW". A world class indexer opens the door for scoping at the collection
> level, there doesn't have to be one solution for IR and it would be a very
> unhealthy ecosystem without variance, but I suspect it would be easier to
> convince a company like Elsevier that I want a lucene index for licensed
> content than almost any other technology offering. So a definite "yes" to
> SRU, OpenURL,  Z39.50, and the rest, but I wonder if sustaining a lucene
> index is a good idea regardless of what the main building blocks for a
> library's preferred IR layer turn out to be. Library standards don't tend
> to delve into the architecture of indexing anyway, but this is really
> where a lot of what can be delivered gets defined.

We discussed this in Windsor, but for everyone else's benefit: I
personally don't think sharing a Lucene "index" is the right
granularity to work with.  The specifics of the index format evolve
with each new version of Lucene (with backwards compatibility in
mind, for sure).  The better granularity to consider is the interface
to the index, like SRU or Solr's custom interface, etc.  And the
library world already has these standards in place that could easily
be put on top of Lucene or Solr.
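
As a sketch of what working at the interface level looks like, here
is how a client might build an SRU searchRetrieve request in Python.
The parameter names come from SRU 1.1; the base URL, the
sru_search_url() helper, and the dc.title index are assumptions for
illustration, since what a given server exposes will vary.

```python
from urllib.parse import urlencode


def sru_search_url(base_url, cql_query, max_records=10, start_record=1):
    """Build an SRU 1.1 searchRetrieve URL for a CQL query."""
    params = {
        "operation": "searchRetrieve",
        "version": "1.1",
        "query": cql_query,
        "startRecord": start_record,
        "maximumRecords": max_records,
    }
    return base_url + "?" + urlencode(params)


# Hypothetical endpoint; the client never needs to know what engine is behind it.
url = sru_search_url("http://example.org/sru", 'dc.title = "lucene"')
```

Nothing in that request says whether Lucene, Solr, or something else
answers it, so the index format can change underneath without anyone
on the other end noticing.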

        Erik