OK, my last reply to a reply today, I swear, for this list at least :-)

>> * the indexer allows control of how the data is normalized
> Is this the indexers job?  I say not.

I probably should have said that "the indexer does not preclude special
normalization", and for that matter, I probably should not have used
"normalized" as a descriptive term at all, since it brings in all sorts of
semantic baggage that is not related. By "normalize", I mean controlling how
stemming and things like accents and special characters are handled, and
by being helpful to this process, I mean that the indexer is flexible, not
that it should necessarily step in and do all this work. There's a really
interesting presentation from ApacheCon 2005 about the Center for Natural
Language Processing's (CNLP) work with Lucene
<http://www.cnlp.org/apachecon2005/>, and there is some description of how
they handle multiple languages, so what I probably really meant is
"supporting more of what CNLP is doing", but that would have implied I
understood it much better than I do. Anyway, Lucene does brilliantly on
all this.
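
To make "the indexer is flexible" a bit more concrete, here is a rough
sketch in Python (not Lucene code, and not how CNLP does it). The names
fold, analyze, and build_index are made up for illustration. The point is
only that the indexer takes the normalization step as a parameter instead
of hard-coding it.

```python
import unicodedata
from collections import defaultdict


def fold(text):
    """Strip accents and case so that 'Café' and 'cafe' compare equal."""
    # NFKD splits an accented letter into base letter + combining mark.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks, keep everything else.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


def analyze(text):
    """Hypothetical analyzer: accent/case folding, then whitespace tokens."""
    return fold(text).split()


def build_index(docs, analyzer):
    """Toy inverted index; the caller decides how text is normalized."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in analyzer(text):
            index[token].add(doc_id)
    return index
```

Swapping in a different analyzer (one that stems, say, or one that leaves
accents alone for a language where they matter) changes what ends up in
the index without touching build_index at all.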

>The better granularity to consider is the interface
>to the index, like SRU or Solr's custom interface, etc.  And the
>library world already has these standards in place that could easily
>be put on top of Lucene or Solr.

Well, sure, that's good too, but if I wanted to limit searches to journals
the library subscribes to in, say, the 7 million articles in the Scopus
dataset, I could see combining a Lucene index of that content, or at least
the holding identification part of that content, with a library index, and
then limiting searches to content that is available, or not available, or
both, and so on. In another lifetime, I used to work with a system called
SPIRES, which stored indexes in their own databases, and these kinds of
combinations were possible even when the Berlin Wall was still standing.
The library world has standards for identifying the availability of
individual items, and these rock, but I don't think there is that much for
collection scoping. Of course, if I could get the Scopus content and such
directly, this probably wouldn't matter, but I really wonder if there's
some index construct that could possibly be sharable among all content
providers without round-tripping to check on the status of each item.

Of course, there are probably political/legal issues in all this that make
the technology involved seem trivial...

art