OK, my last reply to a reply today, I swear, for this list at least :-)

>> * the indexer allows control of how the data is normalized
> Is this the indexers job? I say not.

I probably should have said that "the indexer does not preclude special normalization", and for that matter, I probably should not have used "normalized" as a descriptive term at all, since it brings in all sorts of semantic baggage that is not related. By normalize, I mean controlling how stemming and things like accents and special characters are handled, and by being helpful to this process, I mean that the indexer is flexible, not that it should necessarily step in and do all this work.

There's a really interesting presentation from ApacheCon 2005 about the Center for Natural Language Processing's (CNLP) work with Lucene <http://www.cnlp.org/apachecon2005/>, and there is some description of how they handle multiple languages, so what I probably really meant is "supporting more of what CNLP is doing", but that would have implied I understood it much better than I do. Anyway, Lucene does brilliantly on all this.

>The better granularity to consider is the interface
>to the index, like SRU or Solr's custom interface, etc. And the
>library world already has these standards in place that could easily
>be put on top of Lucene or Solr.

Well, sure, that's good too, but if I wanted to limit searches to journals the library subscribes to in, say, the 7 million articles in the Scopus dataset, I could see combining a Lucene index of that content, or at least the holding identification part of that content, with a library index, and then limiting searches to content that is available, or not available, or both, and so on. In another lifetime, I used to work with a system called SPIRES, which stored indexes in their own databases, and these kinds of combinations were possible even when the Berlin Wall was still standing.
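
To make the kind of combination I mean a little more concrete, here is a rough sketch in plain Python. It is not Lucene code and not how SPIRES did it: the function name filter_by_holdings, the record layout, and the ISSN values are all made up for illustration, and a real system would presumably do this inside the index rather than on a list of hits.

```python
from typing import Dict, Iterable, List, Set

# Hypothetical search hits from an article index; titles and ISSNs are invented.
ARTICLE_HITS = [
    {"title": "Article A", "issn": "0000-0001"},
    {"title": "Article B", "issn": "0000-0002"},
    {"title": "Article C", "issn": "0000-0003"},
]

# Hypothetical set of ISSNs the library subscribes to (the "library index").
HELD_ISSNS = {"0000-0001", "0000-0003"}


def filter_by_holdings(
    hits: Iterable[Dict[str, str]],
    held_issns: Set[str],
    scope: str = "available",
) -> List[Dict[str, str]]:
    """Limit hits to held content, unheld content, or both."""
    if scope not in ("available", "unavailable", "both"):
        raise ValueError("scope must be 'available', 'unavailable', or 'both'")
    result = []
    for hit in hits:
        held = hit["issn"] in held_issns
        # Keep everything for "both"; otherwise keep the hit only when its
        # held status matches the requested scope.
        if scope == "both" or (scope == "available") == held:
            result.append(hit)
    return result


available = filter_by_holdings(ARTICLE_HITS, HELD_ISSNS, "available")
unavailable = filter_by_holdings(ARTICLE_HITS, HELD_ISSNS, "unavailable")
```

The point is only that the holdings side can be a small, separate structure keyed on an identifier, so the scoping happens as a set lookup instead of a per-item availability check.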

The library world has standards for identifying the availability of individual items, and these rock, but I don't think there is that much for collection scoping. Of course, if I could get the Scopus content and such directly, this probably wouldn't matter, but I really wonder if there's some index construct that could possibly be sharable among all content providers without round-tripping to check on the status of each item. Of course, there are probably political/legal issues in all this that make the technology involved seem trivial...

art