> On May 30, 2006, at 5:34 PM, K.G. Schneider wrote:
> Similarly, exploiting a thesaurus (an authority list, a controlled
> vocabulary, or whatever you want to call it) works in the same way.
> Use it to reformulate and/or suggest alternative queries to be
> applied to the index. Thesauri are not an integral part of the
> indexer, per se, but the interface as a whole. Faceted browsing is
> just fancy way of using controlled vocabularies to access the content
> in the index.

Hmm. Generally I agree with your points, but in the above I say, not

You seem to be assuming a particular model of the search engine, that is
largely based on full-text keyword indexing. The thesaurus you are talking
about is a search thesaurus or an 'end user' thesaurus, which just
suggests (or automatically uses) alternate terms---applied to that full
text keyword search. And in that scenario, "faceted browsing is just a
fancy way of using controlled vocabularies to access the content in the
index"---you seem to be implying that faceted browsing just puts a fancy
interface on top of functionality that will still be looking up words in an
unstructured keyword index. Yes, in that model.
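
To make that model concrete, here is a rough sketch in Python of a search
thesaurus sitting in front of a plain keyword index. Everything in it (the
SEARCH_THESAURUS table, the function names, the sample documents) is made
up for illustration and is not taken from any particular product.

```python
from collections import defaultdict

# Hypothetical end-user thesaurus: query word -> alternate words.
SEARCH_THESAURUS = {
    "cars": ["automobiles"],
}

def build_keyword_index(docs):
    """Unstructured full-text index: word -> set of document ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def expand_query(word):
    """Reformulate a query word as itself plus its thesaurus alternates."""
    word = word.lower()
    return [word] + SEARCH_THESAURUS.get(word, [])

def search(index, word):
    """OR together the postings for every expanded word."""
    hits = set()
    for w in expand_query(word):
        hits |= index.get(w, set())
    return hits

docs = {
    1: "used cars for sale",
    2: "history of automobiles",
    3: "bicycle repair",
}
index = build_keyword_index(docs)
print(sorted(search(index, "cars")))  # [1, 2]
```

Note that the thesaurus never touches the index here; it only rewrites the
query, which is why it can live entirely in the interface layer.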

But that's not the only model.  An 'indexing' thesaurus or controlled
vocabulary can be applied at the time of indexing.  It can be applied by
a machine algorithm, in which case it would certainly be part of the
indexer/searcher.  (There are various machine classification or clustering
algorithms, somewhat experimental at this point, that an indexer could
support, and that are generally beyond the reach of an interface to
provide without indexer support.)  Or, more traditionally, the indexing
controlled vocabulary can be applied by a human---but even in this case,
you want the assigned terms to be captured in a _structured_ way so
searches can be done on the controlled vocabulary itself, rather than
mixing together controlled terms and uncontrolled terms in one big keyword
search (which is in fact what some search products do, including some
rather popular ones).  For that matter, field-specific storage of terms is
necessary for any kind of faceted browsing or field-specific search.
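
As a minimal sketch of what I mean by structured, field-specific storage,
assume a toy index that keeps a separate set of posting lists per field.
The FieldedIndex class, the field names "subject" and "keyword", and the
sample records are all hypothetical, chosen only to show the difference.

```python
from collections import Counter, defaultdict

class FieldedIndex:
    """Toy index that keeps each field's terms in its own posting lists."""

    def __init__(self):
        # field name -> term -> set of document ids
        self.postings = defaultdict(lambda: defaultdict(set))

    def add(self, doc_id, field, terms):
        for term in terms:
            self.postings[field][term].add(doc_id)

    def search(self, field, term):
        """Look a term up in one field only."""
        return set(self.postings[field].get(term, set()))

    def facet_counts(self, field, doc_ids):
        """Count how many of doc_ids carry each term in the given field."""
        counts = Counter()
        for term, ids in self.postings[field].items():
            n = len(ids & doc_ids)
            if n:
                counts[term] = n
        return counts

idx = FieldedIndex()
# Record 1 is about cats; record 2 only mentions the word.
idx.add(1, "subject", ["Cats"])
idx.add(1, "keyword", "caring for cats".split())
idx.add(2, "subject", ["Musicals"])
idx.add(2, "keyword", "a review of cats the musical".split())

print(sorted(idx.search("keyword", "cats")))   # [1, 2]
print(sorted(idx.search("subject", "Cats")))   # [1]
print(dict(idx.facet_counts("subject", {1, 2})))
```

The facet counts are only possible because the assigned subject terms were
stored apart from the free-text words; a single merged keyword index would
have nothing to count.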

A controlled vocabulary that allows multi-word terms requires an
indexer/searcher that will respect them as multi-word terms, instead of
splitting them up into separate words in a single-word index (I am
thinking of a particular popular digital repository product which has
problems here).  A complex pre-coordinated (semi-)faceted controlled
vocabulary with thesaural relationships (like LCSH) introduces all sorts
of possibilities for browsing, navigation, and expansion/refinement of
searches you want your interface to support (which few do in the case of
LCSH, of course). Your interface can only support it if the indexer/search
layer provides robust enough tools---but even there, it might be a lot
more efficient and easier to support at the indexing stage than by
attempting to use complicated algebraic Boolean and stemming expressions
to get at what you really want.
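
Here is a small sketch of the indexing-stage approach, under some loud
assumptions: the BROADER table is an invented three-term hierarchy (it is
not real LCSH data), the function names are hypothetical, and the
hierarchy is assumed to contain no cycles. Headings are indexed whole,
never split on spaces, and each record is also posted under the broader
terms of its headings at index time.

```python
from collections import defaultdict

# Hypothetical broader-term links (illustrative only, not real LCSH data).
BROADER = {
    "Guide dogs": "Working dogs",
    "Working dogs": "Dogs",
}

def with_broader_terms(term):
    """Return the term followed by all of its ancestors in BROADER."""
    chain = [term]
    while chain[-1] in BROADER:  # assumes the hierarchy has no cycles
        chain.append(BROADER[chain[-1]])
    return chain

def index_subjects(records):
    """Index whole headings, plus their broader terms, by document id."""
    exact = defaultdict(set)     # heading as assigned -> doc ids
    expanded = defaultdict(set)  # heading or any ancestor -> doc ids
    for doc_id, headings in records.items():
        for heading in headings:
            exact[heading].add(doc_id)
            for term in with_broader_terms(heading):
                expanded[term].add(doc_id)
    return exact, expanded

records = {1: ["Guide dogs"], 2: ["Working dogs"], 3: ["Hot dogs"]}
exact, expanded = index_subjects(records)

print(sorted(expanded["Dogs"]))          # [1, 2]
print(sorted(exact["Working dogs"]))     # [2]
print(sorted(expanded["Working dogs"]))  # [1, 2]
```

Because "Hot dogs" is kept as one term, it never matches a search on
"Dogs", and the search for narrower material is a single lookup rather
than a long boolean expression built up in the interface.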

I agree with your general point that the indexer/searcher and the
interface should be divided into separate layers, decoupled as much as
possible, and we should keep them straight in our heads. But in fact, they
are interdependent, and the dependencies go both ways. The features of the
'lower' level do indeed constrain what can be done at the interface level
(and how conveniently it can be done)---and I'd suggest what we _want_ to
do at the interface level should drive the features of the indexer/search
engine---not the other way around.


> Boolean operators, fielded searching, phrase searching, and relevance
> ranking are a part of indexers/search engines, but just because they
> work against the indexer does not mean you use their particular
> syntax in the user interface. Yet again, you create the interface and
> translate that into the language of the indexer/search engine. (That
> is *really* what Z39.50 and SRW/U are for.) Not the other way around.
> --
> Eric Morgan