This is why I think we should figure out smart ways to manage facets
independently of Lucene index fields. Solr populates a facet by setting
up a bitset for every value found in a given index field, and it uses
those bitsets to filter query result sets by deriving an intersection
set. We can extend that functionality by populating and maintaining
bitsets based on external data sources that can map to Lucene document
ids. This allows us to update the bitset (relatively cheap) without
having to update the index (relatively expensive).
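
To make the bitset idea concrete, here is a minimal sketch in Python. It is not Solr's actual code: it models each bitset as a plain Python integer, and the names (build_facet, to_bitset, doc_ids, the circ_status data) are made up for illustration. The only assumption is that an external source can give us a Lucene document id for each value.

```python
from collections import defaultdict


def build_facet(doc_values):
    """Map each facet value to a bitset (a Python int) of document ids."""
    bitsets = defaultdict(int)
    for doc_id, value in doc_values.items():
        bitsets[value] |= 1 << doc_id  # set the bit for this document
    return dict(bitsets)


def to_bitset(ids):
    """Turn an iterable of document ids into a bitset."""
    bits = 0
    for doc_id in ids:
        bits |= 1 << doc_id
    return bits


def doc_ids(bits):
    """List the document ids whose bits are set, in ascending order."""
    ids = []
    doc_id = 0
    while bits:
        if bits & 1:
            ids.append(doc_id)
        bits >>= 1
        doc_id += 1
    return ids


# Hypothetical external data: Lucene document id -> circ status.
circ_status = {0: "available", 1: "checked_out", 2: "available", 3: "checked_out"}
facet = build_facet(circ_status)

# Pretend a query matched documents 1, 2 and 3.
results = to_bitset([1, 2, 3])

# Filtering by a facet value is a bitwise AND (set intersection).
available_hits = doc_ids(results & facet["available"])
```

The intersection touches only the bitsets, never the index, which is why updating one is comparatively cheap.
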
We could use this for those attributes that change relatively often,
like circ status or user-applied tags (or even full-text in Roy's
environment). When something like this changes, we look up the Lucene
document id and add it to or delete it from the relevant bitset. We
might also update a quick external datastore like a MySQL db that's
dedicated to handling these dynamic facets, so we can rebuild the facet
from scratch when we need to. That way we avoid having to refetch and
reindex the bib record into Lucene via Solr every time a dynamic
attribute changes (since you can't update a single field in a Lucene
index).
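
A rough sketch of that update path follows, again in Python. It assumes sqlite3 as a stand-in for the dedicated MySQL db, and the DynamicFacet class and its table layout are hypothetical. Each change is written to the datastore and applied to the in-memory bitset, and rebuild() recreates the bitsets from the datastore alone.

```python
import sqlite3


class DynamicFacet:
    """Bitsets kept in memory, backed by a small table for rebuilds."""

    def __init__(self, conn):
        self.conn = conn
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS facet ("
            "doc_id INTEGER, value TEXT, PRIMARY KEY (doc_id, value))"
        )
        self.bitsets = {}  # facet value -> bitset stored as an int

    def add(self, doc_id, value):
        # Record the change durably, then set the bit in memory.
        self.conn.execute(
            "INSERT OR IGNORE INTO facet VALUES (?, ?)", (doc_id, value)
        )
        self.bitsets[value] = self.bitsets.get(value, 0) | (1 << doc_id)

    def remove(self, doc_id, value):
        # Delete the row, then clear the bit in memory.
        self.conn.execute(
            "DELETE FROM facet WHERE doc_id = ? AND value = ?", (doc_id, value)
        )
        self.bitsets[value] = self.bitsets.get(value, 0) & ~(1 << doc_id)

    def rebuild(self):
        # Recreate every bitset from the datastore, without touching the index.
        self.bitsets = {}
        for doc_id, value in self.conn.execute("SELECT doc_id, value FROM facet"):
            self.bitsets[value] = self.bitsets.get(value, 0) | (1 << doc_id)


facet = DynamicFacet(sqlite3.connect(":memory:"))
facet.add(7, "available")
facet.add(9, "available")
facet.remove(7, "available")  # document 7 was checked out

before = dict(facet.bitsets)
facet.rebuild()
after = dict(facet.bitsets)
```

In this sketch a circ status change costs one row write and one bit flip; the bib record is never refetched or reindexed.
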
I'm assuming that those frequent updates to the Lucene index are enough
overhead to be worth avoiding; that will need to be confirmed in
practice. My experience with Solr is in a project where we're indexing
full text along with bib metadata, so reindexing (potentially hundreds
of pages of text) is something we definitely want to avoid. How do
people expect this to play out with bib records without full text?
Peter
-----Original Message-----
From: Code for Libraries [mailto:[log in to unmask]] On Behalf Of
Roy Tennant
Sent: Friday, January 19, 2007 11:29 AM
To: [log in to unmask]
Subject: Re: [CODE4LIB] Limiting by availability (was Re: [CODE4LIB]
Getting data from Voyager into XML?)
On 1/19/07 9:26 AM, "Steve Toub" <[log in to unmask]> wrote:
> Also, as a possible sweet-spot, I'm wondering if it's practical to do
> post-search winnowing by availability after doing the FCLA-style
> real-time query, by doing indexing on the fly of the responses from
> the real-time queries for that particular search.
Interesting idea if done on a screen-by-screen basis. That is, you
simply don't display to the user those that aren't available. I've
thought about this same strategy for a "full-text" filter. That is, you
bring back all the results, but if the user only wants items that have
full-text available, you filter out those that don't as you build the
screen display. This of course has a hit on response time, but with APIs
that allow multiple-item lookups, it is at least not as bad as it could
be.
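
The screen-by-screen filter described above could be sketched like this in Python. The function name filter_page and the lookup_availability callback are hypothetical; the callback stands in for whatever multiple-item lookup API is available, and is assumed to take a list of ids and return a dict mapping each id to True or False.

```python
def filter_page(results, page_size, lookup_availability, batch_size=20):
    """Collect one screen of available items, looking up status in batches."""
    page = []
    for start in range(0, len(results), batch_size):
        batch = results[start:start + batch_size]
        status = lookup_availability(batch)  # one call covers many items
        for item_id in batch:
            if status.get(item_id):
                page.append(item_id)
                if len(page) == page_size:
                    return page  # screen is full, stop looking things up
    return page


# Fake lookup for illustration; a real one would query the circ system.
AVAILABLE = {"b2", "b3", "b5", "b6"}
calls = []


def fake_lookup(ids):
    calls.append(list(ids))
    return {item_id: item_id in AVAILABLE for item_id in ids}


hits = ["b1", "b2", "b3", "b4", "b5", "b6"]
first_screen = filter_page(hits, 3, fake_lookup, batch_size=2)
```

The response-time hit grows with the number of lookup calls rather than the number of items, which is where multiple-item lookups help.
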
Roy