At a full bib page (the ultimate destination of most successful catalog
searches here), we want to be able to display links to other versions of
the same work (audiobooks, large print, possibly translations). OCLC's
xISBN service gives us something like the raw links we need, but we want
to refine the process for a number of reasons.
First, we don't want to incur an OCLC hit for every full-bib page view.
'Nuff said.
Second, we want to avoid doing unnecessary calculations for every page
hit -- almost all of the numbers returned by the xISBN service will not
match anything in our catalog.
So the ideal use case for our xISBN cache is that we would be querying
only a local database, and that database would only return ISBNs (or bib
numbers) of other editions which are actually in our catalog.
There is a possibility that we will add new versions of an existing
item, so we should be prepared to verify periodically that the xISBN
cache is accurate. But for the most part, we should be able to do a
local query that will quickly return linking information (ISBN or
Horizon bib) that will identify every version in our catalog. (And
ideally, these versions would be summarized as "large print", "abridged
audiobook", etc. as on Amazon.)
It doesn't much matter how much storage we use on the local database.
At 16 bytes or less per ISBN, a database of 700,000 items would only run
to about 12MB. But if we cross-reference every ISBN to every other,
we'll have a factorial number of rows, which is probably less than
optimal. On the other hand, if we do a histogram and find that 90% of
all items have no other versions in the catalog, perhaps this would be
an acceptable solution.
My guess is that we should have a row for each ISBN in the system, along
with a column that links that ISBN to some common identifier that will
symbolically mean e.g. "SQL for Dummies, any version". We could then
ask "What is the identifier for the first ISBN of the item being
displayed?" and then do a second query that asks "What other ISBNs in
our catalog have that identifier?"
I'm most concerned with choosing a good identifier. If there's a way to
uniquely identify a FRBR work, then surely that identifier would be the
best key to use. If not, perhaps someone could suggest something?
The only remaining question, I think, is whether we should have one
table with ISBN as primary key and identifier as a second column (one
row per ISBN), and then another table with identifier as primary key and
ISBN as second column (again, one row per ISBN). This could make our
queries faster at the cost of doubling storage space. On the other
hand, perhaps we don't need to optimize that far. It depends on how
mySQL reacts to searching the entire table for a non-primary-key column
value. I'll have to fiddle with it and find out, unless one of you
knows already.
Then the next issue is how to inject that into the PAC, but DaveyP used
Ajax for that, so I think we're OK on that.
Any thoughts, folks?
Ben Ostrowsky
|