I've been out, and so haven't been following this, but I'll try to
explain what xISBN is doing.
The xISBN service runs from a fairly simple file. We generate clusters
of ISBNs based on FRBR work-set groupings and refine them so that each
ISBN is in only a single cluster.
Each ISBN in each cluster of size 2 or more is written to a file,
followed by the rest of the ISBNs in that cluster, sorted by number of
library holdings.
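
A minimal sketch of how such a file could be produced, for anyone who
wants to picture the layout. The names build_lines, clusters and
holdings are hypothetical, as are the space-separated format and the
most-held-first ordering; this is not the actual OCLC code.

```python
def build_lines(clusters, holdings):
    """Return one line per ISBN: the ISBN, then the rest of its cluster.

    clusters: list of lists of ISBN strings (each ISBN in one cluster only).
    holdings: dict mapping ISBN -> number of library holdings (assumed input).
    """
    lines = []
    for cluster in clusters:
        if len(cluster) < 2:
            continue  # clusters of size 1 are not written out
        for isbn in cluster:
            rest = [other for other in cluster if other != isbn]
            # Assumption: most widely held first; ties broken by ISBN.
            rest.sort(key=lambda other: (-holdings.get(other, 0), other))
            lines.append(" ".join([isbn] + rest))
    # The file is sorted by the leading ISBN so it can be binary searched.
    lines.sort(key=lambda line: line.split(" ", 1)[0])
    return lines
```

Writing every member of a cluster as its own line costs space, but it
means any ISBN can be looked up directly without a second table.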
That file is sorted by ISBN, and we currently read the whole thing
into memory as one big string. We search the string to find the proper
group using some simple indexes and binary search. Since all the
information is in memory and all that has to be done is to reformat it
into XML, the service is very fast (but the process sits there holding
several hundred megabytes of data).
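
Roughly, the lookup could look like the sketch below. It assumes one
newline-terminated record per ISBN with space-separated fields, and
index_lines and lookup are made-up names; the real service's indexes and
record layout may well differ.

```python
def index_lines(data):
    """Return the start offset of every line in the big string."""
    offsets = [0]
    pos = data.find("\n")
    while pos != -1 and pos + 1 < len(data):
        offsets.append(pos + 1)
        pos = data.find("\n", pos + 1)
    return offsets


def lookup(data, offsets, isbn):
    """Binary search the sorted lines; return the matching group or None."""
    lo, hi = 0, len(offsets) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        start = offsets[mid]
        end = data.find("\n", start)
        if end == -1:
            end = len(data)
        line = data[start:end]
        key = line.split(" ", 1)[0]  # the leading ISBN is the sort key
        if key == isbn:
            return line.split(" ")
        if key < isbn:
            lo = mid + 1
        else:
            hi = mid - 1
    return None
```

The comparison is a plain string comparison, so this only works if the
file was sorted with the same ordering.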
Hope that's a clear enough explanation of how we're currently doing it.
The main computation is the refinement of the initial clusters. It
takes around an hour to run on our 48-CPU cluster, about as long as the
initial FRBR grouping on WorldCat.
--Th
-----Original Message-----
From: Code for Libraries [mailto:[log in to unmask]] On Behalf Of
Jonathan Gorman
Sent: Tuesday, May 23, 2006 9:54 AM
To: [log in to unmask]
Subject: Re: [CODE4LIB] Musings on using xISBN in our Horizon catalog
>
> That's why I'd love to know whether the xISBN database uses a common
> identifier for each set of ISBNs, and whether (and I know 'pretty
> please' is a poor justification for changing an API) it might be
> exposed for this reason.
>
Hopefully the OCLC people can answer that. It might be in the work Andy
suggested yesterday. One idea I had yesterday was that if you don't
care that much about the id internally, you could use an auto-increment.
To clarify, we'll assume that any ISBN in a set will return the same set
in xISBN.
I.e., asking for ISBNs related to a returns a, b, and c. Asking for b
or c should also return a, b, and c.
So we can do as Andy suggested and start building our table by taking
the set of all current ISBNs, normalized a bit I'd imagine.
In a computationally expensive method:
Start with the first ISBN (x) and get the set of related ISBNs (A) from
xISBN. Iterate over every member of A, testing whether the member is
already assigned to a group. If it is, stop the loop and assign x to
the same group. If none in A have been assigned a group, start a new
group and add x.
You'll have to do this every once in a while to make sure you're getting
all the new books.
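
As a rough sketch of that loop: fetch_related below is a hypothetical
stand-in for whatever call you make to xISBN, and assign_groups is a
made-up name. Skipping ISBNs that already have a group is my own
addition, so that the periodic re-run only touches new books.

```python
from itertools import count


def assign_groups(isbns, fetch_related, groups=None):
    """Assign an auto-increment group id to every ISBN in isbns.

    fetch_related: hypothetical callable, ISBN -> iterable of related
                   ISBNs (stands in for an xISBN request).
    groups:        existing ISBN -> group id table, for periodic re-runs.
    """
    groups = {} if groups is None else groups
    # Continue numbering after the highest id already in the table.
    next_id = count(max(groups.values(), default=0) + 1)
    for x in isbns:
        if x in groups:
            continue  # already assigned on an earlier pass
        group_id = None
        for member in fetch_related(x):
            if member in groups:
                group_id = groups[member]  # reuse the existing group
                break
        if group_id is None:
            group_id = next(next_id)  # nobody in A has a group yet
        groups[x] = group_id
    return groups
```

This relies on the assumption above that every ISBN in a set returns the
same set; if that ever fails, two ids could end up covering one work.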
Hopefully this makes up for the advice I gave yesterday ;). I'm
sure you can probably come up with a better algorithm though;
something about the backward lookup every time makes me think that
there's a better way.
ps. Andy's right, normalization is a good, good thing. The only reason
I suggested looking at the costs was that I was thinking it would be a
lot easier than trying to come up with a method to generate unique ids
for a "group". Since my grasp of FRBR/xISBN is a little shaky, I'll
avoid any specific terminology.
(Like I said in my original email, having an identifier for groups is a
definite advantage.)