I've been out, and so haven't been following this, but I'll try to explain what xISBN is doing.

The xISBN service runs from a fairly simple file. We generate clusters of ISBNs based on FRBR work-set groupings and refine them so that each ISBN is in only a single cluster. Each ISBN in each cluster of size 2 or more is written to a file, followed by the rest of the ISBNs in that cluster, sorted by number of library holdings. That file is sorted by ISBN, and we currently read the whole file into memory as one big string. We search the string to find the proper group using some simple indexes and binary search. Since all the information is in memory and all that has to be done is to reformat it into XML, the service is very fast (but sits there with several hundred megabytes of data).

Hope that's a clear enough explanation of how we're currently doing it.

The main computation is the refinement of the initial clusters. It takes around an hour to run on our 48-cpu cluster, about as long as the initial FRBR grouping on WorldCat.

--Th

-----Original Message-----
From: Code for Libraries [mailto:[log in to unmask]] On Behalf Of Jonathan Gorman
Sent: Tuesday, May 23, 2006 9:54 AM
To: [log in to unmask]
Subject: Re: [CODE4LIB] Musings on using xISBN in our Horizon catalog

>
> That's why I'd love to know whether the xISBN database uses a common
> identifier for each set of ISBNs, and whether (and I know 'pretty
> please' is a poor justification for changing an API) it might be exposed
> for this reason.
>

Hopefully the OCLC people can answer that. It might be in the work Andy suggested yesterday.

One idea I had yesterday was that if you don't care that much about the id internally, you could use an auto-increment. To clarify, we'll assume that any isbn in a set will return the same set in xISBN. I.e., asking for isbns related to a returns a, b and c. Asking for b or c should return a, b, c.
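For anyone trying to picture the file-plus-binary-search lookup Th describes, here is a rough sketch in Python. It is only a guess at the general shape, not OCLC's code: the one-line-per-ISBN layout with space-separated fields, the names `build_index` and `lookup`, and the placeholder ISBNs in `SAMPLE` are all assumptions made for illustration.

```python
# Hypothetical layout: one line per ISBN, the key first, then the rest of
# its cluster (already ordered by holdings). Lines are sorted by the key.
# The ISBNs below are placeholders, not real data.
SAMPLE = (
    "0000000011 0000000022 0000000033\n"
    "0000000022 0000000011 0000000033\n"
    "0000000033 0000000011 0000000022\n"
    "0000000044 0000000055\n"
    "0000000055 0000000044\n"
)


def build_index(data):
    """Return the starting offset of every line in the big string."""
    offsets = [0]
    pos = data.find("\n")
    while pos != -1 and pos + 1 < len(data):
        offsets.append(pos + 1)
        pos = data.find("\n", pos + 1)
    return offsets


def line_at(data, offsets, i):
    """Slice line i out of the big string without splitting the whole file."""
    start = offsets[i]
    end = data.find("\n", start)
    return data[start:] if end == -1 else data[start:end]


def lookup(data, offsets, isbn):
    """Binary search on the leading ISBN of each line.

    Returns the matching line as a list (key first), or [] if absent.
    """
    lo, hi = 0, len(offsets) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        fields = line_at(data, offsets, mid).split()
        key = fields[0] if fields else ""
        if key == isbn:
            return fields
        if key < isbn:
            lo = mid + 1
        else:
            hi = mid - 1
    return []


if __name__ == "__main__":
    index = build_index(SAMPLE)
    print(lookup(SAMPLE, index, "0000000022"))
```

Because every member of a cluster gets its own line, asking for any one of them finds the same set, which is the property assumed above.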
So we can do as Andy suggested and start building our table by taking the set of all current isbns, normalized a bit I'd imagine. In a computationally-expensive method:

Start with the first isbn (x) and get the set of related isbns from xISBN (A). Iterate over every member of A, testing for the following: is the member already assigned to a group? If it is, stop the loop and assign x to the same group. If none in A have been assigned a group, start a new group and add x. Then move on to the next isbn.

You'll have to do this every once in a while to make sure you're getting all the new books.

Hopefully this makes up for the advice I gave yesterday ;). I'm sure you can probably come up with a better algorithm though; something about the backward lookup every time makes me think that there's a better way.

ps. Andy's right, normalization is a good, good thing. The only reason I suggested looking at the costs was that I was thinking it would be a lot easier than trying to come up with a method to generate unique ids for a "group". Since my grasp of FRBR/xISBN is a little shaky, I'll avoid any specific terminology. (Like I said in my original email, having an identifier for groups is a definite advantage.)
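In case it helps, the grouping loop above might look roughly like this in Python. It is a sketch under assumptions, not tested against the real service: `related` is a hypothetical stand-in for whatever function you write to call xISBN (here it is faked with a dictionary), `assign_groups` is a made-up name, and the letters are placeholders for isbns.

```python
def assign_groups(isbns, related):
    """Map each isbn to an auto-increment group id.

    `related` is a hypothetical callable returning the list of isbns that
    xISBN reports for a given isbn (including the isbn itself).
    """
    groups = {}
    next_id = 1
    for x in isbns:
        if x in groups:
            continue
        group_id = None
        # The backward lookup: has any related isbn been grouped already?
        for member in related(x):
            if member in groups:
                group_id = groups[member]
                break
        if group_id is None:
            # Nothing in the set has a group yet, so start a new one.
            group_id = next_id
            next_id += 1
        groups[x] = group_id
    return groups


if __name__ == "__main__":
    # Fake xISBN responses using placeholder values.
    fake_sets = {
        "a": ["a", "b", "c"],
        "b": ["b", "a", "c"],
        "c": ["c", "a", "b"],
        "d": ["d"],
    }
    result = assign_groups(["a", "b", "c", "d"], lambda x: fake_sets.get(x, [x]))
    print(result)
```

This leans on the assumption stated earlier, that any isbn in a set returns the same set. If that holds, one possible shortcut would be to assign every member of A in one pass, but I haven't thought that through.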