This is great stuff. I am interested in what algorithms you are using to group works. It sounds like you are doing that, above what OCA does (which is nothing, I think). Have you gotten that far yet? What are you thinking?

Oh wait, you're from OCLC, you guys have already got all sorts of stuff to do that, I guess.

Jonathan

Tim McCormick wrote:
> In our office we too have been investigating the e-book material at
> Internet Archive / OCA.
>
> We'd like to build just the sort of OCA index / id-switcher that Tim
> Shearer and others have described on this list -- in order to, among
> other things, add this type of capability to our xID (aka xISBN)
> service, and to WorldCat.
>
> So, I thought I'd report on results so far, and what we're working on.
>
> Data:
> 1) First, we used the Internet Archive's OAI interface to harvest
> brief records of all items categorized as "text". We found that this
> yielded only very brief records, though -- author, title, and OCA
> unique identifier (e.g. "northcarolinayea1910rale").
> 2) Then we used the OCA identifier to check for, and harvest, MARC-XML
> records when available, using the lookup method described by Chris
> Freeland on Code4Lib on Feb 25.
> 3) The MARC files were examined for ISBNs and OCLCnums. (yes, we may
> look for other identifiers later).
>
> That yielded:
> - 290,756 total OCA "text" records found
> - 198,826 of those had MARC records
> - 1,773 had ISBNs
> - 88,537 had OCLC numbers (identified by record position & format,
> but not yet verified against WorldCat).
>
> Switching:
> In xID we currently support ISBN, have recently added LCCN, and we
> plan to release ISSN and OCLCnum support in upcoming releases. So,
> when those are fully phased in, the goal is that you could submit an
> identifier of any supported type, and get back all identifiers of
> whichever type that represent versions of the same "work"; or, when
> appropriate, the same manifestation.
> Therefore, the 88,537 OCLCnums will likely map to a much larger
> set of identifiers overall, allowing a lot of book records -- in
> library catalogs or elsewhere -- to hook into OCA materials.
>
> Free-text service:
> We imagine a service which, given an identifier, attempts to decide if
> a free-text version of the described work is available at OCA/IA, and
> if so, returns an access URL for that resource.
>
> Other work:
> We are investigating the case of free/open resources that lack
> standard identifiers -- for example, possibly, the 2/3 of IA texts for
> which we didn't find OCLCnum or ISBN. Here, we are looking at doing
> "best-guess" lookup of related identifiers, based on author and title
> information in the brief record. This might allow substantially
> broader indexing of open content materials, but the reliability of the
> identifier association is lower.
>
> Any tips, questions, suggestions, requests are welcome.
> Thanks to Xiaoming Liu and Tom Ventimiglia in the OCLC New Jersey office
> for work on this.
>
> Tim
>
> --
> Tim McCormick
> Product Manager (xID), OCLC New Jersey
> Email: mccormit (at) oclc.org
> 2 Broad St., Suite 208, Bloomfield, New Jersey 07003 USA
> Phone: +1.973.868.5694 | Skype: tim_mccormick
> http://www.oclc.org/
>
>

--
Jonathan Rochkind
Digital Services Software Engineer
The Sheridan Libraries
Johns Hopkins University
410.516.8886
rochkind (at) jhu.edu
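
For step 3 of the quoted message (examining the MARC files for ISBNs and OCLCnums), a minimal Python sketch might look like the following. The field choices are assumptions, not a description of what OCLC actually did: the message only says the OCLC numbers were "identified by record position & format". Here ISBNs are read from 020 $a and OCLC numbers from 035 $a values carrying the conventional "(OCoLC)" prefix. The function name extract_identifiers is hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Namespace used by MARC-XML ("MARC21 slim") records.
NS = "{http://www.loc.gov/MARC21/slim}"

# 10-digit ISBN (last character may be X) or 13-digit ISBN, hyphens removed.
ISBN_RE = re.compile(r"\d{9}[\dXx]|\d{13}")

# Assumption: OCLC numbers appear as "(OCoLC)" plus an optional
# ocm/ocn/on prefix and optional leading zeros.
OCLC_RE = re.compile(r"\(OCoLC\)\s*(?:ocm|ocn|on)?\s*0*(\d+)")


def extract_identifiers(marcxml):
    """Return ISBNs (020 $a) and OCLC numbers (035 $a) found in a MARC-XML string."""
    root = ET.fromstring(marcxml)
    found = {"isbn": [], "oclcnum": []}
    for field in root.iter(NS + "datafield"):
        tag = field.get("tag")
        for sub in field.findall(NS + "subfield"):
            value = (sub.text or "").strip()
            if sub.get("code") != "a" or not value:
                continue
            if tag == "020":
                # 020 $a often has a trailing qualifier such as "(pbk.)";
                # keep only the first token and drop hyphens.
                token = value.split()[0].replace("-", "")
                if ISBN_RE.fullmatch(token):
                    found["isbn"].append(token.upper())
            elif tag == "035":
                match = OCLC_RE.match(value)
                if match:
                    found["oclcnum"].append(match.group(1))
    return found
```

Calling extract_identifiers() on each harvested MARC-XML record and counting the records with a non-empty "isbn" or "oclcnum" list would give tallies of the kind reported above. As the message notes, numbers found this way would still need to be verified against WorldCat.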