This is pretty good stuff. Consider submitting an article proposal to
Code4Lib Journal about it. :)
Jonathan
Tim McCormick wrote:
> In our office we too have been investigating the e-book material at
> Internet Archive / OCA.
>
> We'd like to build just the sort of OCA index / id-switcher that Tim
> Shearer and others have described on this list -- in order to, among
> other things, add this type of capability to our xID (aka xISBN)
> service, and to WorldCat.
>
> So, I thought I'd report on results so far, and what we're working on.
>
> Data:
> 1) First, we used the Internet Archive's OAI interface to harvest
> brief records of all items categorized as "text". We found that this
> yielded only very brief records, though -- author, title, and OCA
> unique identifier (e.g. "northcarolinayea1910rale").
> 2) Then we used the OCA identifier to check for, and harvest, MARC-XML
> records when available, using the lookup method described by Chris
> Freeland on Code4Lib on Feb 25.
> 3) The MARC files were examined for ISBNs and OCLCnums. (Yes, we may
> look for other identifiers later.)
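
For anyone wanting to try step 3, here is a minimal sketch of pulling ISBNs and OCLC numbers out of one MARC-XML record. It is not the method used above (that one is described only as "record position & format"). It assumes ISBNs sit in 020 $a and OCLC numbers in 035 $a with an "(OCoLC)" prefix, which is a common convention but will miss records that store them elsewhere. The function name and the sample record are made up for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Standard MARC21 "slim" XML namespace.
MARC_NS = {"m": "http://www.loc.gov/MARC21/slim"}

# Made-up sample record for illustration only.
SAMPLE = """<record xmlns="http://www.loc.gov/MARC21/slim">
  <controlfield tag="001">12345</controlfield>
  <datafield tag="020" ind1=" " ind2=" ">
    <subfield code="a">0-306-40615-2 (pbk.)</subfield>
  </datafield>
  <datafield tag="035" ind1=" " ind2=" ">
    <subfield code="a">(OCoLC)ocm01234567</subfield>
  </datafield>
</record>"""


def extract_identifiers(marcxml):
    """Return ISBNs (020 $a) and OCLC numbers (035 $a) found in a record."""
    root = ET.fromstring(marcxml)
    isbns, oclcnums = [], []

    # Assumption: the ISBN is the leading run of digits, X and hyphens in 020 $a.
    for sf in root.findall(".//m:datafield[@tag='020']/m:subfield[@code='a']", MARC_NS):
        m = re.match(r"[0-9Xx-]{10,17}", (sf.text or "").strip())
        if m:
            isbns.append(m.group(0).replace("-", "").upper())

    # Assumption: OCLC numbers look like "(OCoLC)ocm01234567" or "(OCoLC)1234567".
    for sf in root.findall(".//m:datafield[@tag='035']/m:subfield[@code='a']", MARC_NS):
        m = re.match(r"\(OCoLC\)\D*(\d+)", (sf.text or "").strip())
        if m:
            oclcnums.append(m.group(1).lstrip("0") or "0")

    return {"isbn": isbns, "oclcnum": oclcnums}
```

Calling extract_identifiers(SAMPLE) gives the hyphen-stripped ISBN and the OCLC number without its prefix or leading zeros. Numbers found this way would still need checking against WorldCat, as noted above.
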
>
> That yielded:
> - 290,756 total OCA "text" records found
> - 198,826 of those had MARC records
> - 1,773 had ISBNs
> - 88,537 had OCLC numbers (identified by record position & format,
> but not yet verified against WorldCat).
>
> Switching:
> In xID we currently support ISBN, have recently added LCCN, and we
> plan to release ISSN and OCLCnum support in upcoming releases. So,
> when those are fully phased in, the goal is that you could submit an
> identifier of any supported type, and get back all identifiers of
> whichever type that represent versions of the same "work"; or, when
> appropriate, the same manifestation.
> Therefore, the 88,537 OCLCnums will likely map to a much larger
> set of identifiers overall, allowing a lot of book records -- in
> library catalogs or elsewhere -- to hook into OCA materials.
>
> Free-text service:
> We imagine a service which, given an identifier, attempts to decide if
> a free-text version of the described work is available at OCA/IA and,
> if so, returns an access URL for that resource.
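
A rough sketch of what such a lookup might look like, assuming the harvest has already been turned into an index from (identifier type, value) to OCA identifier. The index contents, the function name, and the OCLC number shown are placeholders, not real data. The only parts taken from the real world are the example OCA identifier quoted above and the archive.org "details" URL pattern.

```python
# Item pages at Internet Archive follow this pattern.
DETAILS_URL = "http://www.archive.org/details/{}"

# Hypothetical index: (identifier type, value) -> OCA identifier.
# The OCLC number here is a placeholder, not the real number for this item.
INDEX = {
    ("oclcnum", "1234567"): "northcarolinayea1910rale",
}


def access_url(id_type, value, index=INDEX):
    """Return an access URL if the identifier maps to an OCA item, else None."""
    oca_id = index.get((id_type.lower(), value.strip()))
    return DETAILS_URL.format(oca_id) if oca_id else None
```

A real service would presumably expand the submitted identifier to related identifiers first (the xID "switching" step) and then try each of them against the index. That step is left out here.
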
>
> Other work:
> We are investigating the case of free/open resources that lack
> standard identifiers -- for example, possibly, the 2/3 of IA texts for
> which we didn't find OCLCnum or ISBN. Here, we are looking at doing
> "best-guess" lookup of related identifiers, based on author and title
> information in the brief record. This might allow substantially
> broader indexing of open content materials, but the reliability of the
> identifier association is lower.
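
One way the "best-guess" matching could start is with a normalized author/title key, so that trivially different forms of the same citation compare equal. This is only a sketch under my own assumptions (surname = text before the first comma, first five title words, diacritics and punctuation dropped), not a description of what OCLC is doing. The function names are hypothetical.

```python
import re
import unicodedata


def normalize(text):
    """Lowercase, strip diacritics and punctuation, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", text)
    no_marks = "".join(c for c in decomposed if not unicodedata.combining(c))
    cleaned = re.sub(r"[^a-z0-9 ]+", " ", no_marks.lower())
    return " ".join(cleaned.split())


def match_key(author, title, title_words=5):
    """Build a coarse key from the author's surname and leading title words.

    Assumption: the author string is in "Surname, Forename, dates" order,
    so everything before the first comma is treated as the surname.
    """
    surname = normalize(author.split(",")[0])
    words = normalize(title).split()[:title_words]
    return surname + "|" + " ".join(words)
```

Two brief records whose keys are equal would be candidates for the same work, not confirmed matches. A key this coarse will collide for common surnames and short titles, which fits the lower reliability mentioned above.
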
>
> Any tips, questions, suggestions, requests are welcome.
> Thanks to Xiaoming Liu and Tom Ventimiglia in the OCLC New Jersey
> office for work on this.
>
> Tim
>
> --
> Tim McCormick
> Product Manager (xID), OCLC New Jersey
> Email: mccormit (at) oclc.org
> 2 Broad St., Suite 208, Bloomfield, New Jersey 07003 USA
> Phone: +1.973.868.5694 | Skype: tim_mccormick
> http://www.oclc.org/
>
>
--
Jonathan Rochkind
Digital Services Software Engineer
The Sheridan Libraries
Johns Hopkins University
410.516.8886
rochkind (at) jhu.edu