Tim - This is awesome work! One thing to be aware of is that IA takes a
non-hierarchical view of scanned books - there is no Title->Item
(Bib->Item) relationship. When they scan a serial or multivolume
monograph the MARCXML file for the Title is deposited in each scanned
Item.
For instance, the MARCXML for "The transactions of the Academy of
Science of St. Louis" is dropped into this item, which is volume 21:
http://www.archive.org/details/transactionsofac21acad
(Click the FTP link along the left, then the _marc.xml file.)
and this item, which is volume 22:
http://www.archive.org/details/transactionsofac22acad
You'll see they are identical files. So, your number of 198,826 MARC
files does not correspond to 198,826 titles. You will need to group
those MARC files by <leader> to get a true count of titles. This is
what BHL does when we ingest materials from
http://www.archive.org/details/biodiversity into
http://www.biodiversitylibrary.org/
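
A rough sketch of that grouping step, assuming the MARCXML for each
item has already been downloaded into a dict keyed by IA identifier.
The function names (leader_of, file_digest, group_items) are made up
for illustration and are not BHL's actual code. The key defaults to
the <leader>, as described above; because the files for one title are
identical, a digest of the whole file is offered as an alternative
key, which may be safer since the leader is a short fixed-length field.

```python
import hashlib
import xml.etree.ElementTree as ET
from collections import defaultdict

MARC_NS = "{http://www.loc.gov/MARC21/slim}"


def leader_of(marcxml):
    """Return the text of the first <leader> element, namespaced or not."""
    root = ET.fromstring(marcxml)
    for el in root.iter():
        if el.tag in (MARC_NS + "leader", "leader"):
            return el.text or ""
    return ""


def file_digest(marcxml):
    """Alternative key: identical files produce identical digests."""
    return hashlib.sha1(marcxml.encode("utf-8")).hexdigest()


def group_items(marc_by_item, key=leader_of):
    """Map each grouping key to the list of item ids that share it."""
    groups = defaultdict(list)
    for item_id, marcxml in marc_by_item.items():
        groups[key(marcxml)].append(item_id)
    return dict(groups)
```

The number of titles is then len(group_items(marc_by_item)), and each
value lists the scanned items (volumes) that belong to one title.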
Chris
-----Original Message-----
From: Code for Libraries [mailto:[log in to unmask]] On Behalf Of
Tim McCormick
Sent: Wednesday, March 12, 2008 3:58 PM
To: [log in to unmask]
Subject: Re: [CODE4LIB] musing on oca apiRe: [CODE4LIB] oca api?
In our office we too have been investigating the e-book material at
Internet Archive / OCA.
We'd like to build just the sort of OCA index / id-switcher that Tim
Shearer and others have described on this list -- in order to, among
other things, add this type of capability to our xID (aka xISBN)
service, and to WorldCat.
So, I thought I'd report on results so far, and what we're working on.
Data:
1) First, we used the Internet Archive's OAI interface to harvest
brief records of all items categorized as "text". We found that this
yielded only very brief records, though -- author, title, and OCA
unique identifier (e.g. "northcarolinayea1910rale").
2) Then we used the OCA identifier to check for, and harvest, MARC-XML
records when available, using the lookup method described by Chris
Freeland on Code4Lib on Feb 25.
3) The MARC files were examined for ISBNs and OCLCnums. (yes, we may
look for other identifiers later).
That yielded:
- 290,756 total OCA "text" records found
- 198,826 of those had MARC records
- 1,773 had ISBNs
- 88,537 had OCLC numbers (identified by record position & format,
but not yet verified against WorldCat).
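
For anyone wanting to try step 3 on their own harvest, here is a
minimal sketch. It is not our production code: the names (NS,
subfields, extract_ids) are invented for this example, and it assumes
ISBNs sit in 020 $a and OCLC numbers in 035 $a with an "(OCoLC)"
prefix, which is a common convention but only one of the places such
numbers can appear.

```python
import re
import xml.etree.ElementTree as ET

NS = {"m": "http://www.loc.gov/MARC21/slim"}


def subfields(record, tag, code):
    """Return the stripped text of every matching subfield in the record."""
    xpath = f".//m:datafield[@tag='{tag}']/m:subfield[@code='{code}']"
    return [sf.text.strip() for sf in record.findall(xpath, NS) if sf.text]


def extract_ids(marcxml):
    """Pull candidate ISBNs and OCLC numbers out of one MARCXML record."""
    record = ET.fromstring(marcxml)

    isbns = []
    for value in subfields(record, "020", "a"):
        # 020 $a often carries a qualifier, e.g. "0195019199 (pbk.)"
        m = re.match(r"[0-9Xx-]+", value)
        if m:
            isbns.append(m.group(0).replace("-", "").upper())

    oclc = []
    for value in subfields(record, "035", "a"):
        # Assumed convention: "(OCoLC)" plus an optional prefix such as "ocm"
        m = re.match(r"\(OCoLC\)\D*(\d+)", value)
        if m:
            oclc.append(m.group(1).lstrip("0") or "0")

    return {"isbn": isbns, "oclc": oclc}
```

Numbers found this way are still only candidates until they are
checked against WorldCat.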
Switching:
In xID we currently support ISBN, have recently added LCCN, and we
plan to release ISSN and OCLCnum support in upcoming releases. So,
when those are fully phased in, the goal is that you could submit an
identifier of any supported type and get back all identifiers, of
whatever type, that represent versions of the same "work" or, when
appropriate, the same manifestation.
Therefore, the 88,537 OCLCnums will likely map to a much larger
set of identifiers overall, allowing a lot of book records -- in
library catalogs or elsewhere -- to hook into OCA materials.
Free-text service:
We imagine a service which, given an identifier, attempts to decide if
a free-text version of the described work is available at OCA/IA and,
if so, returns an access URL for that resource.
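
To make the idea concrete, here is a toy sketch of such a lookup. It
does not describe a real xID interface: build_index and free_text_urls
are hypothetical names, and the index is just an in-memory dict built
from (identifier type, value, OCA identifier) rows. The URL pattern is
the archive.org details page. It returns a list because one identifier
can correspond to several scanned items, e.g. the volumes of a serial.

```python
DETAILS_URL = "http://www.archive.org/details/{oca_id}"


def build_index(rows):
    """Build {(id_type, value): [oca_id, ...]} from harvested rows."""
    index = {}
    for id_type, value, oca_id in rows:
        index.setdefault((id_type, value), []).append(oca_id)
    return index


def free_text_urls(index, id_type, value):
    """Return access URLs for the identifier, or an empty list if none."""
    return [DETAILS_URL.format(oca_id=oca_id)
            for oca_id in index.get((id_type, value), [])]
```

For example, with a made-up OCLC number:
free_text_urls(build_index([("oclc", "1234567",
"northcarolinayea1910rale")]), "oclc", "1234567")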
Other work:
We are investigating the case of free/open resources that lack
standard identifiers -- for example, possibly, the 2/3 of IA texts for
which we didn't find OCLCnum or ISBN. Here, we are looking at doing
"best-guess" lookup of related identifiers, based on author and title
information in the brief record. This might allow substantially
broader indexing of open content materials, but the reliability of the
identifier association is lower.
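
One simple way to approach the "best-guess" lookup, sketched under
heavy assumptions: normalize author and title strings, score each
candidate record with a string-similarity ratio, and accept the best
one only above a threshold. The names (normalize, similarity,
best_guess), the dict layout, and the 0.9 threshold are illustrative
choices, not what we have actually built.

```python
import re
import unicodedata
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase, strip accents and punctuation, collapse whitespace."""
    text = unicodedata.normalize("NFKD", text)
    text = "".join(c for c in text if not unicodedata.combining(c))
    text = re.sub(r"[^a-z0-9 ]+", " ", text.lower())
    return " ".join(text.split())


def similarity(a, b):
    """Similarity ratio in [0, 1] between two normalized strings."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def best_guess(brief, candidates, threshold=0.9):
    """Return (candidate, score); candidate is None below the threshold.

    brief and each candidate are dicts with 'author' and 'title' keys.
    The score is the weaker of the author and title similarities, so
    both fields have to agree for a match to be accepted.
    """
    best, best_score = None, 0.0
    for cand in candidates:
        score = min(similarity(brief["title"], cand["title"]),
                    similarity(brief["author"], cand["author"]))
        if score > best_score:
            best, best_score = cand, score
    if best_score >= threshold:
        return best, best_score
    return None, best_score
```

Returning the score alongside the match lets a caller flag
lower-confidence associations instead of treating them like matches
made on a standard identifier.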
Any tips, questions, suggestions, requests are welcome.
Thanks to Xiaoming Liu and Tom Ventimiglia in the OCLC New Jersey
office for their work on this.
Tim
--
Tim McCormick
Product Manager (xID), OCLC New Jersey
Email: mccormit (at) oclc.org
2 Broad St., Suite 208, Bloomfield, New Jersey 07003 USA
Phone: +1.973.868.5694 | Skype: tim_mccormick
http://www.oclc.org/