Well, from where Chris left off it would be fairly easy to check for a
file in the directory with a "_marc.xml" filename suffix, then use XSLT
to look for:

<datafield tag="010" ind1=" " ind2=" ">
  <subfield code="a">39004822</subfield>
</datafield>

If such exists, then you'll have the LCCN (MARC field 010 holds the
LCCN; an ISBN, where present, is in field 020). To sweeten it further,
send an ISBN into xISBN or ThingISBN and get other ISBNs for the same
work.
This seems completely scriptable to me. Perhaps someone at c4l will have
it done before the conference is over. And Tim, the example above is one
that's in your catalog.
Roy
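
The lookup described above could also be done without XSLT. Here is a
minimal sketch in Python using the standard library's ElementTree in
place of a stylesheet. The function name subfield_values and the SAMPLE
record wrapper are made up for illustration; only the datafield itself
comes from the message above. It matches on local element names, on the
assumption that some records carry the MARC21 namespace and some do not.

```python
import xml.etree.ElementTree as ET


def local_name(element):
    """Strip any '{namespace}' prefix from an element tag."""
    return element.tag.rsplit("}", 1)[-1]


def subfield_values(marc_xml, tag, code):
    """Return the text of every subfield `code` inside datafield `tag`."""
    root = ET.fromstring(marc_xml)
    values = []
    for element in root.iter():
        if local_name(element) != "datafield" or element.get("tag") != tag:
            continue
        for sub in element:
            if local_name(sub) == "subfield" and sub.get("code") == code:
                if sub.text and sub.text.strip():
                    values.append(sub.text.strip())
    return values


# Hypothetical wrapper around the datafield quoted in the message.
SAMPLE = """<record>
  <datafield tag="010" ind1=" " ind2=" ">
    <subfield code="a">39004822</subfield>
  </datafield>
</record>"""

print(subfield_values(SAMPLE, "010", "a"))  # ['39004822']
```

The same call with tag "020" would pull ISBNs from records that have
them; it returns an empty list when the field is absent.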

-----Original Message-----
From: Code for Libraries [mailto:[log in to unmask]] On Behalf Of
Chris Freeland
Sent: Monday, February 25, 2008 11:51 AM
To: [log in to unmask]
Subject: Re: [CODE4LIB] oca api?

Steve & Tim,

I'm the tech director for the Biodiversity Heritage Library (BHL), which
is a consortium of 10 natural history libraries who have partnered with
Internet Archive (IA)/OCA for scanning our collections.  We've just
launched our revamped portal, complete with more than 7,500 books & 2.8
million pages scanned by IA & other digitization partners, at:
http://www.biodiversitylibrary.org

To build this portal we ingest metadata from IA.  We found their OAI
interface to pull scanned items inconsistently based on date of
scanning, so we switched to using their custom query interface.  Here's
an example of a query we fire off:

http://www.archive.org/services/search.php?query=collection:(biodiversity)+AND+updatedate:%5b2007-10-31+TO+2007-11-30%5d+AND+-contributor:(MBLWHOI%20Library)&limit=10&submit=submit

This is returning scanned items from the "biodiversity" collection,
updated between 10/31/2007 - 11/30/2007, restricted to one of our
contributing libraries (MBLWHOI Library), and limited to 10 results.
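
For anyone who wants to generate queries like this rather than
hand-edit them, here is a rough sketch in Python. It only builds the
URL and makes no request. The function name build_search_url and its
parameters are invented for illustration, and it assumes the endpoint
accepts standard percent-encoding (so ":" and "(" come out escaped,
which decodes to the same query string as the hand-written URL).

```python
from urllib.parse import parse_qs, urlencode, urlsplit

SEARCH_ENDPOINT = "http://www.archive.org/services/search.php"


def build_search_url(collection, start, end, extra_clause=None, limit=10):
    """Build a search URL for items in `collection` updated in a date range.

    `extra_clause` is appended verbatim with AND, e.g. a contributor clause.
    """
    query = f"collection:({collection}) AND updatedate:[{start} TO {end}]"
    if extra_clause:
        query += f" AND {extra_clause}"
    params = {"query": query, "limit": limit, "submit": "submit"}
    return SEARCH_ENDPOINT + "?" + urlencode(params)


url = build_search_url(
    "biodiversity",
    "2007-10-31",
    "2007-11-30",
    extra_clause="-contributor:(MBLWHOI Library)",
)
print(url)
# Decode it again to check that the query survived the round trip.
print(parse_qs(urlsplit(url).query)["query"][0])
```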

The results are styled in the browser; view source to see the good
stuff.  We use this list to grab the identifiers we've yet to ingest.

Some background: When a book is scanned through IA/OCA scanning, they
create their own unique identifier (like "annalesacademiae21univ") and
grab a MARC record from the contributing library's catalog.  All of the
scanned files, derivatives, and metadata files are stored on IA's
clusters in a directory named with the identifier.

Steve mentioned using their /details/ directive, then sniffing the page
to get the cluster location and the files for downloading.  An easier
method is to use their /download/ directive, as in:

http://www.archive.org/download/$ID, or in the example above:
http://www.archive.org/download/annalesacademiae21univ

That automatically does a lookup on the cluster, which means you don't
have to scrape info off pages.  You can also address any files within
that directory, as in:
http://www.archive.org/download/annalesacademiae21univ/annalesacademiae21univ_marc.xml
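
As a small illustration of the /download/ addressing, here is a sketch
in Python that builds these URLs from an identifier. The helper names
download_url and marc_xml_url are made up, and the "<identifier>_marc.xml"
naming is an assumption generalized from the single example above; it
may not hold for every item.

```python
from urllib.parse import quote

DOWNLOAD_BASE = "http://www.archive.org/download"


def download_url(identifier, filename=None):
    """Return the /download/ URL for an item, or for one file inside it."""
    url = f"{DOWNLOAD_BASE}/{quote(identifier, safe='')}"
    if filename:
        url += "/" + quote(filename)
    return url


def marc_xml_url(identifier):
    """Guess the MARC XML location, assuming '<identifier>_marc.xml'."""
    return download_url(identifier, f"{identifier}_marc.xml")


print(download_url("annalesacademiae21univ"))
print(marc_xml_url("annalesacademiae21univ"))
```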

The only way to get standard identifiers (ISBN, ISSN, OCLC, LCCN) for
these scanned books is to grab them out of the MARC record.  So the
long-winded answer to your question, Tim, is no, there's no simple way
to crossref what IA has scanned with your catalog - THAT I KNOW OF.  Big
caveat on that last part.

Happy to help with any other questions I can,

Chris Freeland


-----Original Message-----
From: Code for Libraries [mailto:[log in to unmask]] On Behalf Of
Steve Toub
Sent: Sunday, February 24, 2008 11:20 PM
To: [log in to unmask]
Subject: Re: [CODE4LIB] oca api?

--- Tim Shearer <[log in to unmask]> wrote:

> Hi Folks,
>
> I'm looking into tapping the texts in the Open Content Alliance.
>
> A few questions...
>
> As near as I can tell, they don't expose (perhaps even store?) any
> common unique identifiers (oclc number, issn, isbn, loc number).

I poked around in this world a few months ago in my previous job at
California Digital Library, also an OCA partner.

The unique key seems to be a text string identifier (one that seems to
be completely different from the text string identifier in Open Library).
Apparently there was talk at the last partner meeting about moving to
ISBNs:
http://dilettantes.code4lib.org/2007/10/22/tales-from-the-open-content-alliance/

To obtain identifiers in bulk, I think the recommended approach is the
OAI-PMH interface, which seems more reliable in recent months:

http://www.archive.org/services/oai.php?verb=Identify

http://www.archive.org/services/oai.php?verb=ListIdentifiers&metadataPrefix=oai_dc&set=collection:cdl

etc.
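
A rough sketch of how the ListIdentifiers step might be scripted in
Python follows. It builds the request URL and parses a response body,
but does no fetching. The function names and the SAMPLE_RESPONSE
document (including the identifier format and the token value) are
invented for illustration; only the endpoint, the verb, and the set
name come from the URLs above, and the namespace is the standard
OAI-PMH 2.0 one.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

OAI_ENDPOINT = "http://www.archive.org/services/oai.php"
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def list_identifiers_url(set_spec, metadata_prefix="oai_dc"):
    """Build a ListIdentifiers request URL for one set."""
    params = {
        "verb": "ListIdentifiers",
        "metadataPrefix": metadata_prefix,
        "set": set_spec,
    }
    # Keep ':' unescaped so the URL matches the hand-written form.
    return OAI_ENDPOINT + "?" + urlencode(params, safe=":")


def parse_identifiers(response_xml):
    """Return (identifiers, resumption_token) from a ListIdentifiers response."""
    root = ET.fromstring(response_xml)
    identifiers = [
        element.text.strip()
        for element in root.findall(".//oai:header/oai:identifier", OAI_NS)
        if element.text
    ]
    token = root.find(".//oai:resumptionToken", OAI_NS)
    has_token = token is not None and token.text and token.text.strip()
    return identifiers, (token.text.strip() if has_token else None)


# Made-up response body, just to exercise the parser.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListIdentifiers>
    <header>
      <identifier>oai:archive.org:chemicallecturee00newtrich</identifier>
    </header>
    <resumptionToken>example-token</resumptionToken>
  </ListIdentifiers>
</OAI-PMH>"""

print(list_identifiers_url("collection:cdl"))
print(parse_identifiers(SAMPLE_RESPONSE))
```

When a resumption token comes back, OAI-PMH expects a follow-up request
carrying verb=ListIdentifiers and resumptionToken=... until no token is
returned.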


Additional instructions if you want to grab the content files.

From any book's metadata page (e.g.,
http://www.archive.org/details/chemicallecturee00newtrich)
click through on the "Usage Rights: See Terms" link; the rights are on a
pane on the left-hand side.

Once you know the identifier, you can grab the content files, using this
syntax:
    http://www.archive.org/details/$ID
Like so:
    http://www.archive.org/details/chemicallecturee00newtrich

And then sniff the page to find the FTP link:
    ftp://ia340915.us.archive.org/2/items/chemicallecturee00newtrich

But I think they prefer to use HTTP for these, not the FTP, so switch
this to:
    http://ia340915.us.archive.org/2/items/chemicallecturee00newtrich
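
If you are scripting this, the FTP-to-HTTP switch is just a scheme
swap. A minimal sketch in Python, with a made-up helper name
(ftp_to_http) and assuming the host and path are otherwise identical
between the two, as in the example above:

```python
from urllib.parse import urlsplit


def ftp_to_http(url):
    """Rewrite an ftp:// link as http://, leaving other URLs untouched."""
    parts = urlsplit(url)
    if parts.scheme != "ftp":
        return url
    return parts._replace(scheme="http").geturl()


print(ftp_to_http("ftp://ia340915.us.archive.org/2/items/chemicallecturee00newtrich"))
```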

Hope this helps!

  --SET


> We're a contributor so I can use curl to grab our records via http
> (and regexp my way to our local catalog identifiers, which they do
> store/expose).
>
> I've played a bit with the z39.50 interface at indexdata
> (http://www.indexdata.dk/opencontent/), but I'm not confident about
> the content behind it.  I get very limited results, for instance I
> can't find any UNC records and we're fairly new to the game.
>
> Again, I'm looking for unique identifiers in what I can get back and
> it's slim pickings.
>
> Anyone cracked this nut?  Got any life lessons for me?
>
> Thanks!
> Tim
>
> +++++++++++++++++++++++++++++++++++++++++++
> Tim Shearer
>
> Web Development Coordinator
> The University Library
> University of North Carolina at Chapel Hill [log in to unmask]
> 919-962-1288
> +++++++++++++++++++++++++++++++++++++++++++
>