Yup,

Chris' email was exactly what I was hoping for.  Now if only there were a
nice way to pre-screen for records that have a non-empty (isbn|issn|oclc#)
without all the work of looking per record (and the overhead for the
server, which grows if more than one organization starts to do this).

I guess I want to search for uniqueID != NULL and only get their unique id
back, and script from there.

Still and all, this now seems a very doable thing.

Chris, many thanks!
-t

On Mon, 25 Feb 2008, Tennant,Roy wrote:

> Well, from where Chris left off it would be fairly easy to check for a
> file in the directory with a "_marc.xml" filename suffix, then XSLT
> for:
>
> <datafield tag="010" ind1=" " ind2=" ">
> <subfield code="a">39004822</subfield>
> </datafield>
>
> If such a field exists, then you'll have the LCCN (MARC tag 010; an ISBN,
> when present, is in tag 020). To sweeten it further, send the ISBN into
> xISBN or ThingISBN and get other ISBNs for the same work.
> This seems completely scriptable to me. Perhaps someone at c4l will have
> it done before the conference is over. And Tim, the example above is one
> that's in your catalog.
> Roy
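A minimal Python sketch of the lookup Roy describes, using the standard library rather than XSLT. It assumes the item's _marc.xml file is MARCXML with datafield/subfield elements like the snippet above; in MARC 21, tag 010 is the LCCN, 020 the ISBN, and 022 the ISSN. The function name and the sample record wrapper are illustrative, not anything IA publishes.

```python
import xml.etree.ElementTree as ET

# MARC tags that carry standard identifiers (subfield "a" in each case).
ID_TAGS = {"010": "lccn", "020": "isbn", "022": "issn"}


def _local(tag):
    """Strip an XML namespace prefix, so namespaced and bare files both work."""
    return tag.rsplit("}", 1)[-1]


def standard_ids(marcxml_text):
    """Return {"lccn": [...], "isbn": [...], "issn": [...]} for one record."""
    found = {name: [] for name in ID_TAGS.values()}
    root = ET.fromstring(marcxml_text)
    for field in root.iter():
        if _local(field.tag) != "datafield":
            continue
        name = ID_TAGS.get(field.get("tag"))
        if name is None:
            continue
        for sub in field:
            if _local(sub.tag) == "subfield" and sub.get("code") == "a" and sub.text:
                found[name].append(sub.text.strip())
    return found


# The datafield is the one quoted above; the <record> wrapper is assumed.
SAMPLE = """<record>
  <datafield tag="010" ind1=" " ind2=" ">
    <subfield code="a">39004822</subfield>
  </datafield>
</record>"""

print(standard_ids(SAMPLE))
```

Records with every list empty are the ones that cannot be matched on a standard identifier, which is the pre-screening question, though it still costs one fetch per record.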
>
> -----Original Message-----
> From: Code for Libraries [mailto:[log in to unmask]] On Behalf Of
> Chris Freeland
> Sent: Monday, February 25, 2008 11:51 AM
> To: [log in to unmask]
> Subject: Re: [CODE4LIB] oca api?
>
> Steve & Tim,
>
> I'm the tech director for the Biodiversity Heritage Library (BHL), which
> is a consortium of 10 natural history libraries who have partnered with
> Internet Archive (IA)/OCA for scanning our collections.  We've just
> launched our revamped portal, complete with more than 7,500 books & 2.8
> million pages scanned by IA & other digitization partners, at:
> http://www.biodiversitylibrary.org
>
> To build this portal we ingest metadata from IA.  We found their OAI
> interface to pull scanned items inconsistently based on date of
> scanning, so we switched to using their custom query interface.  Here's
> an example of a query we fire off:
>
> http://www.archive.org/services/search.php?query=collection:(biodiversity)+AND+updatedate:%5b2007-10-31+TO+2007-11-30%5d+AND+-contributor:(MBLWHOI%20Library)&limit=10&submit=submit
>
> This is returning scanned items from the "biodiversity" collection,
> updated between 10/31/2007 - 11/30/2007, restricted to one of our
> contributing libraries (MBLWHOI Library), and limited to 10 results.
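A rough Python sketch for building a query URL of this shape. The parameter names (query, limit, submit) and the field names are copied from Chris's example, not from any documentation, and build_search_url is a made-up helper. The leading "-" on the contributor clause is reproduced as written in the example; in Lucene-style query syntax a leading minus usually negates a clause, so check which behavior you actually get.

```python
from urllib.parse import quote

SEARCH_URL = "http://www.archive.org/services/search.php"


def build_search_url(collection, start, end, contributor=None, limit=10):
    """Build a search.php URL shaped like the example query.

    Dates are YYYY-MM-DD strings. Everything about the query syntax is
    inferred from the one example URL in this thread.
    """
    clauses = [
        f"collection:({collection})",
        f"updatedate:[{start} TO {end}]",
    ]
    if contributor:
        # Leading "-" kept exactly as it appears in the example.
        clauses.append(f"-contributor:({contributor})")
    query = " AND ".join(clauses)
    # Leave ":", "(" and ")" readable; spaces and brackets get percent-encoded.
    encoded = quote(query, safe=":()")
    return f"{SEARCH_URL}?query={encoded}&limit={limit}&submit=submit"


print(build_search_url("biodiversity", "2007-10-31", "2007-11-30",
                       contributor="MBLWHOI Library"))
```

This encodes spaces as %20 where the example uses "+"; both are conventional encodings of a space in a query string.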
>
> The results are styled in the browser; view source to see the good
> stuff.  We use this list to grab the identifiers we've yet to ingest.
>
> Some background: When a book is scanned through IA/OCA scanning, they
> create their own unique identifier (like "annalesacademiae21univ") and
> grab a MARC record from the contributing library's catalog.  All of the
> scanned files, derivatives, and metadata files are stored on IA's
> clusters in a directory named with the identifier.
>
> Steve mentioned using their /details/ directive, then sniffing the page
> to get the cluster location and the files for downloading.  An easier
> method is to use their /download/ directive, as in:
>
> http://www.archive.org/download/$ID, or in the example above:
> http://www.archive.org/download/annalesacademiae21univ
>
> That automatically does a lookup on the cluster, which means you don't
> have to scrape info off pages.  You can also address any files within
> that directory, as in:
> http://www.archive.org/download/annalesacademiae21univ/annalesacademiae21univ_marc.xml
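In script form the /download/ addressing might look like the Python sketch below. It assumes every item follows the identifier + "_marc.xml" naming seen in the one example; the helper names are invented, and fetch_marc is defined but not called here, so nothing touches the network.

```python
from urllib.request import urlopen

DOWNLOAD_BASE = "http://www.archive.org/download"


def item_url(identifier):
    """URL of the item's directory via the /download/ directive."""
    return f"{DOWNLOAD_BASE}/{identifier}"


def marc_url(identifier):
    """URL of the item's MARC XML, assuming the <identifier>_marc.xml naming."""
    return f"{item_url(identifier)}/{identifier}_marc.xml"


def fetch_marc(identifier, timeout=30):
    """Download the MARC XML as bytes (raises on HTTP or network errors)."""
    with urlopen(marc_url(identifier), timeout=timeout) as response:
        return response.read()


print(marc_url("annalesacademiae21univ"))
```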
>
> The only way to get standard identifiers (ISBN, ISSN, OCLC, LCCN) for
> these scanned books is to grab them out of the MARC record.  So the
> long-winded answer to your question, Tim, is no, there's no simple way
> to crossref what IA has scanned with your catalog - THAT I KNOW OF.  Big
> caveat on that last part.
>
> Happy to help with any other questions I can,
>
> Chris Freeland
>
>
> -----Original Message-----
> From: Code for Libraries [mailto:[log in to unmask]] On Behalf Of
> Steve Toub
> Sent: Sunday, February 24, 2008 11:20 PM
> To: [log in to unmask]
> Subject: Re: [CODE4LIB] oca api?
>
> --- Tim Shearer <[log in to unmask]> wrote:
>
>> Hi Folks,
>>
>> I'm looking into tapping the texts in the Open Content Alliance.
>>
>> A few questions...
>>
>> As near as I can tell, they don't expose (perhaps even store?) any common
>> unique identifiers (oclc number, issn, isbn, loc number).
>
> I poked around in this world a few months ago in my previous job at
> California Digital Library, also an OCA partner.
>
> The unique key seems to be a text string identifier (one that seems to be
> completely different from the text string identifier in Open Library).
> Apparently there was talk at the last partner meeting about moving to
> ISBNs:
> http://dilettantes.code4lib.org/2007/10/22/tales-from-the-open-content-alliance/
>
> To obtain identifiers in bulk, I think the recommended approach is the
> OAI-PMH interface, which seems more reliable in recent months:
>
> http://www.archive.org/services/oai.php?verb=Identify
>
> http://www.archive.org/services/oai.php?verb=ListIdentifiers&metadataPrefix=oai_dc&set=collection:cdl
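A sketch of pulling identifiers out of a ListIdentifiers response in Python. The element names and namespace come from the OAI-PMH 2.0 specification; the sample response, the "oai:archive.org:" identifier form, and the token value are all made up for illustration, so check them against what oai.php actually returns.

```python
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"


def parse_list_identifiers(xml_text):
    """Return (identifiers, resumption_token) from a ListIdentifiers response.

    resumption_token is None when the list is complete.
    """
    root = ET.fromstring(xml_text)
    identifiers = [
        header.findtext(f"{OAI}identifier")
        for header in root.iter(f"{OAI}header")
        if header.get("status") != "deleted"  # skip deleted records
    ]
    token = root.findtext(f".//{OAI}resumptionToken")
    return identifiers, (token or None)


# Hypothetical response body, trimmed to the parts the parser reads.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListIdentifiers>
    <header>
      <identifier>oai:archive.org:chemicallecturee00newtrich</identifier>
    </header>
    <resumptionToken>hypothetical-token</resumptionToken>
  </ListIdentifiers>
</OAI-PMH>"""

print(parse_list_identifiers(SAMPLE))
```

Per the OAI-PMH spec, a non-empty token is sent back as verb=ListIdentifiers&resumptionToken=... (with no other arguments) to get the next batch, until a response arrives with an empty or missing token.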
>
> etc.
>
>
> Additional instructions if you want to grab the content files.
>
> From any book's metadata page (e.g.,
> http://www.archive.org/details/chemicallecturee00newtrich)
> click through on the "Usage Rights: See Terms" link; the rights are on a
> pane on the left-hand side.
>
> Once you know the identifier, you can grab the content files, using this
> syntax:
>    http://www.archive.org/details/$ID
> Like so:
>    http://www.archive.org/details/chemicallecturee00newtrich
>
> And then sniff the page to find the FTP link:
>    ftp://ia340915.us.archive.org/2/items/chemicallecturee00newtrich
>
> But I think they prefer to use HTTP for these, not the FTP, so switch
> this to:
>    http://ia340915.us.archive.org/2/items/chemicallecturee00newtrich
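The FTP-to-HTTP switch is a one-liner once the link has been sniffed off the details page. A small Python sketch, assuming the host and path stay the same and only the scheme changes, as in the example above (ftp_to_http is an invented name):

```python
from urllib.parse import urlsplit


def ftp_to_http(url):
    """Swap an ftp:// scheme for http://, leaving host and path untouched."""
    parts = urlsplit(url)
    if parts.scheme != "ftp":
        return url  # already HTTP (or something else); leave it alone
    return parts._replace(scheme="http").geturl()


print(ftp_to_http("ftp://ia340915.us.archive.org/2/items/chemicallecturee00newtrich"))
```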
>
> Hope this helps!
>
>  --SET
>
>
>> We're a contributor so I can use curl to grab our records via http (and
>> regexp my way to our local catalog identifiers, which they do
>> store/expose).
>>
>> I've played a bit with the z39.50 interface at indexdata
>> (http://www.indexdata.dk/opencontent/), but I'm not confident about the
>> content behind it.  I get very limited results, for instance I can't find
>> any UNC records and we're fairly new to the game.
>>
>> Again, I'm looking for unique identifiers in what I can get back and it's
>> slim pickings.
>>
>> Anyone cracked this nut?  Got any life lessons for me?
>>
>> Thanks!
>> Tim
>>
>> +++++++++++++++++++++++++++++++++++++++++++
>> Tim Shearer
>>
>> Web Development Coordinator
>> The University Library
>> University of North Carolina at Chapel Hill [log in to unmask]
>> 919-962-1288
>> +++++++++++++++++++++++++++++++++++++++++++
>>
>