LISTSERV 16.5 - CODE4LIB Archives

Great idea, Tim!

The open library tech list that Bess mentions is [log in to unmask],
described at
http://mail.archive.org/cgi-bin/mailman/listinfo/ol-tech

-Jodi

Jodi Schneider
Science Library Specialist
Amherst College
413-542-2076

>-----Original Message-----
>From: Code for Libraries [mailto:[log in to unmask]] On
>Behalf Of Tim Shearer
>Sent: Thursday, March 06, 2008 8:47 AM
>To: [log in to unmask]
>Subject: [CODE4LIB] musing on oca apiRe: [CODE4LIB] oca api?
>
>Howdy folks,
>
>I've been playing and thinking.  I'd like to have what amounts to a unique
>identifier index to oca digitized texts.  I want to be able to pull all the
>records that have oclc numbers, issns, isbns, etc.  I want it to be
>lightweight, fast, searchable.
>
>Would anyone else want/use such a thing?
>
>I'm thinking about building something like this.
>
>If I do, it would be ideal if it weren't a duplication of effort, so has
>anyone got this in the works?  And would it meet the needs of others?
>
>My basic notion is to crawl the site (starting with "americana", the American
>Libraries), pull the oca unique identifier (e.g. northcarolinayea1910rale),
>and associate it with
>
>unique identifiers (oclc numbers, issns, isbns, lc numbers)
>contributing institution's alias and unique catalog identifier
>upload date
>
>That's all I was thinking of.  Then there's what you might be able to do with
>it:
>
>        Give me all the oca unique identifiers that have oclc numbers
>        Give me all the oca unique identifiers with isbns that were
>                uploaded between x and y date
>        Give me the oca unique identifier for this oclc number
>
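A minimal sketch of what such an index and the three lookups above might look like, assuming Python's standard-library sqlite3 module. The table names, column names, and function names are hypothetical, and dates are assumed to be stored as ISO 8601 strings.

```python
import sqlite3

# Hypothetical schema: one row per item, one row per standard identifier.
SCHEMA = """
CREATE TABLE IF NOT EXISTS item (
    oca_id      TEXT PRIMARY KEY,
    contributor TEXT,
    catalog_id  TEXT,
    upload_date TEXT   -- ISO 8601 (YYYY-MM-DD), so string order is date order
);
CREATE TABLE IF NOT EXISTS std_id (
    oca_id  TEXT NOT NULL REFERENCES item(oca_id),
    id_type TEXT NOT NULL,   -- 'oclc', 'issn', 'isbn' or 'lccn'
    value   TEXT NOT NULL,
    PRIMARY KEY (oca_id, id_type, value)
);
CREATE INDEX IF NOT EXISTS std_id_lookup ON std_id (id_type, value);
"""


def open_index(path=":memory:"):
    """Open (or create) the index database and make sure the tables exist."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def add_item(conn, oca_id, contributor, catalog_id, upload_date, std_ids):
    """Store one item; std_ids is an iterable of (id_type, value) pairs."""
    conn.execute("INSERT OR REPLACE INTO item VALUES (?, ?, ?, ?)",
                 (oca_id, contributor, catalog_id, upload_date))
    conn.executemany("INSERT OR IGNORE INTO std_id VALUES (?, ?, ?)",
                     [(oca_id, t, v) for t, v in std_ids])
    conn.commit()


def ids_with(conn, id_type, start=None, end=None):
    """All oca identifiers that have an id of this type, optionally by date range."""
    sql = ("SELECT DISTINCT item.oca_id FROM item "
           "JOIN std_id ON std_id.oca_id = item.oca_id "
           "WHERE std_id.id_type = ?")
    params = [id_type]
    if start is not None and end is not None:
        sql += " AND item.upload_date BETWEEN ? AND ?"
        params += [start, end]
    sql += " ORDER BY item.oca_id"
    return [row[0] for row in conn.execute(sql, params)]


def lookup(conn, id_type, value):
    """The oca identifier(s) recorded for one standard identifier."""
    rows = conn.execute(
        "SELECT oca_id FROM std_id WHERE id_type = ? AND value = ? "
        "ORDER BY oca_id", (id_type, value))
    return [row[0] for row in rows]
```

With this layout, ids_with(conn, "oclc") covers the first request, ids_with(conn, "isbn", x, y) the second, and lookup(conn, "oclc", number) the third.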
>Planning to do:
>
>        keep crawling it and keep it up to date.
>
>Things I wasn't planning to do:
>
>        worry about other unique ids (you'd have to go to xISBN or
>                ThingISBN yourself)
>        worry about storing anything else from oca.
>
>It would be good for being able to add an 856 to matches in your catalog. It
>would not be good for grabbing all marc records for all of oca.
>
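As a rough illustration of the 856 use case, the sketch below turns matches into plain-text 856 fields pointing at the /details/ page. The function names and the mapping of OCLC numbers to oca identifiers are hypothetical; the second indicator defaults to "1" (version of resource) as an assumption, and a real catalog would need the field built in whatever form its MARC tooling expects.

```python
DETAILS_URL = "http://www.archive.org/details/"


def make_856(oca_id, second_indicator="1"):
    """Return a plain-text 856 field linking to the item's /details/ page.

    First indicator 4 means the access method is HTTP. The second
    indicator is an assumption and is left for the caller to override.
    """
    return "856 4{} $u {}{}".format(second_indicator, DETAILS_URL, oca_id)


def fields_for_matches(catalog_oclc_numbers, oclc_to_oca):
    """Map each catalog OCLC number that has a match to its 856 field."""
    return {number: make_856(oclc_to_oca[number])
            for number in catalog_oclc_numbers
            if number in oclc_to_oca}
```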
>Anyhow, is this duplication of effort?  Would you like something like this?
>What else would you like it to do (keeping in mind this is an unfunded pet
>project)?  How would you want to talk to it?  I was thinking of a web service,
>but hadn't thought too much about how to query it or how I'd deliver results.
>
>Of course I'm being an idiot and trying out new tools at the same time (python
>to see what the buzz is all about, sqlite just to learn it (it may not work
>out)).
>
>Thoughts?  Vicious criticism?
>
>-t
>
>
>On Tue, 26 Feb 2008, Chris Freeland wrote:
>
>> My guess is that, yes, the query interface we've been discussing here
>> and the 'all sorts of interfaces that none of us knew about' are the
>> same.  It's not documented that I'm aware of.  We've found out about it
>> by literally sitting next to IA developers and asking questions.
>>
>> Chris
>> -----Original Message-----
>> From: Code for Libraries [mailto:[log in to unmask]] On Behalf Of
>> Jonathan Rochkind
>> Sent: Tuesday, February 26, 2008 12:18 PM
>> To: [log in to unmask]
>> Subject: Re: [CODE4LIB] oca api?
>>
>> So in answer to my question here at the Code4Lib conference, after
>> Brewster's keynote, Brewster suggests there are all sorts of interfaces
>> that none of us knew about. Or at least I didn't know about, and haven't
>> been able to figure out in months of trying!  I'm going to try and
>> corner him and ask for an email of who we should contact.
>>
>> Perhaps it's the XML interface that you guys know about already. Is that
>> documented anywhere? How the heck did you find out about it?
>>
>> Jonathan
>>
>>
>>>>> Steve Toub <[log in to unmask]> 02/25/08 9:41 PM >>>
>> I'll add that when IA told me about the
>> http://www.archive.org/services/search.php interface to return
>> XML, they asked that we not send more than 100 records at a time since
>> doing more would adversely affect production services. Which made it seem
>> like OAI-PMH was a better way to go.
>>
>> Chris, can you explain a bit more about what this means: "We found their
>> OAI interface to pull scanned items inconsistently based on date of
>> scanning...."? I'm having trouble parsing.
>>
>>
>>   --SET
>>
>>
>>
>>
>> --- Chris Freeland <[log in to unmask]> wrote:
>>
>>> Jonathan - No, I don't believe it's documented - at least not anywhere
>>> publicly.  If any IA/OCA folks are lurking, here's an opportunity to
>>> make a bunch of techies happy...
>>>
>>> Chris
>>>
>>> -----Original Message-----
>>> From: Code for Libraries [mailto:[log in to unmask]] On Behalf Of
>>> Jonathan Rochkind
>>> Sent: Monday, February 25, 2008 2:48 PM
>>> To: [log in to unmask]
>>> Subject: Re: [CODE4LIB] oca api?
>>>
>>> I hadn't known this "custom query interface" existed! This is welcome
>>> news. Is this documented anywhere?
>>>
>>> Jonathan
>>>
>>>
>>>>>> Chris Freeland <[log in to unmask]> 02/25/08 2:51 PM >>>
>>> Steve & Tim,
>>>
>>> I'm the tech director for the Biodiversity Heritage Library (BHL), which
>>> is a consortium of 10 natural history libraries who have partnered with
>>> Internet Archive (IA)/OCA for scanning our collections.  We've just
>>> launched our revamped portal, complete with more than 7,500 books & 2.8
>>> million pages scanned by IA & other digitization partners, at:
>>> http://www.biodiversitylibrary.org
>>>
>>> To build this portal we ingest metadata from IA.  We found their OAI
>>> interface to pull scanned items inconsistently based on date of
>>> scanning, so we switched to using their custom query interface.  Here's
>>> an example of a query we fire off:
>>>
>>> http://www.archive.org/services/search.php?query=collection:(biodiversity)+AND+updatedate:%5b2007-10-31+TO+2007-11-30%5d+AND+-contributor:(MBLWHOI%20Library)&limit=10&submit=submit
>>>
>>> This is returning scanned items from the "biodiversity" collection,
>>> updated between 10/31/2007 - 11/30/2007, restricted to one of our
>>> contributing libraries (MBLWHOI Library), and limited to 10 results.
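A hedged sketch of how a query URL like the one above could be assembled, assuming Python's urllib.parse. The function name and parameters are hypothetical. urlencode percent-encodes more characters than the hand-written example (for instance ":" and parentheses), which should be equivalent once decoded. The leading minus on the contributor clause is copied verbatim from the example; whether it restricts to or excludes that contributor is not something this sketch settles. The cap of 100 reflects the request, reported elsewhere in this thread, not to pull more than 100 records at a time.

```python
from urllib.parse import urlencode

SEARCH_URL = "http://www.archive.org/services/search.php"
MAX_LIMIT = 100  # per the request reported in this thread


def build_search_url(collection, start, end, contributor_clause=None, limit=10):
    """Build a search.php URL for one collection and an updatedate range.

    start and end are YYYY-MM-DD strings. contributor_clause, if given,
    is appended to the query exactly as passed in.
    """
    clauses = [
        "collection:({})".format(collection),
        "updatedate:[{} TO {}]".format(start, end),
    ]
    if contributor_clause:
        clauses.append(contributor_clause)
    params = {
        "query": " AND ".join(clauses),
        "limit": min(limit, MAX_LIMIT),
        "submit": "submit",
    }
    return SEARCH_URL + "?" + urlencode(params)
```

For the example above, the call would be build_search_url("biodiversity", "2007-10-31", "2007-11-30", "-contributor:(MBLWHOI Library)").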
>>>
>>> The results are styled in the browser; view source to see the good
>>> stuff.  We use this list to grab the identifiers we've yet to ingest.
>>>
>>> Some background: When a book is scanned through IA/OCA scanning, they
>>> create their own unique identifier (like "annalesacademiae21univ") and
>>> grab a MARC record from the contributing library's catalog.  All of the
>>> scanned files, derivatives, and metadata files are stored on IA's
>>> clusters in a directory named with the identifier.
>>>
>>> Steve mentioned using their /details/ directive, then sniffing the page
>>> to get the cluster location and the files for downloading.  An easier
>>> method is to use their /download/ directive, as in:
>>>
>>> http://www.archive.org/download/$ID, or in the example above:
>>> http://www.archive.org/download/annalesacademiae21univ
>>>
>>> That automatically does a lookup on the cluster, which means you don't
>>> have to scrape info off pages.  You can also address any files within
>>> that directory, as in:
>>>
>>> http://www.archive.org/download/annalesacademiae21univ/annalesacademiae21univ_marc.xml
>>>
>>> The only way to get standard identifiers (ISBN, ISSN, OCLC, LCCN) for
>>> these scanned books is to grab them out of the MARC record.  So the
>>> long-winded answer to your question, Tim, is no, there's no simple way
>>> to crossref what IA has scanned with your catalog - THAT I KNOW OF.  Big
>>> caveat on that last part.
>>>
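A sketch of pulling standard identifiers out of a MARC record, assuming the _marc.xml file is MARCXML and using only Python's standard library. The function names are hypothetical. The tag choices are the usual MARC 21 ones (010 LCCN, 020 ISBN, 022 ISSN, 035 with an "(OCoLC)" prefix for OCLC numbers), but where a given contributing library actually records its OCLC number is an assumption to check against real records.

```python
import xml.etree.ElementTree as ET

DOWNLOAD_URL = "http://www.archive.org/download/{id}/{id}_marc.xml"

# MARC tag -> label; every value is read from subfield $a.
TAGS = {"010": "lccn", "020": "isbn", "022": "issn", "035": "oclc"}
OCLC_PREFIX = "(OCoLC)"


def marc_url(oca_id):
    """URL of the MARC XML file inside the item's download directory."""
    return DOWNLOAD_URL.format(id=oca_id)


def local(tag):
    """Strip any XML namespace: '{ns}datafield' -> 'datafield'."""
    return tag.rsplit("}", 1)[-1]


def standard_ids(marcxml):
    """Return {'lccn': [...], 'isbn': [...], 'issn': [...], 'oclc': [...]}."""
    found = {label: [] for label in TAGS.values()}
    root = ET.fromstring(marcxml)
    for field in root.iter():
        if local(field.tag) != "datafield" or field.get("tag") not in TAGS:
            continue
        label = TAGS[field.get("tag")]
        for sub in field:
            if local(sub.tag) != "subfield" or sub.get("code") != "a":
                continue
            value = (sub.text or "").strip()
            if label == "oclc":
                # Only 035 values carrying the (OCoLC) prefix are kept.
                if not value.startswith(OCLC_PREFIX):
                    continue
                value = value[len(OCLC_PREFIX):].strip()
            if value:
                found[label].append(value)
    return found
```

Fetching marc_url(oca_id) and passing the response text to standard_ids() would give the values to load into a cross-reference index; the fetch is left out so the sketch runs without network access.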
>>> Happy to help with any other questions I can,
>>>
>>> Chris Freeland
>>>
>>>
>>> -----Original Message-----
>>> From: Code for Libraries [mailto:[log in to unmask]] On Behalf Of
>>> Steve Toub
>>> Sent: Sunday, February 24, 2008 11:20 PM
>>> To: [log in to unmask]
>>> Subject: Re: [CODE4LIB] oca api?
>>>
>>> --- Tim Shearer <[log in to unmask]> wrote:
>>>
>>>> Hi Folks,
>>>>
>>>> I'm looking into tapping the texts in the Open Content Alliance.
>>>>
>>>> A few questions...
>>>>
>>>> As near as I can tell, they don't expose (perhaps even store?) any common
>>>> unique identifiers (oclc number, issn, isbn, loc number).
>>>
>>> I poked around in this world a few months ago in my previous job at
>>> California Digital Library, also an OCA partner.
>>>
>>> The unique key seems to be a text string identifier (one that seems to be
>>> completely different from the text string identifier in Open Library).
>>> Apparently there was talk at the last partner meeting about moving to ISBNs:
>>>
>>> http://dilettantes.code4lib.org/2007/10/22/tales-from-the-open-content-alliance/
>>>
>>> To obtain identifiers in bulk, I think the recommended approach is the
>>> OAI-PMH interface, which seems more reliable in recent months:
>>>
>>> http://www.archive.org/services/oai.php?verb=Identify
>>>
>>> http://www.archive.org/services/oai.php?verb=ListIdentifiers&metadataPrefix=oai_dc&set=collection:cdl
>>>
>>> etc.
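A minimal sketch of working through ListIdentifiers responses in bulk, assuming the endpoint follows OAI-PMH 2.0 (responses in the standard OAI namespace, paging via resumptionToken). The function names are hypothetical, and fetching is left to the caller so the parsing can be tried on a saved response.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

OAI_URL = "http://www.archive.org/services/oai.php"
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def list_identifiers_url(set_spec=None, token=None):
    """URL for the first page (set_spec) or a follow-up page (token)."""
    if token:
        # In OAI-PMH a resumptionToken is sent on its own with the verb.
        params = {"verb": "ListIdentifiers", "resumptionToken": token}
    else:
        params = {"verb": "ListIdentifiers", "metadataPrefix": "oai_dc"}
        if set_spec:
            params["set"] = set_spec
    return OAI_URL + "?" + urlencode(params)


def parse_list_identifiers(xml_text):
    """Return (identifiers, resumption_token_or_None) for one response page."""
    root = ET.fromstring(xml_text)
    ids = [el.text.strip() for el in root.iter(OAI_NS + "identifier") if el.text]
    token_el = root.find(".//" + OAI_NS + "resumptionToken")
    token = None
    if token_el is not None and token_el.text and token_el.text.strip():
        token = token_el.text.strip()
    return ids, token
```

A harvest loop would start from list_identifiers_url("collection:cdl"), parse each page, and keep requesting list_identifiers_url(token=token) until the token comes back as None.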
>>>
>>>
>>> Additional instructions if you want to grab the content files.
>>>
>>> From any book's metadata page (e.g.,
>>> http://www.archive.org/details/chemicallecturee00newtrich)
>>> click through on the "Usage Rights: See Terms" link; the rights are on a
>>> pane on the left-hand side.
>>>
>>> Once you know the identifier, you can grab the content files, using this
>>> syntax:
>>>     http://www.archive.org/details/$ID
>>> Like so:
>>>     http://www.archive.org/details/chemicallecturee00newtrich
>>>
>>> And then sniff the page to find the FTP link:
>>>     ftp://ia340915.us.archive.org/2/items/chemicallecturee00newtrich
>>>
>>> But I think they prefer to use HTTP for these, not the FTP, so switch
>>> this to:
>>>     http://ia340915.us.archive.org/2/items/chemicallecturee00newtrich
>>>
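The sniff-and-switch step could be sketched as below, assuming the FTP link appears as a plain ftp:// URL somewhere in the /details/ page source. The function names and the regular expression are illustrative only, and the real page markup may call for something more careful.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Assumption: the link is a bare ftp:// URL ending at whitespace, a quote,
# or an angle bracket.
FTP_LINK = re.compile(r"ftp://[^\s\"'<>]+")


def find_ftp_link(details_html):
    """Return the first ftp:// URL found in the page source, or None."""
    match = FTP_LINK.search(details_html)
    return match.group(0) if match else None


def ftp_to_http(url):
    """Swap the scheme of an ftp:// URL for http://, keeping host and path."""
    parts = urlsplit(url)
    if parts.scheme != "ftp":
        raise ValueError("expected an ftp:// URL, got %r" % url)
    return urlunsplit(("http",) + tuple(parts[1:]))
```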
>>> Hope this helps!
>>>
>>>   --SET
>>>
>>>
>>>> We're a contributor so I can use curl to grab our records via http (and
>>>> regexp my way to our local catalog identifiers, which they do
>>>> store/expose).
>>>>
>>>> I've played a bit with the z39.50 interface at indexdata
>>>> (http://www.indexdata.dk/opencontent/), but I'm not confident about the
>>>> content behind it.  I get very limited results, for instance I can't find
>>>> any UNC records and we're fairly new to the game.
>>>>
>>>> Again, I'm looking for unique identifiers in what I can get back and it's
>>>> slim pickings.
>>>>
>>>> Anyone cracked this nut?  Got any life lessons for me?
>>>>
>>>> Thanks!
>>>> Tim
>>>>
>>>> +++++++++++++++++++++++++++++++++++++++++++
>>>> Tim Shearer
>>>>
>>>> Web Development Coordinator
>>>> The University Library
>>>> University of North Carolina at Chapel Hill
>>>> [log in to unmask]
>>>> 919-962-1288
>>>> +++++++++++++++++++++++++++++++++++++++++++
>>>>
>>>
>>
>