Because the IA hasn't devoted resources to documenting this stuff, I
guess.  If they actually want their stuff to be used by folks like us,
then seems to me resources devoted to such would be resources well spent.

Jonathan

K.G. Schneider wrote:
> But why are there hurdles?
>
> Karen G. Schneider
>
> On Wed, 27 Feb 2008 07:29:57 -0600, "Chris Freeland"
> <[log in to unmask]> said:
>
>> Roy, do you have an answer in mind?
>>
>> To me & my project it's the content that is open, which is why it's worth
>> the hurdles.  Once you 'crack the nut' you can grab metadata, scans, and
>> derivatives and ingest, parse, recombine, remix...as we've done for BHL.
>>
>> Access to OCA content may not be standards-based, but it works.
>>
>> Chris
>>
>> -----Original Message-----
>> From: "Roy Tennant" <[log in to unmask]>
>> To: "[log in to unmask]" <[log in to unmask]>
>> Sent: 2/27/2008 5:28 AM
>> Subject: Re: [CODE4LIB] oca api?
>>
>> So what, exactly, is "open" about this? Anyone care to guess?
>> Roy
>>
>>
>> On 2/26/08 10:29 AM, "Chris Freeland" <[log in to unmask]> wrote:
>>
>>
>>> My guess is that, yes, the query interface we've been discussing here
>>> and the 'all sorts of interfaces that none of us knew about' are the
>>> same.  It's not documented that I'm aware of.  We've found out about it
>>> by literally sitting next to IA developers and asking questions.
>>>
>>> Chris
>>> -----Original Message-----
>>> From: Code for Libraries [mailto:[log in to unmask]] On Behalf Of
>>> Jonathan Rochkind
>>> Sent: Tuesday, February 26, 2008 12:18 PM
>>> To: [log in to unmask]
>>> Subject: Re: [CODE4LIB] oca api?
>>>
>>> So in answer to my question here at the Code4Lib conference, after
>>> Brewster's keynote, Brewster suggests there are all sorts of interfaces
>>> that none of us knew about. Or at least I didn't know about, and haven't
>>> been able to figure out in months of trying!  I'm going to try and
>>> corner him and ask for an email of who we should contact.
>>>
>>> Perhaps it's the XML interface that you guys know about already. Is that
>>> documented anywhere? How the heck did you find out about it?
>>>
>>> Jonathan
>>>
>>>
>>>
>>>>>> Steve Toub <[log in to unmask]> 02/25/08 9:41 PM >>>
>>>>>>
>>> I'll add that when IA told me about the
>>> http://www.archive.org/services/search.php interface to return
>>> XML, they asked that we not send more than 100 records at a time, since
>>> doing more would adversely affect production services. Which made it
>>> seem like OAI-PMH was a better way to go.
>>>
>>> Chris, can you explain a bit more about what this means: "We found their
>>> OAI interface to pull scanned items inconsistently based on date of
>>> scanning...."? I'm having trouble parsing.
>>>
>>>
>>>    --SET
>>>
>>>
>>>
>>>
>>> --- Chris Freeland <[log in to unmask]> wrote:
>>>
>>>
>>>> Jonathan - No, I don't believe it's documented - at least not anywhere
>>>> publicly.  If any IA/OCA folks are lurking, here's an opportunity to
>>>> make a bunch of techies happy...
>>>>
>>>> Chris
>>>>
>>>> -----Original Message-----
>>>> From: Code for Libraries [mailto:[log in to unmask]] On Behalf Of
>>>> Jonathan Rochkind
>>>> Sent: Monday, February 25, 2008 2:48 PM
>>>> To: [log in to unmask]
>>>> Subject: Re: [CODE4LIB] oca api?
>>>>
>>>> I hadn't known this "custom query interface" existed! This is welcome
>>>> news. Is this documented anywhere?
>>>>
>>>> Jonathan
>>>>
>>>>
>>>>
>>>>>>> Chris Freeland <[log in to unmask]> 02/25/08 2:51 PM >>>
>>>>>>>
>>>> Steve & Tim,
>>>>
>>>> I'm the tech director for the Biodiversity Heritage Library (BHL), which
>>>> is a consortium of 10 natural history libraries who have partnered with
>>>> Internet Archive (IA)/OCA for scanning our collections.  We've just
>>>> launched our revamped portal, complete with more than 7,500 books & 2.8
>>>> million pages scanned by IA & other digitization partners, at:
>>>> http://www.biodiversitylibrary.org
>>>>
>>>> To build this portal we ingest metadata from IA.  We found their OAI
>>>> interface to pull scanned items inconsistently based on date of
>>>> scanning, so we switched to using their custom query interface.  Here's
>>>> an example of a query we fire off:
>>>>
>>>> http://www.archive.org/services/search.php?query=collection:(biodiversity)+AND+updatedate:%5b2007-10-31+TO+2007-11-30%5d+AND+-contributor:(MBLWHOI%20Library)&limit=10&submit=submit
>>>>
>>>> This is returning scanned items from the "biodiversity" collection,
>>>> updated between 10/31/2007 - 11/30/2007, restricted to one of our
>>>> contributing libraries (MBLWHOI Library), and limited to 10 results.
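For anyone who wants to script the query described above, here is a minimal sketch in Python. The function name and its parameters are made up for illustration; only the endpoint and the query fields (collection, updatedate, contributor, limit, submit) come from the example URL. The leading "-" on contributor is copied verbatim from that URL; if the endpoint follows Lucene-style syntax, a leading minus normally excludes a term, so it is worth checking against the results you expect.

```python
from urllib.parse import urlencode

# Endpoint taken from the example URL in this thread.
SEARCH_URL = "http://www.archive.org/services/search.php"


def build_search_url(collection, start, end, contributor, limit=10):
    """Assemble a search.php URL shaped like the example in this thread.

    Hypothetical helper: the parameter names are illustrative and this is
    not an official IA client.
    """
    query = (
        f"collection:({collection}) "
        f"AND updatedate:[{start} TO {end}] "
        # The "-" prefix is reproduced as-is from the example query.
        f"AND -contributor:({contributor})"
    )
    params = {"query": query, "limit": limit, "submit": "submit"}
    # urlencode percent-encodes brackets, colons and parentheses, and
    # turns spaces into "+", so the URL is safe to hand to any HTTP client.
    return SEARCH_URL + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_search_url("biodiversity", "2007-10-31", "2007-11-30",
                           "MBLWHOI Library"))
```

The encoding differs cosmetically from the hand-written URL (more characters are percent-escaped), but it decodes to the same query string. Given the request reported elsewhere in this thread to stay under 100 records at a time, keeping `limit` small seems prudent.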
>>>>
>>>> The results are styled in the browser; view source to see the good
>>>> stuff.  We use this list to grab the identifiers we've yet to ingest.
>>>>
>>>> Some background: When a book is scanned through IA/OCA scanning, they
>>>> create their own unique identifier (like "annalesacademiae21univ") and
>>>> grab a MARC record from the contributing library's catalog.  All of the
>>>> scanned files, derivatives, and metadata files are stored on IA's
>>>> clusters in a directory named with the identifier.
>>>>
>>>> Steve mentioned using their /details/ directive, then sniffing the page
>>>> to get the cluster location and the files for downloading.  An easier
>>>> method is to use their /download/ directive, as in:
>>>>
>>>> http://www.archive.org/download/$ID, or in the example above:
>>>> http://www.archive.org/download/annalesacademiae21univ
>>>>
>>>> That automatically does a lookup on the cluster, which means you don't
>>>> have to scrape info off pages.  You can also address any files within
>>>> that directory, as in:
>>>>
>>>> http://www.archive.org/download/annalesacademiae21univ/annalesacademiae21univ_marc.xml
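A small sketch of building those /download/ URLs from an identifier, in Python. The helper names are hypothetical, and the "<identifier>_marc.xml" file name pattern is an assumption inferred from the single example above; other items may name their files differently.

```python
from urllib.parse import quote

# Base path taken from the /download/ examples in this thread.
DOWNLOAD_BASE = "http://www.archive.org/download"


def download_url(identifier, filename=None):
    """Return the /download/ URL for an item, or for one file inside it."""
    url = f"{DOWNLOAD_BASE}/{quote(identifier, safe='')}"
    if filename:
        url += "/" + quote(filename)
    return url


def marc_xml_url(identifier):
    """Guess the MARC XML URL for an item.

    Assumption: the file is named <identifier>_marc.xml, as in the one
    example quoted in this thread.
    """
    return download_url(identifier, f"{identifier}_marc.xml")


if __name__ == "__main__":
    print(download_url("annalesacademiae21univ"))
    print(marc_xml_url("annalesacademiae21univ"))
```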
>>>>
>>>> The only way to get standard identifiers (ISBN, ISSN, OCLC, LCCN) for
>>>> these scanned books is to grab them out of the MARC record.  So the
>>>> long-winded answer to your question, Tim, is no, there's no simple way
>>>> to crossref what IA has scanned with your catalog - THAT I KNOW OF.  Big
>>>> caveat on that last part.
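As a rough illustration of pulling standard identifiers out of a MARC record, here is a Python sketch that reads MARCXML with the standard library. It relies on the usual MARC 21 tags (010 for LCCN, 020 for ISBN, 022 for ISSN, and 035 carrying an "(OCoLC)" prefix for OCLC numbers). The sample record is invented for the example and is not real IA data, and whether a given IA record actually populates these fields is something to verify per collection.

```python
import xml.etree.ElementTree as ET

# Standard MARC 21 tags for the identifiers discussed in this thread.
FIELD_NAMES = {"010": "lccn", "020": "isbn", "022": "issn", "035": "oclc"}
OCLC_PREFIX = "(OCoLC)"

# Hypothetical record for illustration only; values are made up.
SAMPLE = """<record xmlns="http://www.loc.gov/MARC21/slim">
  <datafield tag="010" ind1=" " ind2=" "><subfield code="a">   12345678 </subfield></datafield>
  <datafield tag="020" ind1=" " ind2=" "><subfield code="a">0000000000</subfield></datafield>
  <datafield tag="035" ind1=" " ind2=" "><subfield code="a">(OCoLC)1234567</subfield></datafield>
  <datafield tag="035" ind1=" " ind2=" "><subfield code="a">(LOCAL)999</subfield></datafield>
  <datafield tag="245" ind1="1" ind2="0"><subfield code="a">Example title</subfield></datafield>
</record>"""


def _local(tag):
    """Strip any XML namespace so the code works with or without one."""
    return tag.rsplit("}", 1)[-1]


def standard_identifiers(marc_xml):
    """Collect subfield $a values from the identifier fields of a record."""
    found = {}
    for field in ET.fromstring(marc_xml).iter():
        tag = field.get("tag")
        if _local(field.tag) != "datafield" or tag not in FIELD_NAMES:
            continue
        for sub in field:
            if _local(sub.tag) != "subfield" or sub.get("code") != "a" or not sub.text:
                continue
            value = sub.text.strip()
            if tag == "035":
                # 035 holds many system numbers; keep only OCLC ones.
                if not value.startswith(OCLC_PREFIX):
                    continue
                value = value[len(OCLC_PREFIX):]
            found.setdefault(FIELD_NAMES[tag], []).append(value)
    return found


if __name__ == "__main__":
    print(standard_identifiers(SAMPLE))
```

Matching on local tag names rather than a fixed namespace is a deliberate hedge, since the exact XML flavour of the served file is not stated in this thread.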
>>>>
>>>> Happy to help with any other questions I can,
>>>>
>>>> Chris Freeland
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: Code for Libraries [mailto:[log in to unmask]] On Behalf Of
>>>> Steve Toub
>>>> Sent: Sunday, February 24, 2008 11:20 PM
>>>> To: [log in to unmask]
>>>> Subject: Re: [CODE4LIB] oca api?
>>>>
>>>> --- Tim Shearer <[log in to unmask]> wrote:
>>>>
>>>>
>>>>> Hi Folks,
>>>>>
>>>>> I'm looking into tapping the texts in the Open Content Alliance.
>>>>>
>>>>> A few questions...
>>>>>
>>>>> As near as I can tell, they don't expose (perhaps even store?) any common
>>>>> unique identifiers (oclc number, issn, isbn, loc number).
>>>>>
>>>> I poked around in this world a few months ago in my previous job at
>>>> California Digital Library,
>>>> also an OCA partner.
>>>>
>>>> The unique key seems to be a text string identifier (one that seems to be
>>>> completely different from the text string identifier in Open Library).
>>>> Apparently there was talk at the last partner meeting about moving to
>>>> ISBNs:
>>>>
>>>> http://dilettantes.code4lib.org/2007/10/22/tales-from-the-open-content-alliance/
>>>>
>>>> To obtain identifiers in bulk, I think the recommended approach is the
>>>> OAI-PMH interface, which
>>>> seems more reliable in recent months:
>>>>
>>>> http://www.archive.org/services/oai.php?verb=Identify
>>>>
>>>> http://www.archive.org/services/oai.php?verb=ListIdentifiers&metadataPrefix=oai_dc&set=collection:cdl
>>>>
>>>> etc.
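A sketch of how the ListIdentifiers request and its response could be handled in Python, for anyone harvesting in bulk. The endpoint, verb, metadataPrefix and set value come from the URLs above; the function names, the sample response, the identifier format in it and the token value are all hypothetical. The resumptionToken handling follows the OAI-PMH 2.0 rule that a follow-up request carries only the verb and the token.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Endpoint taken from the URLs quoted in this thread.
OAI_URL = "http://www.archive.org/services/oai.php"
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Hypothetical response for illustration; identifiers and token are made up.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListIdentifiers>
    <header><identifier>oai:archive.org:chemicallecturee00newtrich</identifier></header>
    <header><identifier>oai:archive.org:annalesacademiae21univ</identifier></header>
    <resumptionToken>hypothetical-token</resumptionToken>
  </ListIdentifiers>
</OAI-PMH>"""


def list_identifiers_url(set_spec=None, resumption_token=None):
    """Build a ListIdentifiers URL, either a first request or a follow-up."""
    if resumption_token:
        # Per OAI-PMH, resumptionToken is an exclusive argument.
        params = {"verb": "ListIdentifiers", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListIdentifiers", "metadataPrefix": "oai_dc"}
        if set_spec:
            params["set"] = set_spec
    return OAI_URL + "?" + urlencode(params)


def parse_list_identifiers(xml_text):
    """Return (identifiers, next_token); next_token is None on the last page."""
    root = ET.fromstring(xml_text)
    ids = [el.text.strip()
           for el in root.iterfind(".//oai:header/oai:identifier", OAI_NS)
           if el.text]
    token = root.find(".//oai:resumptionToken", OAI_NS)
    next_token = token.text.strip() if token is not None and token.text else None
    return ids, next_token


if __name__ == "__main__":
    print(list_identifiers_url("collection:cdl"))
    print(parse_list_identifiers(SAMPLE_RESPONSE))
```

A harvesting loop would fetch the first URL, parse the response, and keep requesting `list_identifiers_url(resumption_token=...)` until no token comes back.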
>>>>
>>>>
>>>> Additional instructions if you want to grab the content files.
>>>>
>>>> From any book's metadata page (e.g.,
>>>> http://www.archive.org/details/chemicallecturee00newtrich)
>>>> click through on the "Usage Rights: See Terms" link; the rights are on a
>>>> pane on the left-hand side.
>>>>
>>>> Once you know the identifier, you can grab the content files, using this
>>>> syntax:
>>>>     http://www.archive.org/details/$ID
>>>> Like so:
>>>>     http://www.archive.org/details/chemicallecturee00newtrich
>>>>
>>>> And then sniff the page to find the FTP link:
>>>>     ftp://ia340915.us.archive.org/2/items/chemicallecturee00newtrich
>>>>
>>>> But I think they prefer to use HTTP for these, not the FTP, so switch
>>>> this to:
>>>>     http://ia340915.us.archive.org/2/items/chemicallecturee00newtrich
>>>>
>>>> Hope this helps!
>>>>
>>>>   --SET
>>>>
>>>>
>>>>
>>>>> We're a contributor so I can use curl to grab our records via http (and
>>>>> regexp my way to our local catalog identifiers, which they do
>>>>> store/expose).
>>>>>
>>>>> I've played a bit with the z39.50 interface at indexdata
>>>>> (http://www.indexdata.dk/opencontent/), but I'm not confident about the
>>>>> content behind it.  I get very limited results, for instance I can't find
>>>>> any UNC records and we're fairly new to the game.
>>>>>
>>>>> Again, I'm looking for unique identifiers in what I can get back and it's
>>>>> slim pickings.
>>>>>
>>>>> Anyone cracked this nut?  Got any life lessons for me?
>>>>>
>>>>> Thanks!
>>>>> Tim
>>>>>
>>>>> +++++++++++++++++++++++++++++++++++++++++++
>>>>> Tim Shearer
>>>>>
>>>>> Web Development Coordinator
>>>>> The University Library
>>>>> University of North Carolina at Chapel Hill
>>>>> [log in to unmask]
>>>>> 919-962-1288
>>>>> +++++++++++++++++++++++++++++++++++++++++++
>>>>>
>>>>>
>> --
>>
>
>

--
Jonathan Rochkind
Digital Services Software Engineer
The Sheridan Libraries
Johns Hopkins University
410.516.8886
rochkind (at) jhu.edu