So what, exactly, is "open" about this? Anyone care to guess?
Roy


On 2/26/08 10:29 AM, "Chris Freeland" <[log in to unmask]> wrote:

> My guess is that, yes, the query interface we've been discussing here
> and the 'all sorts of interfaces that none of us knew about' are the
> same.  It's not documented that I'm aware of.  We've found out about it
> by literally sitting next to IA developers and asking questions.
>
> Chris
> -----Original Message-----
> From: Code for Libraries [mailto:[log in to unmask]] On Behalf Of
> Jonathan Rochkind
> Sent: Tuesday, February 26, 2008 12:18 PM
> To: [log in to unmask]
> Subject: Re: [CODE4LIB] oca api?
>
> So in answer to my question here at the Code4Lib conference, after
> Brewster's keynote, Brewster suggests there are all sorts of interfaces
> that none of us knew about. Or at least I didn't know about, and haven't
> been able to figure out in months of trying!  I'm going to try and
> corner him and ask for an email of who we should contact.
>
> Perhaps it's the XML interface that you guys know about already. Is that
> documented anywhere? How the heck did you find out about it?
>
> Jonathan
>
>
>>>> Steve Toub <[log in to unmask]> 02/25/08 9:41 PM >>>
> I'll add that when IA told me about the
> http://www.archive.org/services/search.php interface to return XML,
> they asked that we not request more than 100 records at a time, since
> doing more would adversely affect production services. Which made it
> seem like OAI-PMH was a better way to go.
>
> Chris, can you explain a bit more about what this means: "We found their
> OAI interface to pull scanned items inconsistently based on date of
> scanning...."? I'm having trouble parsing.
>
>
>    --SET
>
>
>
>
> --- Chris Freeland <[log in to unmask]> wrote:
>
>> Jonathan - No, I don't believe it's documented - at least not anywhere
>> publicly.  If any IA/OCA folks are lurking, here's an opportunity to
>> make a bunch of techies happy...
>>
>> Chris
>>
>> -----Original Message-----
>> From: Code for Libraries [mailto:[log in to unmask]] On Behalf Of
>> Jonathan Rochkind
>> Sent: Monday, February 25, 2008 2:48 PM
>> To: [log in to unmask]
>> Subject: Re: [CODE4LIB] oca api?
>>
>> I hadn't known this "custom query interface" existed! This is welcome
>> news. Is this documented anywhere?
>>
>> Jonathan
>>
>>
>>>>> Chris Freeland <[log in to unmask]> 02/25/08 2:51 PM >>>
>> Steve & Tim,
>>
>> I'm the tech director for the Biodiversity Heritage Library (BHL), which
>> is a consortium of 10 natural history libraries who have partnered with
>> Internet Archive (IA)/OCA for scanning our collections.  We've just
>> launched our revamped portal, complete with more than 7,500 books & 2.8
>> million pages scanned by IA & other digitization partners, at:
>> http://www.biodiversitylibrary.org
>>
>> To build this portal we ingest metadata from IA.  We found their OAI
>> interface to pull scanned items inconsistently based on date of
>> scanning, so we switched to using their custom query interface.  Here's
>> an example of a query we fire off:
>>
>> http://www.archive.org/services/search.php?query=collection:(biodiversity)+AND+updatedate:%5b2007-10-31+TO+2007-11-30%5d+AND+-contributor:(MBLWHOI%20Library)&limit=10&submit=submit
>>
>> This is returning scanned items from the "biodiversity" collection,
>> updated between 10/31/2007 - 11/30/2007, restricted to one of our
>> contributing libraries (MBLWHOI Library), and limited to 10 results.
>>
>> The results are styled in the browser; view source to see the good
>> stuff.  We use this list to grab the identifiers we've yet to ingest.
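For anyone who wants to script this, here is a minimal Python sketch that builds a query URL of the same shape as the example above. The function name and parameters are made up for illustration; the field names (`collection`, `updatedate`) and the `limit`/`submit` arguments are copied from the example URL only, since the interface is undocumented. The contributor clause is left out on purpose; add it in whatever form the endpoint actually expects.

```python
from urllib.parse import quote_plus

# Endpoint copied from the example URL in this thread.
SEARCH_ENDPOINT = "http://www.archive.org/services/search.php"


def build_search_url(collection, start_date, end_date, limit=10):
    """Build a search.php URL shaped like the example query (hypothetical helper)."""
    query = (
        f"collection:({collection}) "
        f"AND updatedate:[{start_date} TO {end_date}]"
    )
    # Spaces become '+', brackets are percent-encoded; ':' and parentheses
    # are left readable, as they are in the example URL.
    encoded = quote_plus(query, safe=":()")
    return f"{SEARCH_ENDPOINT}?query={encoded}&limit={limit}&submit=submit"


if __name__ == "__main__":
    print(build_search_url("biodiversity", "2007-10-31", "2007-11-30"))
```

Dates are passed as plain YYYY-MM-DD strings, matching the example; nothing here checks that the endpoint accepts other formats.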
>>
>> Some background: When a book is scanned through IA/OCA scanning, they
>> create their own unique identifier (like "annalesacademiae21univ") and
>> grab a MARC record from the contributing library's catalog.  All of the
>> scanned files, derivatives, and metadata files are stored on IA's
>> clusters in a directory named with the identifier.
>>
>> Steve mentioned using their /details/ directive, then sniffing the page
>> to get the cluster location and the files for downloading.  An easier
>> method is to use their /download/ directive, as in:
>>
>> http://www.archive.org/download/$ID, or in the example above:
>> http://www.archive.org/download/annalesacademiae21univ
>>
>> That automatically does a lookup on the cluster, which means you don't
>> have to scrape info off pages.  You can also address any files within
>> that directory, as in:
>>
>> http://www.archive.org/download/annalesacademiae21univ/annalesacademiae21univ_marc.xml
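A small Python sketch of the /download/ addressing described above, in case it saves someone some typing. The helper names are hypothetical, and the `<identifier>_marc.xml` naming pattern is an assumption drawn from the single example in this message; other items may not follow it.

```python
from urllib.parse import quote

# Base path taken from the /download/ examples in this thread.
DOWNLOAD_BASE = "http://www.archive.org/download"


def download_url(identifier, filename=None):
    """Return the /download/ URL for an item, or for one file inside it."""
    url = f"{DOWNLOAD_BASE}/{quote(identifier)}"
    if filename is not None:
        url += "/" + quote(filename)
    return url


def marc_xml_url(identifier):
    """Return the likely MARC XML URL (naming pattern assumed from one example)."""
    return download_url(identifier, f"{identifier}_marc.xml")


if __name__ == "__main__":
    print(marc_xml_url("annalesacademiae21univ"))
```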
>>
>> The only way to get standard identifiers (ISBN, ISSN, OCLC, LCCN) for
>> these scanned books is to grab them out of the MARC record.  So the
>> long-winded answer to your question, Tim, is no, there's no simple way
>> to crossref what IA has scanned with your catalog - THAT I KNOW OF.  Big
>> caveat on that last part.
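To make "grab them out of the MARC record" concrete, here is a rough Python sketch that pulls the usual standard-number fields out of a MARCXML string. It assumes the file uses the MARC21 slim namespace and that the numbers sit in the conventional fields (010 for LCCN, 020 for ISBN, 022 for ISSN, 035 for system control numbers, where OCLC numbers commonly appear with an "(OCoLC)" prefix). Nothing in this thread confirms how IA's MARC files are actually laid out, so treat the field choices as assumptions.

```python
import xml.etree.ElementTree as ET

# Assumption: records use the standard MARC21 slim namespace.
MARC_NS = {"marc": "http://www.loc.gov/MARC21/slim"}

# Conventional MARC 21 tags for standard numbers (subfield $a in each case).
ID_FIELDS = {
    "lccn": "010",
    "isbn": "020",
    "issn": "022",
    "system_control": "035",  # OCLC numbers often appear here
}


def standard_identifiers(marcxml_text):
    """Collect subfield $a values for each identifier field in a MARCXML string."""
    root = ET.fromstring(marcxml_text)
    found = {name: [] for name in ID_FIELDS}
    for name, tag in ID_FIELDS.items():
        path = f".//marc:datafield[@tag='{tag}']/marc:subfield[@code='a']"
        for subfield in root.findall(path, MARC_NS):
            if subfield.text and subfield.text.strip():
                found[name].append(subfield.text.strip())
    return found
```

The values come back as raw strings; normalizing them (stripping qualifiers from ISBNs, removing the OCLC prefix) is left to whatever matching you do against your own catalog.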
>>
>> Happy to help with any other questions I can,
>>
>> Chris Freeland
>>
>>
>> -----Original Message-----
>> From: Code for Libraries [mailto:[log in to unmask]] On Behalf Of
>> Steve Toub
>> Sent: Sunday, February 24, 2008 11:20 PM
>> To: [log in to unmask]
>> Subject: Re: [CODE4LIB] oca api?
>>
>> --- Tim Shearer <[log in to unmask]> wrote:
>>
>>> Hi Folks,
>>>
>>> I'm looking into tapping the texts in the Open Content Alliance.
>>>
>>> A few questions...
>>>
>>> As near as I can tell, they don't expose (perhaps even store?) any common
>>> unique identifiers (oclc number, issn, isbn, loc number).
>>
>> I poked around in this world a few months ago in my previous job at
>> California Digital Library,
>> also an OCA partner.
>>
>> The unique key seems to be a text string identifier (one that seems to be
>> completely different from the text string identifier in Open Library).
>> Apparently there was talk at the last partner meeting about moving to ISBNs:
>>
>> http://dilettantes.code4lib.org/2007/10/22/tales-from-the-open-content-alliance/
>>
>> To obtain identifiers in bulk, I think the recommended approach is the
>> OAI-PMH interface, which seems more reliable in recent months:
>>
>> http://www.archive.org/services/oai.php?verb=Identify
>>
>> http://www.archive.org/services/oai.php?verb=ListIdentifiers&metadataPrefix=oai_dc&set=collection:cdl
>>
>> etc.
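A hedged Python sketch of how a harvester might work with the ListIdentifiers request above: one function builds the request URL, the other pulls identifiers and the resumption token out of a response. The function names are made up. The XML namespace and the rule that `resumptionToken` is sent on its own come from the OAI-PMH 2.0 protocol; whether IA's endpoint actually pages its results this way is an assumption.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Endpoint copied from the URLs in this thread.
OAI_ENDPOINT = "http://www.archive.org/services/oai.php"
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def list_identifiers_url(set_spec=None, metadata_prefix="oai_dc",
                         resumption_token=None):
    """Build a ListIdentifiers request URL (hypothetical helper)."""
    if resumption_token:
        # In OAI-PMH, resumptionToken is an exclusive argument.
        params = {"verb": "ListIdentifiers",
                  "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListIdentifiers",
                  "metadataPrefix": metadata_prefix}
        if set_spec:
            params["set"] = set_spec
    # Keep ':' unescaped so the URL looks like the one in this thread.
    return OAI_ENDPOINT + "?" + urlencode(params, safe=":")


def parse_list_identifiers(xml_text):
    """Return (identifiers, next_token) from a ListIdentifiers response."""
    root = ET.fromstring(xml_text)
    identifiers = [
        el.text.strip()
        for el in root.findall(".//oai:header/oai:identifier", OAI_NS)
        if el.text
    ]
    token_el = root.find(".//oai:resumptionToken", OAI_NS)
    has_token = token_el is not None and token_el.text
    next_token = token_el.text.strip() if has_token else None
    return identifiers, next_token
```

A harvesting loop would fetch the first URL, parse it, and keep requesting with the returned token until `next_token` comes back as `None`.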
>>
>>
>> Additional instructions if you want to grab the content files.
>>
>> From any book's metadata page (e.g.,
>> http://www.archive.org/details/chemicallecturee00newtrich)
>> click through on the "Usage Rights: See Terms" link; the rights are on a
>> pane on the left-hand side.
>>
>> Once you know the identifier, you can grab the content files, using this
>> syntax:
>>     http://www.archive.org/details/$ID
>> Like so:
>>     http://www.archive.org/details/chemicallecturee00newtrich
>>
>> And then sniff the page to find the FTP link:
>>     ftp://ia340915.us.archive.org/2/items/chemicallecturee00newtrich
>>
>> But I think they prefer to use HTTP for these, not the FTP, so switch
>> this to:
>>     http://ia340915.us.archive.org/2/items/chemicallecturee00newtrich
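The sniff-and-switch step above could be sketched in Python like this. The regular expression is a guess at how the FTP link appears in the details page HTML, and both function names are hypothetical; adjust the pattern to whatever the page really contains.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Assumption: the FTP link appears as a plain ftp:// URL in the page source.
FTP_LINK = re.compile(r"ftp://[^\s\"'<>]+")


def find_ftp_link(html):
    """Return the first ftp:// URL found in the page source, or None."""
    match = FTP_LINK.search(html)
    return match.group(0) if match else None


def ftp_to_http(url):
    """Swap an ftp:// scheme for http://, leaving host and path untouched."""
    parts = urlsplit(url)
    if parts.scheme != "ftp":
        return url
    return urlunsplit(parts._replace(scheme="http"))
```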
>>
>> Hope this helps!
>>
>>   --SET
>>
>>
>>> We're a contributor so I can use curl to grab our records via http (and
>>> regexp my way to our local catalog identifiers, which they do
>>> store/expose).
>>>
>>> I've played a bit with the z39.50 interface at indexdata
>>> (http://www.indexdata.dk/opencontent/), but I'm not confident about the
>>> content behind it.  I get very limited results, for instance I can't find
>>> any UNC records and we're fairly new to the game.
>>>
>>> Again, I'm looking for unique identifiers in what I can get back and it's
>>> slim pickings.
>>>
>>> Anyone cracked this nut?  Got any life lessons for me?
>>>
>>> Thanks!
>>> Tim
>>>
>>> +++++++++++++++++++++++++++++++++++++++++++
>>> Tim Shearer
>>>
>>> Web Development Coordinator
>>> The University Library
>>> University of North Carolina at Chapel Hill
>>> [log in to unmask]
>>> 919-962-1288
>>> +++++++++++++++++++++++++++++++++++++++++++
>>>
>>
