LISTSERV 16.5 - CODE4LIB Archives

But why are there hurdles?

Karen G. Schneider

On Wed, 27 Feb 2008 07:29:57 -0600, "Chris Freeland"
<[log in to unmask]> said:
> Roy, do you have an answer in mind?
>
> To me & my project it's the content that is open, which is why it's worth
> the hurdles.  Once you 'crack the nut' you can grab metadata, scans, and
> derivatives and ingest, parse, recombine, remix...as we've done for BHL.
>
> Access to OCA content may not be standards-based, but it works.
>
> Chris
>
> -----Original Message-----
> From: "Roy Tennant" <[log in to unmask]>
> To: "[log in to unmask]" <[log in to unmask]>
> Sent: 2/27/2008 5:28 AM
> Subject: Re: [CODE4LIB] oca api?
>
> So what, exactly, is "open" about this? Anyone care to guess?
> Roy
>
>
> On 2/26/08 10:29 AM, "Chris Freeland" <[log in to unmask]> wrote:
>
> > My guess is that, yes, the query interface we've been discussing here
> > and the 'all sorts of interfaces that none of us knew about' are the
> > same.  It's not documented that I'm aware of.  We've found out about it
> > by literally sitting next to IA developers and asking questions.
> >
> > Chris
> > -----Original Message-----
> > From: Code for Libraries [mailto:[log in to unmask]] On Behalf Of
> > Jonathan Rochkind
> > Sent: Tuesday, February 26, 2008 12:18 PM
> > To: [log in to unmask]
> > Subject: Re: [CODE4LIB] oca api?
> >
> > So in answer to my question here at the Code4Lib conference, after
> > Brewster's keynote, Brewster suggests there are all sorts of interfaces
> > that none of us knew about. Or at least I didn't know about, and haven't
> > been able to figure out in months of trying!  I'm going to try and
> > corner him and ask for an email of who we should contact.
> >
> > Perhaps it's the XML interface that you guys know about already. Is that
> > documented anywhere? How the heck did you find out about it?
> >
> > Jonathan
> >
> >
> >>>> Steve Toub <[log in to unmask]> 02/25/08 9:41 PM >>>
> > I'll add that when IA told me about the
> > http://www.archive.org/services/search.php interface to return XML,
> > they asked that we not request more than 100 records at a time, since
> > doing more would adversely affect production services. That made it
> > seem like OAI-PMH was a better way to go.
> >
> > Chris, can you explain a bit more about what this means: "We found their
> > OAI interface to pull scanned items inconsistently based on date of
> > scanning...."? I'm having trouble parsing.
> >
> >
> >    --SET
> >
> >
> >
> >
> > --- Chris Freeland <[log in to unmask]> wrote:
> >
> >> Jonathan - No, I don't believe it's documented - at least not anywhere
> >> publicly.  If any IA/OCA folks are lurking, here's an opportunity to
> >> make a bunch of techies happy...
> >>
> >> Chris
> >>
> >> -----Original Message-----
> >> From: Code for Libraries [mailto:[log in to unmask]] On Behalf Of
> >> Jonathan Rochkind
> >> Sent: Monday, February 25, 2008 2:48 PM
> >> To: [log in to unmask]
> >> Subject: Re: [CODE4LIB] oca api?
> >>
> >> I hadn't known this "custom query interface" existed! This is welcome
> >> news. Is this documented anywhere?
> >>
> >> Jonathan
> >>
> >>
> >>>>> Chris Freeland <[log in to unmask]> 02/25/08 2:51 PM >>>
> >> Steve & Tim,
> >>
> >> I'm the tech director for the Biodiversity Heritage Library (BHL), which
> >> is a consortium of 10 natural history libraries who have partnered with
> >> Internet Archive (IA)/OCA for scanning our collections.  We've just
> >> launched our revamped portal, complete with more than 7,500 books & 2.8
> >> million pages scanned by IA & other digitization partners, at:
> >> http://www.biodiversitylibrary.org
> >>
> >> To build this portal we ingest metadata from IA.  We found their OAI
> >> interface to pull scanned items inconsistently based on date of
> >> scanning, so we switched to using their custom query interface.  Here's
> >> an example of a query we fire off:
> >>
> >>
> >> http://www.archive.org/services/search.php?query=collection:(biodiversity)+AND+updatedate:%5b2007-10-31+TO+2007-11-30%5d+AND+-contributor:(MBLWHOI%20Library)&limit=10&submit=submit
> >>
> >> This is returning scanned items from the "biodiversity" collection,
> >> updated between 10/31/2007 - 11/30/2007, restricted to one of our
> >> contributing libraries (MBLWHOI Library), and limited to 10 results.
> >>
> >> The results are styled in the browser; view source to see the good
> >> stuff.  We use this list to grab the identifiers we've yet to ingest.
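
The query quoted above can be assembled programmatically rather than by hand. The following is a minimal sketch, not IA's documented API: the endpoint and the `query`, `limit`, and `submit` parameters are taken from the example URL in the message, while the function name `build_search_url`, its arguments, and the 100-record cap (borrowed from Steve's note elsewhere in this thread) are assumptions made for illustration. It only builds the URL and makes no network request.

```python
from urllib.parse import urlencode

# Endpoint copied from the example URL in the message.
SEARCH_ENDPOINT = "http://www.archive.org/services/search.php"

# Assumption: cap taken from the "no more than 100 records at a time" request
# reported in this thread, not from any published limit.
MAX_LIMIT = 100


def build_search_url(collection, start_date, end_date, limit=10, extra_clause=None):
    """Build a search.php URL for items in a collection updated in a date range.

    Hypothetical helper: the field names (collection, updatedate) come from
    the quoted example query; everything else is illustrative.
    """
    if limit > MAX_LIMIT:
        raise ValueError(f"limit must not exceed {MAX_LIMIT}")
    clauses = [
        f"collection:({collection})",
        f"updatedate:[{start_date} TO {end_date}]",
    ]
    if extra_clause:
        # Passed through verbatim, e.g. the contributor clause from the example.
        clauses.append(extra_clause)
    params = {"query": " AND ".join(clauses), "limit": limit, "submit": "submit"}
    # urlencode percent-encodes more characters than the hand-written URL does,
    # but the decoded query string is the same.
    return SEARCH_ENDPOINT + "?" + urlencode(params)


url = build_search_url(
    "biodiversity", "2007-10-31", "2007-11-30",
    limit=10, extra_clause="-contributor:(MBLWHOI Library)",
)
print(url)
```

The contributor clause is copied exactly as it appears in the message, including the leading minus sign; if the interface follows Lucene-style syntax, a leading minus normally excludes matches, so it is worth checking that against real results.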
> >>
> >> Some background: When a book is scanned through IA/OCA scanning, they
> >> create their own unique identifier (like "annalesacademiae21univ") and
> >> grab a MARC record from the contributing library's catalog.  All of the
> >> scanned files, derivatives, and metadata files are stored on IA's
> >> clusters in a directory named with the identifier.
> >>
> >> Steve mentioned using their /details/ directive, then sniffing the page
> >> to get the cluster location and the files for downloading.  An easier
> >> method is to use their /download/ directive, as in:
> >>
> >> http://www.archive.org/download/$ID, or in the example above:
> >> http://www.archive.org/download/annalesacademiae21univ
> >>
> >> That automatically does a lookup on the cluster, which means you don't
> >> have to scrape info off pages.  You can also address any files within
> >> that directory, as in:
> >>
> >> http://www.archive.org/download/annalesacademiae21univ/annalesacademiae21univ_marc.xml
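
The /download/ addressing described above reduces to simple string assembly. This is a rough sketch under stated assumptions: the base URL and the example identifier come from the message, the helper names `download_url` and `marc_xml_url` are invented for illustration, and the `<identifier>_marc.xml` naming pattern is inferred from the single example given, so it may not hold for every item.

```python
from urllib.parse import quote

# Base taken from the /download/ examples in the message.
DOWNLOAD_BASE = "http://www.archive.org/download"


def download_url(identifier, filename=None):
    """Return the /download/ URL for an item directory, or for a file in it."""
    url = f"{DOWNLOAD_BASE}/{quote(identifier, safe='')}"
    if filename:
        url += "/" + quote(filename)
    return url


def marc_xml_url(identifier):
    """Assumption: the MARC file is named <identifier>_marc.xml, as in the example."""
    return download_url(identifier, f"{identifier}_marc.xml")


print(download_url("annalesacademiae21univ"))
print(marc_xml_url("annalesacademiae21univ"))
```

Because the server does the cluster lookup, a caller only needs the identifier, which is what the identifier list from the query interface supplies.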
> >>
> >> The only way to get standard identifiers (ISBN, ISSN, OCLC, LCCN) for
> >> these scanned books is to grab them out of the MARC record.  So the
> >> long-winded answer to your question, Tim, is no, there's no simple way
> >> to crossref what IA has scanned with your catalog - THAT I KNOW OF.  Big
> >> caveat on that last part.
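
Pulling standard identifiers out of a MARC record, as described above, could look roughly like the following. This sketch assumes the `_marc.xml` file is MARCXML in the Library of Congress "slim" namespace, which the thread does not confirm. The field mapping uses standard MARC 21 tags (010 for LCCN, 020 for ISBN, 022 for ISSN, and 035 with an "(OCoLC)" prefix for OCLC numbers); the function name and the sample record, including its identifier values, are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Assumption: the record uses the MARCXML "slim" namespace.
MARC_NS = "{http://www.loc.gov/MARC21/slim}"

# Standard MARC 21 tags for the identifiers mentioned in the message.
ID_FIELDS = {"010": "lccn", "020": "isbn", "022": "issn", "035": "system_control"}

# Made-up record for demonstration only; the numbers are placeholders.
SAMPLE = """<record xmlns="http://www.loc.gov/MARC21/slim">
  <datafield tag="010" ind1=" " ind2=" "><subfield code="a">  12345678 </subfield></datafield>
  <datafield tag="035" ind1=" " ind2=" "><subfield code="a">(OCoLC)00000001</subfield></datafield>
  <datafield tag="245" ind1="1" ind2="0"><subfield code="a">Example title</subfield></datafield>
</record>"""


def standard_identifiers(marc_xml):
    """Collect subfield $a values for LCCN, ISBN, ISSN, and OCLC numbers."""
    root = ET.fromstring(marc_xml)
    found = {name: [] for name in ID_FIELDS.values()}
    for field in root.iter(MARC_NS + "datafield"):
        name = ID_FIELDS.get(field.get("tag"))
        if name is None:
            continue
        for sub in field.findall(MARC_NS + "subfield"):
            if sub.get("code") == "a" and sub.text:
                found[name].append(sub.text.strip())
    # 035 holds system control numbers from many sources; keep only OCLC ones.
    prefix = "(OCoLC)"
    found["oclc"] = [
        value[len(prefix):].strip()
        for value in found.pop("system_control")
        if value.startswith(prefix)
    ]
    return found


print(standard_identifiers(SAMPLE))
```

Records without any of these fields simply yield empty lists, which matches the caveat above that not every scanned book will cross-reference cleanly.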
> >>
> >> Happy to help with any other questions I can,
> >>
> >> Chris Freeland
> >>
> >>
> >> -----Original Message-----
> >> From: Code for Libraries [mailto:[log in to unmask]] On Behalf Of
> >> Steve Toub
> >> Sent: Sunday, February 24, 2008 11:20 PM
> >> To: [log in to unmask]
> >> Subject: Re: [CODE4LIB] oca api?
> >>
> >> --- Tim Shearer <[log in to unmask]> wrote:
> >>
> >>> Hi Folks,
> >>>
> >>> I'm looking into tapping the texts in the Open Content Alliance.
> >>>
> >>> A few questions...
> >>>
> >>> As near as I can tell, they don't expose (perhaps even store?) any
> >>> common unique identifiers (oclc number, issn, isbn, loc number).
> >>
> >> I poked around in this world a few months ago in my previous job at
> >> California Digital Library, also an OCA partner.
> >>
> >> The unique key seems to be a text string identifier (one that seems to
> >> be completely different from the text string identifier in Open
> >> Library). Apparently there was talk at the last partner meeting about
> >> moving to ISBNs:
> >>
> >> http://dilettantes.code4lib.org/2007/10/22/tales-from-the-open-content-alliance/
> >>
> >> To obtain identifiers in bulk, I think the recommended approach is the
> >> OAI-PMH interface, which seems more reliable in recent months:
> >>
> >> http://www.archive.org/services/oai.php?verb=Identify
> >>
> >>
> >> http://www.archive.org/services/oai.php?verb=ListIdentifiers&metadataPrefix=oai_dc&set=collection:cdl
> >>
> >> etc.
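
The bulk-harvest approach above can be sketched as a URL builder plus a response parser. The endpoint, verb, `metadataPrefix`, and `set` values come from the URLs in the message; the `resumptionToken` handling and the response namespace follow the OAI-PMH 2.0 protocol in general and have not been checked against IA's implementation. The function names are hypothetical, and nothing here contacts the server.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Endpoint copied from the URLs in the message.
OAI_ENDPOINT = "http://www.archive.org/services/oai.php"
# Namespace defined by the OAI-PMH 2.0 protocol.
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def list_identifiers_url(set_spec=None, metadata_prefix="oai_dc", resumption_token=None):
    """Build a ListIdentifiers request URL."""
    if resumption_token:
        # In OAI-PMH, resumptionToken is exclusive: no other arguments but verb.
        params = {"verb": "ListIdentifiers", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListIdentifiers", "metadataPrefix": metadata_prefix}
        if set_spec:
            params["set"] = set_spec
    return OAI_ENDPOINT + "?" + urlencode(params)


def parse_list_identifiers(xml_text):
    """Return (identifiers, resumption_token or None) from one response page."""
    root = ET.fromstring(xml_text)
    identifiers = [
        el.text.strip() for el in root.iter(OAI_NS + "identifier") if el.text
    ]
    token_el = root.find(f".//{OAI_NS}resumptionToken")
    has_token = token_el is not None and token_el.text and token_el.text.strip()
    return identifiers, (token_el.text.strip() if has_token else None)


print(list_identifiers_url("collection:cdl"))
```

A harvester would fetch the first URL, parse the page, and keep requesting with the returned token until none comes back, pausing between requests to avoid loading production services.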
> >>
> >>
> >> Additional instructions if you want to grab the content files.
> >>
> >> From any book's metadata page (e.g.,
> >> http://www.archive.org/details/chemicallecturee00newtrich)
> >> click through on the "Usage Rights: See Terms" link; the rights are on a
> >> pane on the left-hand side.
> >>
> >> Once you know the identifier, you can grab the content files, using this
> >> syntax:
> >>     http://www.archive.org/details/$ID
> >> Like so:
> >>     http://www.archive.org/details/chemicallecturee00newtrich
> >>
> >> And then sniff the page to find the FTP link:
> >>     ftp://ia340915.us.archive.org/2/items/chemicallecturee00newtrich
> >>
> >> But I think they prefer that you use HTTP for these, not FTP, so switch
> >> this to:
> >>     http://ia340915.us.archive.org/2/items/chemicallecturee00newtrich
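
The "sniff the page, then switch FTP to HTTP" steps above might be sketched like this. Both helpers are hypothetical: the regular expression simply grabs the first ftp:// link that appears in the page text, which is an assumption about how the details page exposes it, and the HTML snippet in the usage lines is invented. Only the scheme is rewritten; host and path are left untouched, matching the example in the message.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Assumption: the details page contains the item's FTP link as plain text
# or as an href value; this pattern takes the first ftp:// URL it finds.
FTP_LINK = re.compile(r"ftp://[^\s\"'<>]+")


def find_ftp_link(details_html):
    """Return the first ftp:// URL in the page text, or None if there is none."""
    match = FTP_LINK.search(details_html)
    return match.group(0) if match else None


def ftp_to_http(url):
    """Swap the ftp scheme for http, keeping host and path unchanged."""
    parts = urlsplit(url)
    if parts.scheme != "ftp":
        raise ValueError(f"expected an ftp:// URL, got {url!r}")
    return urlunsplit(parts._replace(scheme="http"))


# Invented page fragment for demonstration.
page = '<a href="ftp://ia340915.us.archive.org/2/items/chemicallecturee00newtrich">FTP</a>'
link = find_ftp_link(page)
print(ftp_to_http(link))
```

Chris's reply earlier in the thread notes that the /download/ directive avoids this scraping step entirely, so this is mainly useful when the cluster host itself is needed.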
> >>
> >> Hope this helps!
> >>
> >>   --SET
> >>
> >>
> >>> We're a contributor so I can use curl to grab our records via http (and
> >>> regexp my way to our local catalog identifiers, which they do
> >>> store/expose).
> >>>
> >>> I've played a bit with the z39.50 interface at indexdata
> >>> (http://www.indexdata.dk/opencontent/), but I'm not confident about the
> >>> content behind it.  I get very limited results, for instance I can't
> >>> find any UNC records and we're fairly new to the game.
> >>>
> >>> Again, I'm looking for unique identifiers in what I can get back and
> >>> it's slim pickings.
> >>>
> >>> Anyone cracked this nut?  Got any life lessons for me?
> >>>
> >>> Thanks!
> >>> Tim
> >>>
> >>> +++++++++++++++++++++++++++++++++++++++++++
> >>> Tim Shearer
> >>>
> >>> Web Development Coordinator
> >>> The University Library
> >>> University of North Carolina at Chapel Hill
> >>> [log in to unmask]
> >>> 919-962-1288
> >>> +++++++++++++++++++++++++++++++++++++++++++
> >>>
> >>
>
> --