LISTSERV 16.5 - CODE4LIB Archives

It is the same interface Chris described. I had emailed with Brewster directly to learn about it.

In that email exchange I got the sense that OAI-PMH was the better option, and my questions about a
staging instance went unanswered. But standing here when Jonathan cornered Brewster, I got the sense
he prefers the query interface. He didn't give concrete guidance on how many queries are too many,
but he was conscious of performance.
   --SET





--- Chris Freeland <[log in to unmask]> wrote:

> My guess is that, yes, the query interface we've been discussing here
> and the 'all sorts of interfaces that none of us knew about' are the
> same.  It's not documented that I'm aware of.  We've found out about it
> by literally sitting next to IA developers and asking questions.
>
> Chris
> -----Original Message-----
> From: Code for Libraries [mailto:[log in to unmask]] On Behalf Of
> Jonathan Rochkind
> Sent: Tuesday, February 26, 2008 12:18 PM
> To: [log in to unmask]
> Subject: Re: [CODE4LIB] oca api?
>
> So in answer to my question here at the Code4Lib conference, after
> Brewster's keynote, Brewster suggests there are all sorts of interfaces
> that none of us knew about. Or at least I didn't know about, and haven't
> been able to figure out in months of trying!  I'm going to try and
> corner him and ask for an email of who we should contact.
>
> Perhaps it's the XML interface that you guys know about already. Is that
> documented anywhere? How the heck did you find out about it?
>
> Jonathan
>
>
> >>> Steve Toub <[log in to unmask]> 02/25/08 9:41 PM >>>
> I'll add that when IA told me about the
> http://www.archive.org/services/search.php interface to return XML,
> they asked that we not request more than 100 records at a time, since
> doing more would adversely affect production services. Which made it
> seem like OAI-PMH was a better way to go.
>
> Chris, can you explain a bit more about what this means: "We found
> their OAI interface to pull scanned items inconsistently based on date
> of scanning...."? I'm having trouble parsing.
>
>
>    --SET
>
>
>
>
> --- Chris Freeland <[log in to unmask]> wrote:
>
> > Jonathan - No, I don't believe it's documented - at least not anywhere
> > publicly.  If any IA/OCA folks are lurking, here's an opportunity to
> > make a bunch of techies happy...
> >
> > Chris
> >
> > -----Original Message-----
> > From: Code for Libraries [mailto:[log in to unmask]] On Behalf Of
> > Jonathan Rochkind
> > Sent: Monday, February 25, 2008 2:48 PM
> > To: [log in to unmask]
> > Subject: Re: [CODE4LIB] oca api?
> >
> > I hadn't known this "custom query interface" existed! This is welcome
> > news. Is this documented anywhere?
> >
> > Jonathan
> >
> >
> > >>> Chris Freeland <[log in to unmask]> 02/25/08 2:51 PM >>>
> > Steve & Tim,
> >
> > I'm the tech director for the Biodiversity Heritage Library (BHL),
> > which is a consortium of 10 natural history libraries that have
> > partnered with Internet Archive (IA)/OCA for scanning our
> > collections.  We've just launched our revamped portal, complete with
> > more than 7,500 books & 2.8 million pages scanned by IA & other
> > digitization partners, at:
> > http://www.biodiversitylibrary.org
> >
> > To build this portal we ingest metadata from IA.  We found their OAI
> > interface to pull scanned items inconsistently based on date of
> > scanning, so we switched to using their custom query interface.
> > Here's an example of a query we fire off:
> >
> >
> > http://www.archive.org/services/search.php?query=collection:(biodiversity)+AND+updatedate:%5b2007-10-31+TO+2007-11-30%5d+AND+-contributor:(MBLWHOI%20Library)&limit=10&submit=submit
> >
> > This is returning scanned items from the "biodiversity" collection,
> > updated between 10/31/2007 - 11/30/2007, restricted to one of our
> > contributing libraries (MBLWHOI Library), and limited to 10 results.
> >
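
For anyone who wants to script the query above, here is a minimal sketch of assembling that URL in Python. The function name and its parameters are made up for illustration; the query clauses (including the leading "-" on contributor) are copied as-is from Chris's example, and since the interface is undocumented nothing here is confirmed beyond what that one URL shows.

```python
from urllib.parse import quote_plus

SEARCH_URL = "http://www.archive.org/services/search.php"

def build_search_url(collection, start, end, contributor, limit=10):
    """Assemble a search.php URL shaped like the example in this thread."""
    # Clauses mirror the example query verbatim; field names are not documented.
    query = (
        f"collection:({collection}) AND "
        f"updatedate:[{start} TO {end}] AND "
        f"-contributor:({contributor})"
    )
    # quote_plus turns spaces into "+" and brackets into %5B / %5D.
    encoded = quote_plus(query, safe=":()")
    return f"{SEARCH_URL}?query={encoded}&limit={limit}&submit=submit"

url = build_search_url("biodiversity", "2007-10-31", "2007-11-30", "MBLWHOI Library")
print(url)
```

Keeping the limit small seems wise given the request not to pull more than 100 records at a time.
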
> > The results are styled in the browser; view source to see the good
> > stuff.  We use this list to grab the identifiers we've yet to ingest.
> >
> > Some background: When a book is scanned through IA/OCA scanning, they
> > create their own unique identifier (like "annalesacademiae21univ") and
> > grab a MARC record from the contributing library's catalog.  All of
> > the scanned files, derivatives, and metadata files are stored on IA's
> > clusters in a directory named with the identifier.
> >
> > Steve mentioned using their /details/ directive, then sniffing the
> > page to get the cluster location and the files for downloading.  An
> > easier method is to use their /download/ directive, as in:
> >
> > http://www.archive.org/download/$ID, or in the example above:
> > http://www.archive.org/download/annalesacademiae21univ
> >
> > That automatically does a lookup on the cluster, which means you don't
> > have to scrape info off pages.  You can also address any files within
> > that directory, as in:
> >
> > http://www.archive.org/download/annalesacademiae21univ/annalesacademiae21univ_marc.xml
> >
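
A small sketch of the /download/ addressing just described. The helper names are hypothetical, and the "<identifier>_marc.xml" file name is assumed to generalize from the single example above; other items may not follow it.

```python
DOWNLOAD_BASE = "http://www.archive.org/download"

def download_url(identifier, filename=None):
    """Return the /download/ URL for an item, or for one file inside it."""
    url = f"{DOWNLOAD_BASE}/{identifier}"
    return f"{url}/{filename}" if filename else url

def marc_url(identifier):
    # Assumption: the MARC file is named "<identifier>_marc.xml", as in the example.
    return download_url(identifier, f"{identifier}_marc.xml")

print(download_url("annalesacademiae21univ"))
print(marc_url("annalesacademiae21univ"))
```
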
> > The only way to get standard identifiers (ISBN, ISSN, OCLC, LCCN) for
> > these scanned books is to grab them out of the MARC record.  So the
> > long-winded answer to your question, Tim, is no, there's no simple way
> > to crossref what IA has scanned with your catalog - THAT I KNOW OF.
> > Big caveat on that last part.
> >
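
To make "grab them out of the MARC record" concrete, here is a rough sketch that pulls those identifiers from a MARCXML record. It assumes the _marc.xml files use the standard MARC21 slim namespace and the conventional fields (010 LCCN, 020 ISBN, 022 ISSN, 035 with an "(OCoLC)" prefix for OCLC numbers); none of that is confirmed for IA's files, and the sample record is invented.

```python
import xml.etree.ElementTree as ET

# Assumed namespace; IA's files may differ.
MARC_NS = {"marc": "http://www.loc.gov/MARC21/slim"}

# MARC tags that conventionally carry each identifier in subfield $a.
ID_FIELDS = {"lccn": "010", "isbn": "020", "issn": "022", "oclc": "035"}

# Invented sample record, for illustration only.
SAMPLE = """<record xmlns="http://www.loc.gov/MARC21/slim">
  <datafield tag="010" ind1=" " ind2=" "><subfield code="a">   12345678 </subfield></datafield>
  <datafield tag="020" ind1=" " ind2=" "><subfield code="a">0123456789</subfield></datafield>
  <datafield tag="035" ind1=" " ind2=" "><subfield code="a">(OCoLC)00000001</subfield></datafield>
  <datafield tag="035" ind1=" " ind2=" "><subfield code="a">(Local)abc</subfield></datafield>
</record>"""

def standard_identifiers(marc_xml):
    """Collect LCCN/ISBN/ISSN/OCLC values from one MARCXML document."""
    root = ET.fromstring(marc_xml)
    found = {name: [] for name in ID_FIELDS}
    for name, tag in ID_FIELDS.items():
        path = f".//marc:datafield[@tag='{tag}']/marc:subfield[@code='a']"
        for sub in root.iterfind(path, MARC_NS):
            value = (sub.text or "").strip()
            if name == "oclc":
                # 035 holds many system numbers; keep only OCLC-prefixed ones.
                if not value.startswith("(OCoLC)"):
                    continue
                value = value[len("(OCoLC)"):]
            if value:
                found[name].append(value)
    return found

ids = standard_identifiers(SAMPLE)
print(ids)
```
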
> > Happy to help with any other questions I can,
> >
> > Chris Freeland
> >
> >
> > -----Original Message-----
> > From: Code for Libraries [mailto:[log in to unmask]] On Behalf Of
> > Steve Toub
> > Sent: Sunday, February 24, 2008 11:20 PM
> > To: [log in to unmask]
> > Subject: Re: [CODE4LIB] oca api?
> >
> > --- Tim Shearer <[log in to unmask]> wrote:
> >
> > > Hi Folks,
> > >
> > > I'm looking into tapping the texts in the Open Content Alliance.
> > >
> > > A few questions...
> > >
> > > As near as I can tell, they don't expose (perhaps even store?) any
> > > common unique identifiers (oclc number, issn, isbn, loc number).
> >
> > I poked around in this world a few months ago in my previous job at
> > California Digital Library,
> > also an OCA partner.
> >
> > The unique key seems to be a text string identifier (one that seems
> > to be completely different from the text string identifier in Open
> > Library). Apparently there was talk at the last partner meeting about
> > moving to ISBNs:
> >
> > http://dilettantes.code4lib.org/2007/10/22/tales-from-the-open-content-alliance/
> >
> > To obtain identifiers in bulk, I think the recommended approach is the
> > OAI-PMH interface, which
> > seems more reliable in recent months:
> >
> > http://www.archive.org/services/oai.php?verb=Identify
> >
> >
> > http://www.archive.org/services/oai.php?verb=ListIdentifiers&metadataPrefix=oai_dc&set=collection:cdl
> >
> > etc.
> >
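
For the bulk route, a hedged sketch of building the ListIdentifiers request and reading one page of results. The verb, the metadataPrefix/set arguments, the response namespace and resumptionToken paging come from the OAI-PMH 2.0 protocol; the sample response (including the "oai:archive.org:" identifier form and the token) is invented, and how IA's endpoint actually behaves is not verified here.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

OAI_URL = "http://www.archive.org/services/oai.php"
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Invented sample response, for illustration only.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListIdentifiers>
    <header><identifier>oai:archive.org:chemicallecturee00newtrich</identifier></header>
    <resumptionToken>hypothetical-token</resumptionToken>
  </ListIdentifiers>
</OAI-PMH>"""

def list_identifiers_url(set_spec=None, resumption_token=None):
    """Build a ListIdentifiers URL, either a first request or a continuation."""
    if resumption_token:
        # In OAI-PMH a resumptionToken is sent alone with the verb.
        params = {"verb": "ListIdentifiers", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListIdentifiers", "metadataPrefix": "oai_dc"}
        if set_spec:
            params["set"] = set_spec
    return f"{OAI_URL}?{urlencode(params, safe=':')}"

def parse_list_identifiers(xml_text):
    """Return (identifiers, resumption token or None) from one response page."""
    root = ET.fromstring(xml_text)
    ids = [e.text.strip()
           for e in root.iterfind(".//oai:header/oai:identifier", OAI_NS)
           if e.text]
    token = root.findtext(".//oai:resumptionToken", default="", namespaces=OAI_NS)
    return ids, token.strip() or None

first_url = list_identifiers_url("collection:cdl")
page_ids, next_token = parse_list_identifiers(SAMPLE)
print(first_url, page_ids, next_token)
```

A harvesting loop would keep calling list_identifiers_url with the returned token until it comes back empty.
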
> >
> > Additional instructions if you want to grab the content files.
> >
> > From any book's metadata page (e.g.,
> > http://www.archive.org/details/chemicallecturee00newtrich)
> > click through on the "Usage Rights: See Terms" link; the rights are on
> > a pane on the left-hand side.
> >
> > Once you know the identifier, you can grab the content files, using
> > this syntax:
> >     http://www.archive.org/details/$ID
> > Like so:
> >     http://www.archive.org/details/chemicallecturee00newtrich
> >
> > And then sniff the page to find the FTP link:
> >     ftp://ia340915.us.archive.org/2/items/chemicallecturee00newtrich
> >
> > But I think they prefer to use HTTP for these, not the FTP, so switch
> > this to:
> >     http://ia340915.us.archive.org/2/items/chemicallecturee00newtrich
> >
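
The FTP-to-HTTP switch above is just a scheme swap once the link has been sniffed from the page. A tiny sketch (the function name is made up, and finding the FTP link on the details page is left out because the page markup isn't described here):

```python
from urllib.parse import urlsplit, urlunsplit

def ftp_to_http(link):
    """Rewrite an ftp:// item link to http://, leaving other links untouched."""
    parts = urlsplit(link)
    if parts.scheme != "ftp":
        return link
    return urlunsplit(parts._replace(scheme="http"))

http_link = ftp_to_http("ftp://ia340915.us.archive.org/2/items/chemicallecturee00newtrich")
print(http_link)
```
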
> > Hope this helps!
> >
> >   --SET
> >
> >
> > > We're a contributor so I can use curl to grab our records via http
> > > (and regexp my way to our local catalog identifiers, which they do
> > > store/expose).
> > >
> > > I've played a bit with the z39.50 interface at indexdata
> > > (http://www.indexdata.dk/opencontent/), but I'm not confident about
> > > the content behind it.  I get very limited results; for instance, I
> > > can't find any UNC records and we're fairly new to the game.
> > >
> > > Again, I'm looking for unique identifiers in what I can get back and
> > > it's slim pickings.
> > >
> > > Anyone cracked this nut?  Got any life lessons for me?
> > >
> > > Thanks!
> > > Tim
> > >
> > > +++++++++++++++++++++++++++++++++++++++++++
> > > Tim Shearer
> > >
> > > Web Development Coordinator
> > > The University Library
> > > University of North Carolina at Chapel Hill
> > > [log in to unmask]
> > > 919-962-1288
> > > +++++++++++++++++++++++++++++++++++++++++++
> > >
> >
>