It is the same interface Chris described. I had emailed Brewster directly to learn about it. From that exchange I got the sense that OAI-PMH was the better option, and my questions about a staging instance went unanswered. But standing nearby when Jonathan cornered Brewster, I got the sense that he prefers the query interface. He didn't give concrete guidance on how many queries are too many, but he was conscious of performance.

--SET

--- Chris Freeland <[log in to unmask]> wrote:

> My guess is that, yes, the query interface we've been discussing here and the 'all sorts of interfaces that none of us knew about' are the same. It's not documented that I'm aware of. We've found out about it by literally sitting next to IA developers and asking questions.
>
> Chris
>
> -----Original Message-----
> From: Code for Libraries [mailto:[log in to unmask]] On Behalf Of Jonathan Rochkind
> Sent: Tuesday, February 26, 2008 12:18 PM
> To: [log in to unmask]
> Subject: Re: [CODE4LIB] oca api?
>
> So in answer to my question here at the Code4Lib conference, after Brewster's keynote, Brewster suggests there are all sorts of interfaces that none of us knew about. Or at least I didn't know about, and haven't been able to figure out in months of trying! I'm going to try and corner him and ask for an email of who we should contact.
>
> Perhaps it's the XML interface that you guys know about already. Is that documented anywhere? How the heck did you find out about it?
>
> Jonathan
>
> >>> Steve Toub <[log in to unmask]> 02/25/08 9:41 PM >>>
> I'll add that when IA told me about http://www.archive.org/services/search.php interface to return XML, they asked that we not send more than 100 records at a time since doing more would adversely affect production services. Which made it seem like OAI-PMH was a better way to go.
>
> Chris, can you explain a bit more about what this means: "We found their OAI interface to pull scanned items inconsistently based on date of scanning...."? I'm having trouble parsing.
>
> --SET
>
> --- Chris Freeland <[log in to unmask]> wrote:
>
> > Jonathan - No, I don't believe it's documented - at least not anywhere publicly. If any IA/OCA folks are lurking, here's an opportunity to make a bunch of techies happy...
> >
> > Chris
> >
> > -----Original Message-----
> > From: Code for Libraries [mailto:[log in to unmask]] On Behalf Of Jonathan Rochkind
> > Sent: Monday, February 25, 2008 2:48 PM
> > To: [log in to unmask]
> > Subject: Re: [CODE4LIB] oca api?
> >
> > I hadn't known this "custom query interface" existed! This is welcome news. Is this documented anywhere?
> >
> > Jonathan
> >
> > >>> Chris Freeland <[log in to unmask]> 02/25/08 2:51 PM >>>
> > Steve & Tim,
> >
> > I'm the tech director for the Biodiversity Heritage Library (BHL), which is a consortium of 10 natural history libraries who have partnered with Internet Archive (IA)/OCA for scanning our collections. We've just launched our revamped portal, complete with more than 7,500 books & 2.8 million pages scanned by IA & other digitization partners, at:
> > http://www.biodiversitylibrary.org
> >
> > To build this portal we ingest metadata from IA. We found their OAI interface to pull scanned items inconsistently based on date of scanning, so we switched to using their custom query interface.
> > Here's an example of a query we fire off:
> >
> > http://www.archive.org/services/search.php?query=collection:(biodiversity)+AND+updatedate:%5b2007-10-31+TO+2007-11-30%5d+AND+-contributor:(MBLWHOI%20Library)&limit=10&submit=submit
> >
> > This is returning scanned items from the "biodiversity" collection, updated between 10/31/2007 - 11/30/2007, restricted to one of our contributing libraries (MBLWHOI Library), and limited to 10 results.
> >
> > The results are styled in the browser; view source to see the good stuff. We use this list to grab the identifiers we've yet to ingest.
> >
> > Some background: When a book is scanned through IA/OCA scanning, they create their own unique identifier (like "annalesacademiae21univ") and grab a MARC record from the contributing library's catalog. All of the scanned files, derivatives, and metadata files are stored on IA's clusters in a directory named with the identifier.
> >
> > Steve mentioned using their /details/ directive, then sniffing the page to get the cluster location and the files for downloading. An easier method is to use their /download/ directive, as in:
> >
> > http://www.archive.org/download/$ID, or in the example above:
> > http://www.archive.org/download/annalesacademiae21univ
> >
> > That automatically does a lookup on the cluster, which means you don't have to scrape info off pages. You can also address any files within that directory, as in:
> > http://www.archive.org/download/annalesacademiae21univ/annalesacademiae21univ_marc.xml
> >
> > The only way to get standard identifiers (ISBN, ISSN, OCLC, LCCN) for these scanned books is to grab them out of the MARC record. So the long-winded answer to your question, Tim, is no, there's no simple way to crossref what IA has scanned with your catalog - THAT I KNOW OF. Big caveat on that last part.
> >
> > Happy to help with any other questions I can,
> >
> > Chris Freeland
> >
> > -----Original Message-----
> > From: Code for Libraries [mailto:[log in to unmask]] On Behalf Of Steve Toub
> > Sent: Sunday, February 24, 2008 11:20 PM
> > To: [log in to unmask]
> > Subject: Re: [CODE4LIB] oca api?
> >
> > --- Tim Shearer <[log in to unmask]> wrote:
> >
> > > Hi Folks,
> > >
> > > I'm looking into tapping the texts in the Open Content Alliance.
> > >
> > > A few questions...
> > >
> > > As near as I can tell, they don't expose (perhaps even store?) any common unique identifiers (oclc number, issn, isbn, loc number).
> >
> > I poked around in this world a few months ago in my previous job at California Digital Library, also an OCA partner.
> >
> > The unique key seems to be text string identifier (one that seems to be completely different from the text string identifier in Open Library). Apparently there was talk at the last partner meeting about moving to ISBNs:
> > http://dilettantes.code4lib.org/2007/10/22/tales-from-the-open-content-alliance/
> >
> > To obtain identifiers in bulk, I think the recommended approach is the OAI-PMH interface, which seems more reliable in recent months:
> >
> > http://www.archive.org/services/oai.php?verb=Identify
> >
> > http://www.archive.org/services/oai.php?verb=ListIdentifiers&metadataPrefix=oai_dc&set=collection:cdl
> >
> > etc.
> >
> > Additional instructions if you want to grab the content files.
> >
> > From any book's metadata page (e.g., http://www.archive.org/details/chemicallecturee00newtrich) click through on the "Usage Rights: See Terms" link; the rights are on a pane on the left-hand side.
> >
> > Once you know the identifier, you can grab the content files, using this syntax:
> > http://www.archive.org/details/$ID
> > Like so:
> > http://www.archive.org/details/chemicallecturee00newtrich
> >
> > And then sniff the page to find the FTP link:
> > ftp://ia340915.us.archive.org/2/items/chemicallecturee00newtrich
> >
> > But I think they prefer to use HTTP for these, not the FTP, so switch this to:
> > http://ia340915.us.archive.org/2/items/chemicallecturee00newtrich
> >
> > Hope this helps!
> >
> > --SET
> >
> > > We're a contributor so I can use curl to grab our records via http (and regexp my way to our local catalog identifiers, which they do store/expose).
> > >
> > > I've played a bit with the z39.50 interface at indexdata (http://www.indexdata.dk/opencontent/), but I'm not confident about the content behind it. I get very limited results, for instance I can't find any UNC records and we're fairly new to the game.
> > >
> > > Again, I'm looking for unique identifiers in what I can get back and it's slim pickings.
> > >
> > > Anyone cracked this nut? Got any life lessons for me?
> > >
> > > Thanks!
> > > Tim
> > >
> > > +++++++++++++++++++++++++++++++++++++++++++
> > > Tim Shearer
> > >
> > > Web Development Coordinator
> > > The University Library
> > > University of North Carolina at Chapel Hill
> > > [log in to unmask]
> > > 919-962-1288
> > > +++++++++++++++++++++++++++++++++++++++++++
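
For anyone who wants to script the URL patterns quoted in this thread, here is a minimal Python sketch. It only builds URLs and does not contact archive.org. The function names, the BASE constant, and the MAX_LIMIT check are my own assumptions for illustration; the only grounded pieces are the URL shapes Chris and Steve quote and the informal "no more than 100 records at a time" request. The interface is undocumented as far as this thread knows, so treat the parameter names as copied from the examples, not as a specification.

```python
from urllib.parse import quote, urlencode

BASE = "http://www.archive.org"  # host as quoted in the thread
MAX_LIMIT = 100  # assumption: mirrors IA's informal "no more than 100" request


def search_url(collection, start_date, end_date, limit=10):
    """Build a search.php URL shaped like Chris's example query.

    Hypothetical helper: the parameter names (query, limit, submit) are
    copied from the example URL in the thread.
    """
    if not 1 <= limit <= MAX_LIMIT:
        raise ValueError(f"limit must be between 1 and {MAX_LIMIT}")
    query = f"collection:({collection}) AND updatedate:[{start_date} TO {end_date}]"
    params = urlencode({"query": query, "limit": limit, "submit": "submit"})
    return f"{BASE}/services/search.php?{params}"


def download_url(identifier, filename=None):
    """Build a /download/ URL for an item, or for one file inside it."""
    url = f"{BASE}/download/{quote(identifier, safe='')}"
    if filename is not None:
        url += "/" + quote(filename, safe="")
    return url


def marc_url(identifier):
    """URL of the item's MARC record, following the <id>_marc.xml pattern."""
    return download_url(identifier, f"{identifier}_marc.xml")


def oai_list_identifiers_url(set_spec, metadata_prefix="oai_dc"):
    """Build an OAI-PMH ListIdentifiers URL like the one Steve quotes."""
    params = urlencode(
        {"verb": "ListIdentifiers", "metadataPrefix": metadata_prefix, "set": set_spec}
    )
    return f"{BASE}/services/oai.php?{params}"


if __name__ == "__main__":
    print(search_url("biodiversity", "2007-10-31", "2007-11-30"))
    print(marc_url("annalesacademiae21univ"))
    print(oai_list_identifiers_url("collection:cdl"))
```

Note that urlencode percent-encodes the colons, parentheses, and brackets in the query, so the generated URLs look different from the hand-written ones above but should decode to the same query string on the server side.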