So what, exactly, is "open" about this? Anyone care to guess?

Roy

On 2/26/08 10:29 AM, "Chris Freeland" <[log in to unmask]> wrote:

> My guess is that, yes, the query interface we've been discussing here
> and the 'all sorts of interfaces that none of us knew about' are the
> same. It's not documented that I'm aware of. We've found out about it
> by literally sitting next to IA developers and asking questions.
>
> Chris
>
> -----Original Message-----
> From: Code for Libraries [mailto:[log in to unmask]] On Behalf Of
> Jonathan Rochkind
> Sent: Tuesday, February 26, 2008 12:18 PM
> To: [log in to unmask]
> Subject: Re: [CODE4LIB] oca api?
>
> So in answer to my question here at the Code4Lib conference, after
> Brewster's keynote, Brewster suggests there are all sorts of interfaces
> that none of us knew about. Or at least I didn't know about, and haven't
> been able to figure out in months of trying! I'm going to try and
> corner him and ask for an email of who we should contact.
>
> Perhaps it's the XML interface that you guys know about already. Is that
> documented anywhere? How the heck did you find out about it?
>
> Jonathan
>
>
>>>> Steve Toub <[log in to unmask]> 02/25/08 9:41 PM >>>
> I'll add that when IA told me about the
> http://www.archive.org/services/search.php interface to return XML,
> they asked that we not request more than 100 records at a time, since
> doing more would adversely affect production services. Which made it
> seem like OAI-PMH was a better way to go.
>
> Chris, can you explain a bit more about what this means: "We found their
> OAI interface to pull scanned items inconsistently based on date of
> scanning...."? I'm having trouble parsing.
>
>
> --SET
>
>
>
>
> --- Chris Freeland <[log in to unmask]> wrote:
>
>> Jonathan - No, I don't believe it's documented - at least not anywhere
>> publicly. If any IA/OCA folks are lurking, here's an opportunity to
>> make a bunch of techies happy...
>>
>> Chris
>>
>> -----Original Message-----
>> From: Code for Libraries [mailto:[log in to unmask]] On Behalf Of
>> Jonathan Rochkind
>> Sent: Monday, February 25, 2008 2:48 PM
>> To: [log in to unmask]
>> Subject: Re: [CODE4LIB] oca api?
>>
>> I hadn't known this "custom query interface" existed! This is welcome
>> news. Is this documented anywhere?
>>
>> Jonathan
>>
>>
>>>>> Chris Freeland <[log in to unmask]> 02/25/08 2:51 PM >>>
>> Steve & Tim,
>>
>> I'm the tech director for the Biodiversity Heritage Library (BHL), which
>> is a consortium of 10 natural history libraries who have partnered with
>> Internet Archive (IA)/OCA for scanning our collections. We've just
>> launched our revamped portal, complete with more than 7,500 books & 2.8
>> million pages scanned by IA & other digitization partners, at:
>> http://www.biodiversitylibrary.org
>>
>> To build this portal we ingest metadata from IA. We found their OAI
>> interface to pull scanned items inconsistently based on date of
>> scanning, so we switched to using their custom query interface. Here's
>> an example of a query we fire off:
>>
>> http://www.archive.org/services/search.php?query=collection:(biodiversity)+AND+updatedate:%5b2007-10-31+TO+2007-11-30%5d+AND+-contributor:(MBLWHOI%20Library)&limit=10&submit=submit
>>
>> This is returning scanned items from the "biodiversity" collection,
>> updated between 10/31/2007 - 11/30/2007, restricted to one of our
>> contributing libraries (MBLWHOI Library), and limited to 10 results.
>>
>> The results are styled in the browser; view source to see the good
>> stuff. We use this list to grab the identifiers we've yet to ingest.
>>
>> Some background: When a book is scanned through IA/OCA scanning, they
>> create their own unique identifier (like "annalesacademiae21univ") and
>> grab a MARC record from the contributing library's catalog.
>> All of the scanned files, derivatives, and metadata files are stored
>> on IA's clusters in a directory named with the identifier.
>>
>> Steve mentioned using their /details/ directive, then sniffing the page
>> to get the cluster location and the files for downloading. An easier
>> method is to use their /download/ directive, as in:
>>
>> http://www.archive.org/download/$ID, or in the example above:
>> http://www.archive.org/download/annalesacademiae21univ
>>
>> That automatically does a lookup on the cluster, which means you don't
>> have to scrape info off pages. You can also address any files within
>> that directory, as in:
>>
>> http://www.archive.org/download/annalesacademiae21univ/annalesacademiae21univ_marc.xml
>>
>> The only way to get standard identifiers (ISBN, ISSN, OCLC, LCCN) for
>> these scanned books is to grab them out of the MARC record. So the
>> long-winded answer to your question, Tim, is no, there's no simple way
>> to crossref what IA has scanned with your catalog - THAT I KNOW OF. Big
>> caveat on that last part.
>>
>> Happy to help with any other questions I can,
>>
>> Chris Freeland
>>
>>
>> -----Original Message-----
>> From: Code for Libraries [mailto:[log in to unmask]] On Behalf Of
>> Steve Toub
>> Sent: Sunday, February 24, 2008 11:20 PM
>> To: [log in to unmask]
>> Subject: Re: [CODE4LIB] oca api?
>>
>> --- Tim Shearer <[log in to unmask]> wrote:
>>
>>> Hi Folks,
>>>
>>> I'm looking into tapping the texts in the Open Content Alliance.
>>>
>>> A few questions...
>>>
>>> As near as I can tell, they don't expose (perhaps even store?) any
>>> common unique identifiers (oclc number, issn, isbn, loc number).
>>
>> I poked around in this world a few months ago in my previous job at
>> California Digital Library, also an OCA partner.
>>
>> The unique key seems to be a text string identifier (one that seems to
>> be completely different from the text string identifier in Open
>> Library).
>> Apparently there was talk at the last partner meeting about moving to
>> ISBNs:
>>
>> http://dilettantes.code4lib.org/2007/10/22/tales-from-the-open-content-alliance/
>>
>> To obtain identifiers in bulk, I think the recommended approach is the
>> OAI-PMH interface, which seems more reliable in recent months:
>>
>> http://www.archive.org/services/oai.php?verb=Identify
>>
>> http://www.archive.org/services/oai.php?verb=ListIdentifiers&metadataPrefix=oai_dc&set=collection:cdl
>>
>> etc.
>>
>>
>> Additional instructions if you want to grab the content files.
>>
>> From any book's metadata page (e.g.,
>> http://www.archive.org/details/chemicallecturee00newtrich)
>> click through on the "Usage Rights: See Terms" link; the rights are on
>> a pane on the left-hand side.
>>
>> Once you know the identifier, you can grab the content files, using
>> this syntax:
>> http://www.archive.org/details/$ID
>> Like so:
>> http://www.archive.org/details/chemicallecturee00newtrich
>>
>> And then sniff the page to find the FTP link:
>> ftp://ia340915.us.archive.org/2/items/chemicallecturee00newtrich
>>
>> But I think they prefer to use HTTP for these, not the FTP, so switch
>> this to:
>> http://ia340915.us.archive.org/2/items/chemicallecturee00newtrich
>>
>> Hope this helps!
>>
>> --SET
>>
>>
>>> We're a contributor so I can use curl to grab our records via http
>>> (and regexp my way to our local catalog identifiers, which they do
>>> store/expose).
>>>
>>> I've played a bit with the z39.50 interface at indexdata
>>> (http://www.indexdata.dk/opencontent/), but I'm not confident about
>>> the content behind it. I get very limited results, for instance I
>>> can't find any UNC records and we're fairly new to the game.
>>>
>>> Again, I'm looking for unique identifiers in what I can get back and
>>> it's slim pickings.
>>>
>>> Anyone cracked this nut? Got any life lessons for me?
>>>
>>> Thanks!
>>> Tim
>>>
>>> +++++++++++++++++++++++++++++++++++++++++++
>>> Tim Shearer
>>>
>>> Web Development Coordinator
>>> The University Library
>>> University of North Carolina at Chapel Hill
>>> [log in to unmask]
>>> 919-962-1288
>>> +++++++++++++++++++++++++++++++++++++++++++
>>>
>> --
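
For anyone skimming the thread, the URL patterns quoted in the messages can be sketched in Python. This is only a rough illustration that builds the URLs as they appear above (search.php with a query and a limit, and the /download/ directive). The function names are made up for this sketch, the query syntax is assumed from Chris's example, the "<identifier>_marc.xml" naming is assumed to generalize from the single example given, and nothing here checks whether these undocumented endpoints still respond.

```python
from urllib.parse import quote_plus

# Host as quoted in the thread; whether it still serves these paths is not checked.
BASE = "http://www.archive.org"

# Steve reports IA asked for no more than 100 records per request.
MAX_LIMIT = 100


def search_url(collection, start, end, limit=10):
    """Hypothetical helper: build a search.php URL shaped like Chris's example."""
    if limit > MAX_LIMIT:
        raise ValueError(f"limit must be <= {MAX_LIMIT}")
    # Query syntax copied from the example in the thread (an assumption, since
    # the interface is undocumented): collection:(...) AND updatedate:[a TO b]
    query = f"collection:({collection}) AND updatedate:[{start} TO {end}]"
    # Keep ':', '(' and ')' literal as in the example; spaces become '+'.
    encoded = quote_plus(query, safe=":()")
    return f"{BASE}/services/search.php?query={encoded}&limit={limit}"


def download_url(identifier, filename=None):
    """Hypothetical helper: /download/ URL for an item directory or one file in it."""
    url = f"{BASE}/download/{identifier}"
    if filename is not None:
        url = f"{url}/{filename}"
    return url


def marc_url(identifier):
    """Assumes the MARC file is always named <identifier>_marc.xml, as in the example."""
    return download_url(identifier, f"{identifier}_marc.xml")


if __name__ == "__main__":
    print(search_url("biodiversity", "2007-10-31", "2007-11-30"))
    print(marc_url("annalesacademiae21univ"))
```

The limit check is there only because of the 100-record request mentioned in the thread; actually fetching the URLs and parsing the XML or MARC is left out on purpose, since the response formats are not described in the messages.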