But why are there hurdles?

Karen G. Schneider

On Wed, 27 Feb 2008 07:29:57 -0600, "Chris Freeland" <[log in to unmask]> said:
> Roy, do you have an answer in mind?
>
> To me & my project it's the content that is open, which is why it's worth
> the hurdles. Once you 'crack the nut' you can grab metadata, scans, and
> derivatives and ingest, parse, recombine, remix...as we've done for BHL.
>
> Access to OCA content may not be standards-based, but it works.
>
> Chris
>
> -----Original Message-----
> From: "Roy Tennant" <[log in to unmask]>
> To: "[log in to unmask]" <[log in to unmask]>
> Sent: 2/27/2008 5:28 AM
> Subject: Re: [CODE4LIB] oca api?
>
> So what, exactly, is "open" about this? Anyone care to guess?
> Roy
>
>
> On 2/26/08 10:29 AM, "Chris Freeland" <[log in to unmask]> wrote:
>
> > My guess is that, yes, the query interface we've been discussing here
> > and the 'all sorts of interfaces that none of us knew about' are the
> > same. It's not documented that I'm aware of. We've found out about it
> > by literally sitting next to IA developers and asking questions.
> >
> > Chris
> > -----Original Message-----
> > From: Code for Libraries [mailto:[log in to unmask]] On Behalf Of
> > Jonathan Rochkind
> > Sent: Tuesday, February 26, 2008 12:18 PM
> > To: [log in to unmask]
> > Subject: Re: [CODE4LIB] oca api?
> >
> > So in answer to my question here at the Code4Lib conference, after
> > Brewster's keynote, Brewster suggests there are all sorts of interfaces
> > that none of us knew about. Or at least I didn't know about, and haven't
> > been able to figure out in months of trying! I'm going to try and
> > corner him and ask for an email of who we should contact.
> >
> > Perhaps it's the XML interface that you guys know about already. Is that
> > documented anywhere? How the heck did you find out about it?
> >
> > Jonathan
> >
> >
> >>>> Steve Toub <[log in to unmask]> 02/25/08 9:41 PM >>>
> > I'll add that when IA told me about
> > http://www.archive.org/services/search.php interface to return
> > XML, they asked that we not send more than 100 records at a time since
> > doing more would adversely affect production services. Which made it
> > seem like OAI-PMH was a better way to go.
> >
> > Chris, can you explain a bit more about what this means: "We found their
> > OAI interface to pull scanned items inconsistently based on date of
> > scanning...."? I'm having trouble parsing.
> >
> >
> > --SET
> >
> >
> >
> >
> > --- Chris Freeland <[log in to unmask]> wrote:
> >
> >> Jonathan - No, I don't believe it's documented - at least not anywhere
> >> publicly. If any IA/OCA folks are lurking, here's an opportunity to
> >> make a bunch of techies happy...
> >>
> >> Chris
> >>
> >> -----Original Message-----
> >> From: Code for Libraries [mailto:[log in to unmask]] On Behalf Of
> >> Jonathan Rochkind
> >> Sent: Monday, February 25, 2008 2:48 PM
> >> To: [log in to unmask]
> >> Subject: Re: [CODE4LIB] oca api?
> >>
> >> I hadn't known this "custom query interface" existed! This is welcome
> >> news. Is this documented anywhere?
> >>
> >> Jonathan
> >>
> >>
> >>>>> Chris Freeland <[log in to unmask]> 02/25/08 2:51 PM >>>
> >> Steve & Tim,
> >>
> >> I'm the tech director for the Biodiversity Heritage Library (BHL), which
> >> is a consortium of 10 natural history libraries who have partnered with
> >> Internet Archive (IA)/OCA for scanning our collections. We've just
> >> launched our revamped portal, complete with more than 7,500 books & 2.8
> >> million pages scanned by IA & other digitization partners, at:
> >> http://www.biodiversitylibrary.org
> >>
> >> To build this portal we ingest metadata from IA.
> >> We found their OAI
> >> interface to pull scanned items inconsistently based on date of
> >> scanning, so we switched to using their custom query interface. Here's
> >> an example of a query we fire off:
> >>
> >> http://www.archive.org/services/search.php?query=collection:(biodiversity)+AND+updatedate:%5b2007-10-31+TO+2007-11-30%5d+AND+-contributor:(MBLWHOI%20Library)&limit=10&submit=submit
> >>
> >> This is returning scanned items from the "biodiversity" collection,
> >> updated between 10/31/2007 - 11/30/2007, restricted to one of our
> >> contributing libraries (MBLWHOI Library), and limited to 10 results.
> >>
> >> The results are styled in the browser; view source to see the good
> >> stuff. We use this list to grab the identifiers we've yet to ingest.
> >>
> >> Some background: When a book is scanned through IA/OCA scanning, they
> >> create their own unique identifier (like "annalesacademiae21univ") and
> >> grab a MARC record from the contributing library's catalog. All of the
> >> scanned files, derivatives, and metadata files are stored on IA's
> >> clusters in a directory named with the identifier.
> >>
> >> Steve mentioned using their /details/ directive, then sniffing the page
> >> to get the cluster location and the files for downloading. An easier
> >> method is to use their /download/ directive, as in:
> >>
> >> http://www.archive.org/download/ID$, or in the example above:
> >> http://www.archive.org/download/annalesacademiae21univ
> >>
> >> That automatically does a lookup on the cluster, which means you don't
> >> have to scrape info off pages. You can also address any files within
> >> that directory, as in:
> >>
> >> http://www.archive.org/download/annalesacademiae21univ/annalesacademiae21univ_marc.xml
> >>
> >> The only way to get standard identifiers (ISBN, ISSN, OCLC, LCCN) for
> >> these scanned books is to grab them out of the MARC record.
> >> So the
> >> long-winded answer to your question, Tim, is no, there's no simple way
> >> to crossref what IA has scanned with your catalog - THAT I KNOW OF. Big
> >> caveat on that last part.
> >>
> >> Happy to help with any other questions I can,
> >>
> >> Chris Freeland
> >>
> >>
> >> -----Original Message-----
> >> From: Code for Libraries [mailto:[log in to unmask]] On Behalf Of
> >> Steve Toub
> >> Sent: Sunday, February 24, 2008 11:20 PM
> >> To: [log in to unmask]
> >> Subject: Re: [CODE4LIB] oca api?
> >>
> >> --- Tim Shearer <[log in to unmask]> wrote:
> >>
> >>> Hi Folks,
> >>>
> >>> I'm looking into tapping the texts in the Open Content Alliance.
> >>>
> >>> A few questions...
> >>>
> >>> As near as I can tell, they don't expose (perhaps even store?) any common
> >>> unique identifiers (oclc number, issn, isbn, loc number).
> >>
> >> I poked around in this world a few months ago in my previous job at
> >> California Digital Library, also an OCA partner.
> >>
> >> The unique key seems to be text string identifier (one that seems to be
> >> completely different from the text string identifier in Open Library).
> >> Apparently there was talk at the last partner meeting about moving to
> >> ISBNs:
> >>
> >> http://dilettantes.code4lib.org/2007/10/22/tales-from-the-open-content-alliance/
> >>
> >> To obtain identifiers in bulk, I think the recommended approach is the
> >> OAI-PMH interface, which seems more reliable in recent months:
> >>
> >> http://www.archive.org/services/oai.php?verb=Identify
> >>
> >> http://www.archive.org/services/oai.php?verb=ListIdentifiers&metadataPrefix=oai_dc&set=collection:cdl
> >>
> >> etc.
> >>
> >>
> >> Additional instructions if you want to grab the content files.
> >>
> >> From any book's metadata page (e.g.,
> >> http://www.archive.org/details/chemicallecturee00newtrich)
> >> click through on the "Usage Rights: See Terms" link; the rights are on a
> >> pane on the left-hand side.
> >>
> >> Once you know the identifier, you can grab the content files, using this
> >> syntax:
> >> http://www.archive.org/details/$ID
> >> Like so:
> >> http://www.archive.org/details/chemicallecturee00newtrich
> >>
> >> And then sniff the page to find the FTP link:
> >> ftp://ia340915.us.archive.org/2/items/chemicallecturee00newtrich
> >>
> >> But I think they prefer to use HTTP for these, not the FTP, so switch
> >> this to:
> >> http://ia340915.us.archive.org/2/items/chemicallecturee00newtrich
> >>
> >> Hope this helps!
> >>
> >> --SET
> >>
> >>
> >>> We're a contributor so I can use curl to grab our records via http (and
> >>> regexp my way to our local catalog identifiers, which they do
> >>> store/expose).
> >>>
> >>> I've played a bit with the z39.50 interface at indexdata
> >>> (http://www.indexdata.dk/opencontent/), but I'm not confident about the
> >>> content behind it. I get very limited results, for instance I can't find
> >>> any UNC records and we're fairly new to the game.
> >>>
> >>> Again, I'm looking for unique identifiers in what I can get back and it's
> >>> slim pickings.
> >>>
> >>> Anyone cracked this nut? Got any life lessons for me?
> >>>
> >>> Thanks!
> >>> Tim
> >>>
> >>> +++++++++++++++++++++++++++++++++++++++++++
> >>> Tim Shearer
> >>>
> >>> Web Development Coordinator
> >>> The University Library
> >>> University of North Carolina at Chapel Hill
> >>> [log in to unmask]
> >>> 919-962-1288
> >>> +++++++++++++++++++++++++++++++++++++++++++
> >>>
> >>
> > --
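
For anyone reading this thread later: the URL patterns quoted above (the search.php query, the /download/ shortcut, the _marc.xml file, and the OAI-PMH ListIdentifiers call) can be sketched in Python roughly as below. This is only an illustration, not anything IA documented. The function names are made up for this sketch, the endpoints are the ones quoted in the thread and may have changed since, the 100-record cap simply mirrors the request Steve relays, and the identifier lookup assumes the _marc.xml file is namespaced MARCXML using the standard MARC 21 fields (020 ISBN, 022 ISSN, 010 LCCN, 035 with an "(OCoLC)" prefix for OCLC numbers). Nothing here touches the network; it only builds URLs and parses a record you have already fetched.

```python
import xml.etree.ElementTree as ET
from urllib.parse import quote, urlencode

BASE = "http://www.archive.org"  # host as quoted in the thread
MARC_NS = {"marc": "http://www.loc.gov/MARC21/slim"}  # assumed MARCXML namespace


def search_url(collection, start, end, limit=10):
    """Build a search.php URL shaped like the query Chris quotes (hypothetical helper)."""
    if limit > 100:
        # Steve reports IA asked for no more than 100 records at a time.
        raise ValueError("limit should not exceed 100")
    query = f"collection:({collection}) AND updatedate:[{start} TO {end}]"
    # Leave ':' and parentheses readable; spaces and brackets are percent-encoded.
    return f"{BASE}/services/search.php?query={quote(query, safe=':()')}&limit={limit}"


def download_url(identifier, filename=None):
    """URL under /download/, which per the thread looks up the cluster for you."""
    url = f"{BASE}/download/{quote(identifier)}"
    return f"{url}/{quote(filename)}" if filename else url


def marc_url(identifier):
    """URL of the MARC record stored alongside the scans, per the thread's example."""
    return download_url(identifier, f"{identifier}_marc.xml")


def oai_list_identifiers_url(set_spec, prefix="oai_dc"):
    """OAI-PMH ListIdentifiers URL in the form Steve quotes."""
    params = urlencode(
        {"verb": "ListIdentifiers", "metadataPrefix": prefix, "set": set_spec},
        safe=":",
    )
    return f"{BASE}/services/oai.php?{params}"


def standard_identifiers(marcxml_text):
    """Collect ISBN/ISSN/LCCN/OCLC values from a MARCXML record string."""
    tags = {"020": "isbn", "022": "issn", "010": "lccn"}
    found = {"isbn": [], "issn": [], "lccn": [], "oclc": []}
    root = ET.fromstring(marcxml_text)
    for field in root.iter("{http://www.loc.gov/MARC21/slim}datafield"):
        tag = field.get("tag")
        for sub in field.findall("marc:subfield[@code='a']", MARC_NS):
            value = (sub.text or "").strip()
            if tag in tags and value:
                found[tags[tag]].append(value)
            elif tag == "035" and value.startswith("(OCoLC)"):
                found["oclc"].append(value[len("(OCoLC)"):].strip())
    return found
```

A typical flow would be: call search_url (or oai_list_identifiers_url) to get identifiers you have not ingested, fetch marc_url(identifier) with curl or any HTTP client, and pass the response body to standard_identifiers to get keys you can match against a local catalog.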