Because the IA hasn't devoted resources to documenting this stuff, I guess. If they actually want their stuff to be used by folks like us, then it seems to me that resources devoted to such would be resources well spent.

Jonathan

K.G. Schneider wrote:
> But why are there hurdles?
>
> Karen G. Schneider
>
> On Wed, 27 Feb 2008 07:29:57 -0600, "Chris Freeland" <[log in to unmask]> said:
>
>> Roy, do you have an answer in mind?
>>
>> To me & my project it's the content that is open, which is why it's worth the hurdles. Once you 'crack the nut' you can grab metadata, scans, and derivatives and ingest, parse, recombine, remix...as we've done for BHL.
>>
>> Access to OCA content may not be standards-based, but it works.
>>
>> Chris
>>
>> -----Original Message-----
>> From: "Roy Tennant" <[log in to unmask]>
>> To: "[log in to unmask]" <[log in to unmask]>
>> Sent: 2/27/2008 5:28 AM
>> Subject: Re: [CODE4LIB] oca api?
>>
>> So what, exactly, is "open" about this? Anyone care to guess?
>> Roy
>>
>> On 2/26/08 10:29 AM, "Chris Freeland" <[log in to unmask]> wrote:
>>
>>> My guess is that, yes, the query interface we've been discussing here and the 'all sorts of interfaces that none of us knew about' are the same. It's not documented that I'm aware of. We've found out about it by literally sitting next to IA developers and asking questions.
>>>
>>> Chris
>>>
>>> -----Original Message-----
>>> From: Code for Libraries [mailto:[log in to unmask]] On Behalf Of Jonathan Rochkind
>>> Sent: Tuesday, February 26, 2008 12:18 PM
>>> To: [log in to unmask]
>>> Subject: Re: [CODE4LIB] oca api?
>>>
>>> So in answer to my question here at the Code4Lib conference, after Brewster's keynote, Brewster suggests there are all sorts of interfaces that none of us knew about. Or at least I didn't know about, and haven't been able to figure out in months of trying! I'm going to try and corner him and ask for an email of who we should contact.
>>> Perhaps it's the XML interface that you guys know about already. Is that documented anywhere? How the heck did you find out about it?
>>>
>>> Jonathan
>>>
>>>>>> Steve Toub <[log in to unmask]> 02/25/08 9:41 PM >>>
>>>
>>> I'll add that when IA told me about the http://www.archive.org/services/search.php interface to return XML, they asked that we not send more than 100 records at a time, since doing more would adversely affect production services. Which made it seem like OAI-PMH was a better way to go.
>>>
>>> Chris, can you explain a bit more about what this means: "We found their OAI interface to pull scanned items inconsistently based on date of scanning...."? I'm having trouble parsing.
>>>
>>> --SET
>>>
>>> --- Chris Freeland <[log in to unmask]> wrote:
>>>
>>>> Jonathan - No, I don't believe it's documented - at least not anywhere publicly. If any IA/OCA folks are lurking, here's an opportunity to make a bunch of techies happy...
>>>>
>>>> Chris
>>>>
>>>> -----Original Message-----
>>>> From: Code for Libraries [mailto:[log in to unmask]] On Behalf Of Jonathan Rochkind
>>>> Sent: Monday, February 25, 2008 2:48 PM
>>>> To: [log in to unmask]
>>>> Subject: Re: [CODE4LIB] oca api?
>>>>
>>>> I hadn't known this "custom query interface" existed! This is welcome news. Is this documented anywhere?
>>>>
>>>> Jonathan
>>>>
>>>>>>> Chris Freeland <[log in to unmask]> 02/25/08 2:51 PM >>>
>>>>
>>>> Steve & Tim,
>>>>
>>>> I'm the tech director for the Biodiversity Heritage Library (BHL), which is a consortium of 10 natural history libraries that have partnered with Internet Archive (IA)/OCA for scanning our collections.
>>>> We've just launched our revamped portal, complete with more than 7,500 books & 2.8 million pages scanned by IA & other digitization partners, at:
>>>> http://www.biodiversitylibrary.org
>>>>
>>>> To build this portal we ingest metadata from IA. We found their OAI interface to pull scanned items inconsistently based on date of scanning, so we switched to using their custom query interface. Here's an example of a query we fire off:
>>>>
>>>> http://www.archive.org/services/search.php?query=collection:(biodiversity)+AND+updatedate:%5b2007-10-31+TO+2007-11-30%5d+AND+-contributor:(MBLWHOI%20Library)&limit=10&submit=submit
>>>>
>>>> This returns scanned items from the "biodiversity" collection, updated between 10/31/2007 - 11/30/2007, excluding items from one of our contributing libraries (MBLWHOI Library), and limited to 10 results.
>>>>
>>>> The results are styled in the browser; view source to see the good stuff. We use this list to grab the identifiers we've yet to ingest.
>>>>
>>>> Some background: When a book is scanned through IA/OCA scanning, they create their own unique identifier (like "annalesacademiae21univ") and grab a MARC record from the contributing library's catalog. All of the scanned files, derivatives, and metadata files are stored on IA's clusters in a directory named with the identifier.
>>>>
>>>> Steve mentioned using their /details/ directive, then sniffing the page to get the cluster location and the files for downloading. An easier method is to use their /download/ directive, as in:
>>>>
>>>> http://www.archive.org/download/$ID, or in the example above:
>>>> http://www.archive.org/download/annalesacademiae21univ
>>>>
>>>> That automatically does a lookup on the cluster, which means you don't have to scrape info off pages.
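The query above can be assembled programmatically. A minimal sketch, assuming Python: the field names (collection, updatedate, contributor) and the minus-prefix exclusion come straight from the example in the thread; the exact URL-encoding rules are an assumption, since the search.php interface is undocumented.

```python
# Sketch: build an archive.org search.php query URL like the BHL example
# in this thread. The interface is undocumented, so the encoding choices
# here (quote_plus, Lucene-style query syntax) are assumptions.
from urllib.parse import quote_plus

def build_search_url(collection, date_from, date_to,
                     exclude_contributor=None, limit=10):
    """Assemble a search.php URL mirroring the BHL example query."""
    parts = [
        "collection:(%s)" % collection,
        "updatedate:[%s TO %s]" % (date_from, date_to),
    ]
    if exclude_contributor:
        # A leading minus excludes matches (Lucene prohibited-clause syntax).
        parts.append("-contributor:(%s)" % exclude_contributor)
    query = " AND ".join(parts)
    # Keep parens and colons literal, as in the example URL.
    return ("http://www.archive.org/services/search.php?query=%s"
            "&limit=%d&submit=submit" % (quote_plus(query, safe="():"), limit))

url = build_search_url("biodiversity", "2007-10-31", "2007-11-30",
                       exclude_contributor="MBLWHOI Library")
```

Viewing the resulting URL in a browser shows the styled result list; the raw XML is in the page source, as Chris notes.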
>>>> You can also address any files within that directory, as in:
>>>>
>>>> http://www.archive.org/download/annalesacademiae21univ/annalesacademiae21univ_marc.xml
>>>>
>>>> The only way to get standard identifiers (ISBN, ISSN, OCLC, LCCN) for these scanned books is to grab them out of the MARC record. So the long-winded answer to your question, Tim, is no, there's no simple way to crossref what IA has scanned with your catalog - THAT I KNOW OF. Big caveat on that last part.
>>>>
>>>> Happy to help with any other questions I can,
>>>>
>>>> Chris Freeland
>>>>
>>>> -----Original Message-----
>>>> From: Code for Libraries [mailto:[log in to unmask]] On Behalf Of Steve Toub
>>>> Sent: Sunday, February 24, 2008 11:20 PM
>>>> To: [log in to unmask]
>>>> Subject: Re: [CODE4LIB] oca api?
>>>>
>>>> --- Tim Shearer <[log in to unmask]> wrote:
>>>>
>>>>> Hi Folks,
>>>>>
>>>>> I'm looking into tapping the texts in the Open Content Alliance.
>>>>>
>>>>> A few questions...
>>>>>
>>>>> As near as I can tell, they don't expose (perhaps even store?) any common unique identifiers (oclc number, issn, isbn, loc number).
>>>>
>>>> I poked around in this world a few months ago in my previous job at California Digital Library, also an OCA partner.
>>>>
>>>> The unique key seems to be a text string identifier (one that seems to be completely different from the text string identifier in Open Library).
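Since the MARC record fetched via /download/ is the only place those standard identifiers live, extracting them is a matter of parsing the item's *_marc.xml. A minimal sketch: the tags used (010 LCCN, 020 ISBN, 022 ISSN, 035 system/OCLC number) are standard MARC21; the sample record below is made up for illustration.

```python
# Sketch: pull standard identifiers out of the *_marc.xml file that IA
# stores alongside each scanned item. Tags 010/020/022/035 are standard
# MARC21 identifier fields; the sample record is fabricated for the demo.
import xml.etree.ElementTree as ET

MARC_NS = "{http://www.loc.gov/MARC21/slim}"

def identifiers_from_marcxml(xml_text):
    """Return a dict mapping identifier type -> list of subfield 'a' values."""
    wanted = {"010": "lccn", "020": "isbn", "022": "issn", "035": "oclc"}
    out = {}
    root = ET.fromstring(xml_text)
    for field in root.iter(MARC_NS + "datafield"):
        label = wanted.get(field.get("tag"))
        if label is None:
            continue
        for sub in field.findall(MARC_NS + "subfield"):
            if sub.get("code") == "a" and sub.text:
                out.setdefault(label, []).append(sub.text.strip())
    return out

# Made-up minimal MARCXML record, just to exercise the parser.
SAMPLE = """<record xmlns="http://www.loc.gov/MARC21/slim">
  <datafield tag="020" ind1=" " ind2=" ">
    <subfield code="a">0123456789</subfield>
  </datafield>
  <datafield tag="035" ind1=" " ind2=" ">
    <subfield code="a">(OCoLC)12345678</subfield>
  </datafield>
</record>"""
```

In practice you'd feed this the bytes fetched from http://www.archive.org/download/$ID/$ID_marc.xml and then match the extracted numbers against your catalog.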
>>>> Apparently there was talk at the last partner meeting about moving to ISBNs:
>>>>
>>>> http://dilettantes.code4lib.org/2007/10/22/tales-from-the-open-content-alliance/
>>>>
>>>> To obtain identifiers in bulk, I think the recommended approach is the OAI-PMH interface, which seems more reliable in recent months:
>>>>
>>>> http://www.archive.org/services/oai.php?verb=Identify
>>>> http://www.archive.org/services/oai.php?verb=ListIdentifiers&metadataPrefix=oai_dc&set=collection:cdl
>>>>
>>>> etc.
>>>>
>>>> Additional instructions if you want to grab the content files:
>>>>
>>>> From any book's metadata page (e.g., http://www.archive.org/details/chemicallecturee00newtrich), click through on the "Usage Rights: See Terms" link; the rights are on a pane on the left-hand side.
>>>>
>>>> Once you know the identifier, you can grab the content files using this syntax:
>>>> http://www.archive.org/details/$ID
>>>> Like so:
>>>> http://www.archive.org/details/chemicallecturee00newtrich
>>>>
>>>> And then sniff the page to find the FTP link:
>>>> ftp://ia340915.us.archive.org/2/items/chemicallecturee00newtrich
>>>>
>>>> But I think they prefer HTTP for these, not FTP, so switch this to:
>>>> http://ia340915.us.archive.org/2/items/chemicallecturee00newtrich
>>>>
>>>> Hope this helps!
>>>>
>>>> --SET
>>>>
>>>>> We're a contributor, so I can use curl to grab our records via http (and regexp my way to our local catalog identifiers, which they do store/expose).
>>>>>
>>>>> I've played a bit with the z39.50 interface at indexdata (http://www.indexdata.dk/opencontent/), but I'm not confident about the content behind it. I get very limited results; for instance, I can't find any UNC records, and we're fairly new to the game.
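The OAI-PMH route above pages through large sets with resumption tokens, which is standard OAI-PMH 2.0 behavior. A minimal sketch of parsing one ListIdentifiers page and building the follow-up request; the base URL matches the one in the thread, but the sample response is trimmed and made up (real responses also carry datestamps and setSpecs).

```python
# Sketch: parse one page of an OAI-PMH ListIdentifiers response and build
# the next request from the resumptionToken (standard OAI-PMH 2.0 flow).
# The sample response below is fabricated for the demo.
import xml.etree.ElementTree as ET
from urllib.parse import quote

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"
BASE = "http://www.archive.org/services/oai.php"

def parse_list_identifiers(xml_text):
    """Return (identifiers, resumption_token_or_None) from one response page."""
    root = ET.fromstring(xml_text)
    ids = [e.text for e in root.iter(OAI_NS + "identifier")]
    tok = root.find(".//" + OAI_NS + "resumptionToken")
    token = tok.text if tok is not None and tok.text else None
    return ids, token

def next_request(token):
    # Per the OAI-PMH spec, follow-up requests carry only the verb
    # and the (URL-encoded) resumptionToken.
    return "%s?verb=ListIdentifiers&resumptionToken=%s" % (BASE, quote(token))

# Fabricated, trimmed ListIdentifiers page for illustration.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListIdentifiers>
    <header><identifier>oai:archive.org:chemicallecturee00newtrich</identifier></header>
    <resumptionToken>collection:cdl|100</resumptionToken>
  </ListIdentifiers>
</OAI-PMH>"""
```

A harvester would loop: fetch, parse, collect identifiers, and keep requesting next_request(token) until the token comes back empty.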
>>>>> Again, I'm looking for unique identifiers in what I can get back, and it's slim pickings.
>>>>>
>>>>> Anyone cracked this nut? Got any life lessons for me?
>>>>>
>>>>> Thanks!
>>>>> Tim
>>>>>
>>>>> +++++++++++++++++++++++++++++++++++++++++++
>>>>> Tim Shearer
>>>>>
>>>>> Web Development Coordinator
>>>>> The University Library
>>>>> University of North Carolina at Chapel Hill
>>>>> [log in to unmask]
>>>>> 919-962-1288
>>>>> +++++++++++++++++++++++++++++++++++++++++++

--
Jonathan Rochkind
Digital Services Software Engineer
The Sheridan Libraries
Johns Hopkins University
410.516.8886
rochkind (at) jhu.edu