Great idea, Tim! The open library tech list that Bess mentions is
[log in to unmask], described at
http://mail.archive.org/cgi-bin/mailman/listinfo/ol-tech

-Jodi

Jodi Schneider
Science Library Specialist
Amherst College
413-542-2076

>-----Original Message-----
>From: Code for Libraries [mailto:[log in to unmask]] On Behalf Of Tim Shearer
>Sent: Thursday, March 06, 2008 8:47 AM
>To: [log in to unmask]
>Subject: [CODE4LIB] musing on oca apiRe: [CODE4LIB] oca api?
>
>Howdy folks,
>
>I've been playing and thinking. I'd like to have what amounts to a unique
>identifier index to oca digitized texts. I want to be able to pull all the
>records that have oclc numbers, issns, isbns, etc. I want it to be
>lightweight, fast, searchable.
>
>Would anyone else want/use such a thing?
>
>I'm thinking about building something like this.
>
>If I do, it would be ideal if it weren't a duplication of effort, so anyone
>got this in the works? And if it would meet the needs of others.
>
>My basic notion is to crawl the site (starting with "americana", the American
>Libraries). Pull the oca unique identifier (e.g. northcarolinayea1910rale) and
>associate it with
>
>unique identifiers (oclc numbers, issns, isbns, lc numbers)
>contributing institution's alias and unique catalog identifier
>upload date
>
>That's all I was thinking of. Then there's what you might be able to do with
>it:
>
>  Give me all the oca unique identifiers that have oclc numbers
>  Give me all the oca unique identifiers with isbns that were
>  uploaded between x and y date
>  Give me the oca unique identifier for this oclc number
>
>Planning to do:
>
>  keep crawling it and keep it up to date.
>
>Things I wasn't planning to do:
>
>  worry about other unique ids (you'd have to go to xISBN or
>  ThingISBN yourself)
>  worry about storing anything else from oca.
>
>It would be good for being able to add an 856 to matches in your catalog. It
>would not be good for grabbing all marc records for all of oca.
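One way the identifier index described above might be laid out in sqlite, as a rough sketch only. The table names, column names, and helper functions are all hypothetical (nothing here comes from IA/OCA), and the crawler that would feed it is left out entirely. Dates are assumed to be stored as ISO 8601 strings so that plain string comparison gives the "between x and y" query.

```python
import sqlite3

# Hypothetical schema: every table and column name is illustrative only.
SCHEMA = """
CREATE TABLE IF NOT EXISTS item (
    oca_id      TEXT PRIMARY KEY,  -- e.g. northcarolinayea1910rale
    contributor TEXT,              -- contributing institution's alias
    catalog_id  TEXT,              -- contributor's own catalog identifier
    upload_date TEXT               -- ISO 8601 date (assumed format)
);
CREATE TABLE IF NOT EXISTS std_id (
    oca_id  TEXT NOT NULL REFERENCES item(oca_id),
    id_type TEXT NOT NULL,         -- 'oclc', 'issn', 'isbn' or 'lccn'
    value   TEXT NOT NULL,
    PRIMARY KEY (oca_id, id_type, value)
);
CREATE INDEX IF NOT EXISTS std_id_lookup ON std_id (id_type, value);
"""


def open_index(path=":memory:"):
    """Open (or create) the index database and make sure the tables exist."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def add_item(conn, oca_id, contributor, catalog_id, upload_date, std_ids):
    """Store one scanned item; std_ids is an iterable of (id_type, value)."""
    conn.execute(
        "INSERT OR REPLACE INTO item VALUES (?, ?, ?, ?)",
        (oca_id, contributor, catalog_id, upload_date),
    )
    conn.executemany(
        "INSERT OR IGNORE INTO std_id VALUES (?, ?, ?)",
        [(oca_id, id_type, value) for id_type, value in std_ids],
    )
    conn.commit()


def ids_with(conn, id_type, start=None, end=None):
    """OCA identifiers that have an id of id_type, optionally by upload date."""
    sql = (
        "SELECT DISTINCT i.oca_id FROM item i "
        "JOIN std_id s ON s.oca_id = i.oca_id WHERE s.id_type = ?"
    )
    params = [id_type]
    if start is not None and end is not None:
        sql += " AND i.upload_date BETWEEN ? AND ?"
        params += [start, end]
    sql += " ORDER BY i.oca_id"
    return [row[0] for row in conn.execute(sql, params)]


def lookup(conn, id_type, value):
    """OCA identifiers recorded for one standard identifier."""
    rows = conn.execute(
        "SELECT oca_id FROM std_id WHERE id_type = ? AND value = ? "
        "ORDER BY oca_id",
        (id_type, value),
    )
    return [row[0] for row in rows]
```

With that in place, the three example questions map onto ids_with(conn, "oclc"), ids_with(conn, "isbn", x, y), and lookup(conn, "oclc", number).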
>
>Anyhow, is this duplication of effort? Would you like something like this?
>What else would you like it to do (keeping in mind this is an unfunded pet
>project)? How would you want to talk to it? I was thinking of a web service,
>but hadn't thought too much about how to query it or how I'd deliver results.
>
>Of course I'm being an idiot and trying out new tools at the same time (python
>to see what the buzz is all about, sqlite just to learn it (it may not work
>out)).
>
>Thoughts? Vicious criticism?
>
>-t
>
>
>On Tue, 26 Feb 2008, Chris Freeland wrote:
>
>> My guess is that, yes, the query interface we've been discussing here
>> and the 'all sorts of interfaces that none of us knew about' are the
>> same. It's not documented that I'm aware of. We've found out about it
>> by literally sitting next to IA developers and asking questions.
>>
>> Chris
>> -----Original Message-----
>> From: Code for Libraries [mailto:[log in to unmask]] On Behalf Of
>> Jonathan Rochkind
>> Sent: Tuesday, February 26, 2008 12:18 PM
>> To: [log in to unmask]
>> Subject: Re: [CODE4LIB] oca api?
>>
>> So in answer to my question here at the Code4Lib conference, after
>> Brewster's keynote, Brewster suggests there are all sorts of interfaces
>> that none of us knew about. Or at least I didn't know about, and haven't
>> been able to figure out in months of trying! I'm going to try and
>> corner him and ask for an email of who we should contact.
>>
>> Perhaps it's the XML interface that you guys know about already. Is that
>> documented anywhere? How the heck did you find out about it?
>>
>> Jonathan
>>
>>
>>>>> Steve Toub <[log in to unmask]> 02/25/08 9:41 PM >>>
>> I'll add that when IA told me about the
>> http://www.archive.org/services/search.php interface to return
>> XML, they asked that we not send more than 100 records at a time since
>> doing more would adversely affect production services. Which made it
>> seem like OAI-PMH was a better way to go.
>>
>> Chris, can you explain a bit more about what this means: "We found their
>> OAI interface to pull scanned items inconsistently based on date of
>> scanning...."? I'm having trouble parsing.
>>
>>
>> --SET
>>
>>
>>
>>
>> --- Chris Freeland <[log in to unmask]> wrote:
>>
>>> Jonathan - No, I don't believe it's documented - at least not anywhere
>>> publicly. If any IA/OCA folks are lurking, here's an opportunity to
>>> make a bunch of techies happy...
>>>
>>> Chris
>>>
>>> -----Original Message-----
>>> From: Code for Libraries [mailto:[log in to unmask]] On Behalf Of
>>> Jonathan Rochkind
>>> Sent: Monday, February 25, 2008 2:48 PM
>>> To: [log in to unmask]
>>> Subject: Re: [CODE4LIB] oca api?
>>>
>>> I hadn't known this "custom query interface" existed! This is welcome
>>> news. Is this documented anywhere?
>>>
>>> Jonathan
>>>
>>>
>>>>>> Chris Freeland <[log in to unmask]> 02/25/08 2:51 PM >>>
>>> Steve & Tim,
>>>
>>> I'm the tech director for the Biodiversity Heritage Library (BHL), which
>>> is a consortium of 10 natural history libraries who have partnered with
>>> Internet Archive (IA)/OCA for scanning our collections. We've just
>>> launched our revamped portal, complete with more than 7,500 books & 2.8
>>> million pages scanned by IA & other digitization partners, at:
>>> http://www.biodiversitylibrary.org
>>>
>>> To build this portal we ingest metadata from IA. We found their OAI
>>> interface to pull scanned items inconsistently based on date of
>>> scanning, so we switched to using their custom query interface.
>>> Here's an example of a query we fire off:
>>>
>>> http://www.archive.org/services/search.php?query=collection:(biodiversity)+AND+updatedate:%5b2007-10-31+TO+2007-11-30%5d+AND+-contributor:(MBLWHOI%20Library)&limit=10&submit=submit
>>>
>>> This is returning scanned items from the "biodiversity" collection,
>>> updated between 10/31/2007 - 11/30/2007, restricted to one of our
>>> contributing libraries (MBLWHOI Library), and limited to 10 results.
>>>
>>> The results are styled in the browser; view source to see the good
>>> stuff. We use this list to grab the identifiers we've yet to ingest.
>>>
>>> Some background: When a book is scanned through IA/OCA scanning, they
>>> create their own unique identifier (like "annalesacademiae21univ") and
>>> grab a MARC record from the contributing library's catalog. All of the
>>> scanned files, derivatives, and metadata files are stored on IA's
>>> clusters in a directory named with the identifier.
>>>
>>> Steve mentioned using their /details/ directive, then sniffing the page
>>> to get the cluster location and the files for downloading. An easier
>>> method is to use their /download/ directive, as in:
>>>
>>> http://www.archive.org/download/$ID, or in the example above:
>>> http://www.archive.org/download/annalesacademiae21univ
>>>
>>> That automatically does a lookup on the cluster, which means you don't
>>> have to scrape info off pages. You can also address any files within
>>> that directory, as in:
>>>
>>> http://www.archive.org/download/annalesacademiae21univ/annalesacademiae21univ_marc.xml
>>>
>>> The only way to get standard identifiers (ISBN, ISSN, OCLC, LCCN) for
>>> these scanned books is to grab them out of the MARC record. So the
>>> long-winded answer to your question, Tim, is no, there's no simple way
>>> to crossref what IA has scanned with your catalog - THAT I KNOW OF. Big
>>> caveat on that last part.
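Pulling the standard identifiers out of the MARC record, as Chris describes, might look roughly like this. It is a sketch under assumptions: that the _marc.xml file is MARCXML with datafield/subfield elements, and that the identifiers sit in the usual MARC 21 fields (010 LCCN, 020 ISBN, 022 ISSN, 035 with an "(OCoLC)" prefix for the OCLC number). The function names are made up for illustration, and fetching the file is left to whatever HTTP client you prefer.

```python
import xml.etree.ElementTree as ET

DOWNLOAD_BASE = "http://www.archive.org/download"

# MARC 21 fields that conventionally carry each identifier in subfield $a.
FIELD_TYPES = {"010": "lccn", "020": "isbn", "022": "issn", "035": "oclc"}
OCLC_PREFIX = "(OCoLC)"


def marc_url(identifier):
    """URL of the MARC XML file, following the /download/ pattern above."""
    return f"{DOWNLOAD_BASE}/{identifier}/{identifier}_marc.xml"


def _local(tag):
    """Strip any XML namespace so the parser works with or without one."""
    return tag.rsplit("}", 1)[-1]


def standard_ids(marcxml_text):
    """Return (id_type, value) pairs found in a MARCXML record."""
    root = ET.fromstring(marcxml_text)
    found = []
    for field in root.iter():
        if _local(field.tag) != "datafield":
            continue
        id_type = FIELD_TYPES.get(field.get("tag"))
        if id_type is None:
            continue
        for sub in field:
            if _local(sub.tag) != "subfield" or sub.get("code") != "a":
                continue
            value = (sub.text or "").strip()
            if id_type == "oclc":
                # 035 holds many system numbers; keep only OCLC ones.
                if not value.startswith(OCLC_PREFIX):
                    continue
                value = value[len(OCLC_PREFIX):].strip()
            if value:
                found.append((id_type, value))
    return found
```

Real records will be messier than this allows for (ISBNs with trailing qualifiers, OCLC numbers in other fields), so treat the output as a starting point rather than a clean match key.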
>>>
>>> Happy to help with any other questions I can,
>>>
>>> Chris Freeland
>>>
>>>
>>> -----Original Message-----
>>> From: Code for Libraries [mailto:[log in to unmask]] On Behalf Of
>>> Steve Toub
>>> Sent: Sunday, February 24, 2008 11:20 PM
>>> To: [log in to unmask]
>>> Subject: Re: [CODE4LIB] oca api?
>>>
>>> --- Tim Shearer <[log in to unmask]> wrote:
>>>
>>>> Hi Folks,
>>>>
>>>> I'm looking into tapping the texts in the Open Content Alliance.
>>>>
>>>> A few questions...
>>>>
>>>> As near as I can tell, they don't expose (perhaps even store?) any common
>>>> unique identifiers (oclc number, issn, isbn, loc number).
>>>
>>> I poked around in this world a few months ago in my previous job at
>>> California Digital Library, also an OCA partner.
>>>
>>> The unique key seems to be a text string identifier (one that seems to be
>>> completely different from the text string identifier in Open Library).
>>> Apparently there was talk at the last partner meeting about moving to
>>> ISBNs:
>>>
>>> http://dilettantes.code4lib.org/2007/10/22/tales-from-the-open-content-alliance/
>>>
>>> To obtain identifiers in bulk, I think the recommended approach is the
>>> OAI-PMH interface, which seems more reliable in recent months:
>>>
>>> http://www.archive.org/services/oai.php?verb=Identify
>>>
>>> http://www.archive.org/services/oai.php?verb=ListIdentifiers&metadataPrefix=oai_dc&set=collection:cdl
>>>
>>> etc.
>>>
>>>
>>> Additional instructions if you want to grab the content files.
>>>
>>> From any book's metadata page (e.g.,
>>> http://www.archive.org/details/chemicallecturee00newtrich)
>>> click through on the "Usage Rights: See Terms" link; the rights are on a
>>> pane on the left-hand side.
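The OAI-PMH requests listed above can be put together programmatically along these lines. This is only a sketch of URL construction: the helper names are hypothetical, nothing is fetched, and the resumptionToken handling follows the general OAI-PMH rule that a follow-up request carries the verb and the token and no other arguments, not anything IA has documented.

```python
from urllib.parse import urlencode

# Base URL taken from the examples in the thread.
OAI_BASE = "http://www.archive.org/services/oai.php"


def oai_url(verb, **args):
    """Build an OAI-PMH request URL for the given verb and arguments."""
    params = {"verb": verb}
    params.update(args)
    # Leave ':' unescaped so set names read as they do in the thread.
    return OAI_BASE + "?" + urlencode(params, safe=":")


def resume_url(verb, token):
    """Follow-up request for the next page of a long list."""
    return oai_url(verb, resumptionToken=token)


if __name__ == "__main__":
    print(oai_url("Identify"))
    print(oai_url("ListIdentifiers", metadataPrefix="oai_dc", set="collection:cdl"))
```

A harvester would request the first URL, read the resumptionToken element (if any) out of the response, and keep calling resume_url until no token comes back.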
>>>
>>> Once you know the identifier, you can grab the content files, using this
>>> syntax:
>>> http://www.archive.org/details/$ID
>>> Like so:
>>> http://www.archive.org/details/chemicallecturee00newtrich
>>>
>>> And then sniff the page to find the FTP link:
>>> ftp://ia340915.us.archive.org/2/items/chemicallecturee00newtrich
>>>
>>> But I think they prefer to use HTTP for these, not the FTP, so switch
>>> this to:
>>> http://ia340915.us.archive.org/2/items/chemicallecturee00newtrich
>>>
>>> Hope this helps!
>>>
>>> --SET
>>>
>>>
>>>> We're a contributor so I can use curl to grab our records via http (and
>>>> regexp my way to our local catalog identifiers, which they do
>>>> store/expose).
>>>>
>>>> I've played a bit with the z39.50 interface at indexdata
>>>> (http://www.indexdata.dk/opencontent/), but I'm not confident about the
>>>> content behind it. I get very limited results, for instance I can't find
>>>> any UNC records and we're fairly new to the game.
>>>>
>>>> Again, I'm looking for unique identifiers in what I can get back and it's
>>>> slim pickings.
>>>>
>>>> Anyone cracked this nut? Got any life lessons for me?
>>>>
>>>> Thanks!
>>>> Tim
>>>>
>>>> +++++++++++++++++++++++++++++++++++++++++++
>>>> Tim Shearer
>>>>
>>>> Web Development Coordinator
>>>> The University Library
>>>> University of North Carolina at Chapel Hill
>>>> [log in to unmask]
>>>> 919-962-1288
>>>> +++++++++++++++++++++++++++++++++++++++++++