Yup, Chris' email was exactly what I was hoping for.

Now if only there were a nice way to pre-screen for records that have a
non-empty (isbn|issn|oclc#) without all the work of looking per record
(and the overhead for the server, and the overhead if more than one
organization starts to do this). I guess I want to search for
uniqueID != NULL and only get their unique id back, and script from
there.

Still and all, this now seems a very doable thing.

Chris, many thanks!

-t

On Mon, 25 Feb 2008, Tennant,Roy wrote:

> Well, from where Chris left off it would be fairly easy to check for a
> file in the directory with a "marc.xml" filename extension, then XSLT
> for:
>
> <datafield tag="010" ind1=" " ind2=" ">
> <subfield code="a">39004822</subfield>
> </datafield>
>
> If such exists, then you'll have the ISBN. To sweeten it further,
> send that into xISBN or ThingISBN and get other ISBNs for the same work.
> This seems completely scriptable to me. Perhaps someone at c4l will have
> it done before the conference is over. And Tim, the example above is one
> that's in your catalog.
> Roy
>
> -----Original Message-----
> From: Code for Libraries [mailto:[log in to unmask]] On Behalf Of
> Chris Freeland
> Sent: Monday, February 25, 2008 11:51 AM
> To: [log in to unmask]
> Subject: Re: [CODE4LIB] oca api?
>
> Steve & Tim,
>
> I'm the tech director for the Biodiversity Heritage Library (BHL), which
> is a consortium of 10 natural history libraries who have partnered with
> Internet Archive (IA)/OCA for scanning our collections. We've just
> launched our revamped portal, complete with more than 7,500 books & 2.8
> million pages scanned by IA & other digitization partners, at:
> http://www.biodiversitylibrary.org
>
> To build this portal we ingest metadata from IA. We found their OAI
> interface to pull scanned items inconsistently based on date of
> scanning, so we switched to using their custom query interface.
> Here's an example of a query we fire off:
>
> http://www.archive.org/services/search.php?query=collection:(biodiversity)+AND+updatedate:%5b2007-10-31+TO+2007-11-30%5d+AND+-contributor:(MBLWHOI%20Library)&limit=10&submit=submit
>
> This is returning scanned items from the "biodiversity" collection,
> updated between 10/31/2007 and 11/30/2007, restricted to one of our
> contributing libraries (MBLWHOI Library), and limited to 10 results.
>
> The results are styled in the browser; view source to see the good
> stuff. We use this list to grab the identifiers we've yet to ingest.
>
> Some background: When a book is scanned through IA/OCA scanning, they
> create their own unique identifier (like "annalesacademiae21univ") and
> grab a MARC record from the contributing library's catalog. All of the
> scanned files, derivatives, and metadata files are stored on IA's
> clusters in a directory named with the identifier.
>
> Steve mentioned using their /details/ directive, then sniffing the page
> to get the cluster location and the files for downloading. An easier
> method is to use their /download/ directive, as in:
>
> http://www.archive.org/download/$ID, or in the example above:
> http://www.archive.org/download/annalesacademiae21univ
>
> That automatically does a lookup on the cluster, which means you don't
> have to scrape info off pages. You can also address any files within
> that directory, as in:
> http://www.archive.org/download/annalesacademiae21univ/annalesacademiae21univ_marc.xml
>
> The only way to get standard identifiers (ISBN, ISSN, OCLC, LCCN) for
> these scanned books is to grab them out of the MARC record. So the
> long-winded answer to your question, Tim, is no, there's no simple way
> to crossref what IA has scanned with your catalog - THAT I KNOW OF. Big
> caveat on that last part.
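A rough Python sketch of the MARC route described above, offered as an illustration rather than a tested client. It builds the /download/ URL for an item's MARC XML and pulls subfield "a" out of the identifier fields. The function names and the tag-to-label table are hypothetical; the table follows general MARC 21 convention (010 is the LCCN, 020 the ISBN, 022 the ISSN, 035 a system control number, where OCLC numbers often appear with an "(OCoLC)" prefix), so the 010 example quoted in this thread would come back as an LCCN. Whether a given record fills in any of these fields is not something the thread confirms.

```python
import xml.etree.ElementTree as ET

# Tag -> label, per general MARC 21 usage (an assumption, not from the thread).
ID_FIELDS = {
    "010": "lccn",
    "020": "isbn",
    "022": "issn",
    "035": "system_control_number",
}


def marc_xml_url(identifier):
    """Build the /download/ URL for an item's MARC XML (pattern from the thread)."""
    return "http://www.archive.org/download/{0}/{0}_marc.xml".format(identifier)


def local_name(tag):
    """Strip any '{namespace}' prefix so namespaced and bare records both match."""
    return tag.rsplit("}", 1)[-1]


def extract_identifiers(marc_xml):
    """Return {label: [values]} taken from subfield 'a' of the tags in ID_FIELDS."""
    root = ET.fromstring(marc_xml)
    found = {}
    for field in root.iter():
        if local_name(field.tag) != "datafield":
            continue
        label = ID_FIELDS.get(field.get("tag"))
        if label is None:
            continue
        for sub in field:
            if local_name(sub.tag) == "subfield" and sub.get("code") == "a":
                value = (sub.text or "").strip()
                if value:
                    found.setdefault(label, []).append(value)
    return found


# The datafield quoted in this thread, wrapped in a bare <record> element.
SAMPLE = """<record>
  <datafield tag="010" ind1=" " ind2=" ">
    <subfield code="a">39004822</subfield>
  </datafield>
</record>"""

if __name__ == "__main__":
    print(marc_xml_url("annalesacademiae21univ"))
    print(extract_identifiers(SAMPLE))
```

Fetching the URL (with curl or urllib) and feeding the response text to extract_identifiers would be the per-record step; any ISBNs found could then go on to xISBN or ThingISBN.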
>
> Happy to help with any other questions I can,
>
> Chris Freeland
>
>
> -----Original Message-----
> From: Code for Libraries [mailto:[log in to unmask]] On Behalf Of
> Steve Toub
> Sent: Sunday, February 24, 2008 11:20 PM
> To: [log in to unmask]
> Subject: Re: [CODE4LIB] oca api?
>
> --- Tim Shearer <[log in to unmask]> wrote:
>
>> Hi Folks,
>>
>> I'm looking into tapping the texts in the Open Content Alliance.
>>
>> A few questions...
>>
>> As near as I can tell, they don't expose (perhaps even store?) any common
>> unique identifiers (oclc number, issn, isbn, loc number).
>
> I poked around in this world a few months ago in my previous job at
> California Digital Library, also an OCA partner.
>
> The unique key seems to be a text string identifier (one that seems to be
> completely different from the text string identifier in Open Library).
> Apparently there was talk at the last partner meeting about moving to
> ISBNs:
> http://dilettantes.code4lib.org/2007/10/22/tales-from-the-open-content-alliance/
>
> To obtain identifiers in bulk, I think the recommended approach is the
> OAI-PMH interface, which seems more reliable in recent months:
>
> http://www.archive.org/services/oai.php?verb=Identify
>
> http://www.archive.org/services/oai.php?verb=ListIdentifiers&metadataPrefix=oai_dc&set=collection:cdl
>
> etc.
>
>
> Additional instructions if you want to grab the content files.
>
> From any book's metadata page (e.g.,
> http://www.archive.org/details/chemicallecturee00newtrich)
> click through on the "Usage Rights: See Terms" link; the rights are on a
> pane on the left-hand side.
>
> Once you know the identifier, you can grab the content files, using this
> syntax:
> http://www.archive.org/details/$ID
> Like so:
> http://www.archive.org/details/chemicallecturee00newtrich
>
> And then sniff the page to find the FTP link:
> ftp://ia340915.us.archive.org/2/items/chemicallecturee00newtrich
>
> But I think they prefer to use HTTP for these, not the FTP, so switch
> this to:
> http://ia340915.us.archive.org/2/items/chemicallecturee00newtrich
>
> Hope this helps!
>
> --SET
>
>
>> We're a contributor so I can use curl to grab our records via http (and
>> regexp my way to our local catalog identifiers, which they do
>> store/expose).
>>
>> I've played a bit with the z39.50 interface at indexdata
>> (http://www.indexdata.dk/opencontent/), but I'm not confident about the
>> content behind it. I get very limited results, for instance I can't find
>> any UNC records and we're fairly new to the game.
>>
>> Again, I'm looking for unique identifiers in what I can get back and it's
>> slim pickings.
>>
>> Anyone cracked this nut? Got any life lessons for me?
>>
>> Thanks!
>> Tim
>>
>> +++++++++++++++++++++++++++++++++++++++++++
>> Tim Shearer
>>
>> Web Development Coordinator
>> The University Library
>> University of North Carolina at Chapel Hill
>> [log in to unmask]
>> 919-962-1288
>> +++++++++++++++++++++++++++++++++++++++++++
>>
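
For the bulk route Steve mentions, here is a small Python sketch of paging through OAI-PMH ListIdentifiers. It only builds request URLs and parses a response; fetching is left out. The endpoint and the "collection:cdl" set come from the thread and may well have changed since. The function names, the sample response, and the "oai:archive.org:..." identifier format in it are made up for illustration. The element names and the resumptionToken behavior follow the OAI-PMH 2.0 protocol.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

OAI_BASE = "http://www.archive.org/services/oai.php"  # endpoint as quoted in the thread
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"     # OAI-PMH 2.0 namespace


def list_identifiers_url(collection, resumption_token=None):
    """Build a ListIdentifiers request URL for one collection set."""
    if resumption_token:
        # In OAI-PMH a resumptionToken replaces all other arguments except verb.
        params = {"verb": "ListIdentifiers", "resumptionToken": resumption_token}
    else:
        params = {
            "verb": "ListIdentifiers",
            "metadataPrefix": "oai_dc",
            "set": "collection:" + collection,
        }
    return OAI_BASE + "?" + urlencode(params)


def parse_list_identifiers(xml_text):
    """Return (identifiers, resumption_token_or_None) from a response body."""
    root = ET.fromstring(xml_text)
    identifiers = [
        elem.text.strip() for elem in root.iter(OAI_NS + "identifier") if elem.text
    ]
    token_elem = root.find(".//" + OAI_NS + "resumptionToken")
    token = None
    if token_elem is not None and token_elem.text:
        token = token_elem.text.strip()
    return identifiers, token


# Hypothetical response body, shaped per the OAI-PMH schema.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListIdentifiers>
    <header>
      <identifier>oai:archive.org:chemicallecturee00newtrich</identifier>
    </header>
    <resumptionToken>hypothetical-token</resumptionToken>
  </ListIdentifiers>
</OAI-PMH>"""

if __name__ == "__main__":
    print(list_identifiers_url("cdl"))
    print(parse_list_identifiers(SAMPLE))
```

A harvesting loop would request list_identifiers_url(collection), parse the body, and repeat with the returned token until the token comes back as None. Note that urlencode percent-encodes the colon in the set name, which a server should treat the same as the literal colon in the URL Steve gave.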