I would absolutely want and use such a thing.

I don't know of anyone else doing that, although I have been thinking
about it too (but don't really have time to do much with it). The
approach and issues you have identified match what I've been thinking,
and I don't have much to add.

Are you thinking of providing an index that you'd let the rest of us
search?  That would be great. Although there's always an issue with
sustainability there; if I have my software use your index, what happens
when you leave your job and your employer stops supporting it? It might
make sense to try to find a more "neutral" host site for such a thing,
and try to get together a small 'committee' to support it, so if you
stop working on it for whatever reason a year from now, it is more
likely to continue to work.

Jonathan

Tim Shearer wrote:
> Howdy folks,
>
> I've been playing and thinking.  I'd like to have what amounts to a
> unique identifier index to oca digitized texts.  I want to be able to
> pull all the records that have oclc numbers, issns, isbns, etc.  I want
> it to be lightweight, fast, searchable.
>
> Would anyone else want/use such a thing?
>
> I'm thinking about building something like this.
>
> If I do, it would be ideal if it wouldn't be a duplication of effort, so
> has anyone got this in the works?  And would it meet the needs of others?
>
> My basic notion is to crawl the site (starting with "americana", the
> American Libraries).  Pull the oca unique identifier (e.g.
> northcarolinayea1910rale) and associate it with
>
> unique identifiers (oclc numbers, issns, isbns, lc numbers)
> contributing institution's alias and unique catalog identifier
> upload date
>
> That's all I was thinking of.  Then there's what you might be able to
> do with it:
>
>        Give me all the oca unique identifiers that have oclc numbers
>        Give me all the oca unique identifiers with isbns that were
>                uploaded between x and y date
>        Give me the oca unique identifier for this oclc number
>
> Planning to do:
>
>        keep crawling it and keep it up to date.
>
> Things I wasn't planning to do:
>
>        worry about other unique ids (you'd have to go to xISBN or
>                ThingISBN yourself)
>        worry about storing anything else from oca.
>
> It would be good for being able to add an 856 to matches in your
> catalog.  It would not be good for grabbing all marc records for all of
> oca.
>
> Anyhow, is this duplication of effort?  Would you like something like
> this?  What else would you like it to do (keeping in mind this is an
> unfunded pet project)?  How would you want to talk to it?  I was
> thinking of a web service, but hadn't thought too much about how to
> query it or how I'd deliver results.
>
> Of course I'm being an idiot and trying out new tools at the same time
> (python to see what the buzz is all about, sqlite just to learn it (it
> may not work out)).
>
> Thoughts?  Vicious criticism?
>
> -t
>
>
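A minimal sketch of the index Tim describes, in the Python and sqlite he mentions. The table names, column names, and function names are assumptions made for illustration, as are the contributor aliases and identifier values in the usage note; nothing here comes from an actual implementation.

```python
import sqlite3


def create_schema(conn):
    """Create one row per oca item plus one row per standard identifier."""
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS item (
            oca_id      TEXT PRIMARY KEY,  -- e.g. northcarolinayea1910rale
            contributor TEXT,              -- contributing institution's alias
            catalog_id  TEXT,              -- that institution's catalog identifier
            upload_date TEXT               -- ISO dates (YYYY-MM-DD) compare correctly as text
        );
        CREATE TABLE IF NOT EXISTS identifier (
            oca_id  TEXT NOT NULL,
            id_type TEXT NOT NULL,         -- 'oclc', 'issn', 'isbn' or 'lccn'
            value   TEXT NOT NULL,
            PRIMARY KEY (oca_id, id_type, value)
        );
        CREATE INDEX IF NOT EXISTS identifier_lookup
            ON identifier (id_type, value);
    """)


def add_item(conn, oca_id, contributor, catalog_id, upload_date, identifiers):
    """Store one crawled item; identifiers is a list of (id_type, value) pairs."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO item VALUES (?, ?, ?, ?)",
            (oca_id, contributor, catalog_id, upload_date),
        )
        conn.executemany(
            "INSERT OR IGNORE INTO identifier VALUES (?, ?, ?)",
            [(oca_id, id_type, value) for id_type, value in identifiers],
        )


def ids_with_type(conn, id_type, start=None, end=None):
    """All oca identifiers that have an id_type, optionally uploaded between two dates."""
    sql = (
        "SELECT DISTINCT item.oca_id FROM item "
        "JOIN identifier ON identifier.oca_id = item.oca_id "
        "WHERE identifier.id_type = ?"
    )
    params = [id_type]
    if start is not None and end is not None:
        sql += " AND item.upload_date BETWEEN ? AND ?"
        params += [start, end]
    sql += " ORDER BY item.oca_id"
    return [row[0] for row in conn.execute(sql, params)]


def lookup(conn, id_type, value):
    """The oca identifier(s) for one standard identifier, e.g. an oclc number."""
    rows = conn.execute(
        "SELECT oca_id FROM identifier WHERE id_type = ? AND value = ? ORDER BY oca_id",
        (id_type, value),
    )
    return [row[0] for row in rows]
```

The three example questions map onto `ids_with_type(conn, "oclc")`, `ids_with_type(conn, "isbn", x, y)`, and `lookup(conn, "oclc", number)`. Keeping identifiers in their own table lets one item carry several ISBNs or OCLC numbers without schema changes.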
> On Tue, 26 Feb 2008, Chris Freeland wrote:
>
>> My guess is that, yes, the query interface we've been discussing here
>> and the 'all sorts of interfaces that none of us knew about' are the
>> same.  It's not documented that I'm aware of.  We've found out about it
>> by literally sitting next to IA developers and asking questions.
>>
>> Chris
>> -----Original Message-----
>> From: Code for Libraries [mailto:[log in to unmask]] On Behalf Of
>> Jonathan Rochkind
>> Sent: Tuesday, February 26, 2008 12:18 PM
>> To: [log in to unmask]
>> Subject: Re: [CODE4LIB] oca api?
>>
>> So in answer to my question here at the Code4Lib conference, after
>> Brewster's keynote, Brewster suggests there are all sorts of interfaces
>> that none of us knew about. Or at least I didn't know about, and haven't
>> been able to figure out in months of trying!  I'm going to try and
>> corner him and ask for an email of who we should contact.
>>
>> Perhaps it's the XML interface that you guys know about already. Is that
>> documented anywhere? How the heck did you find out about it?
>>
>> Jonathan
>>
>>
>>>>> Steve Toub <[log in to unmask]> 02/25/08 9:41 PM >>>
>> I'll add that when IA told me about the
>> http://www.archive.org/services/search.php interface to return XML,
>> they asked that we not send more than 100 records at a time since doing
>> more would adversely affect production services. Which made it seem
>> like OAI-PMH was a better way to go.
>>
>> Chris, can you explain a bit more about what this means: "We found their
>> OAI interface to pull scanned items inconsistently based on date of
>> scanning...."? I'm having trouble parsing.
>>
>>
>>   --SET
>>
>>
>>
>>
>> --- Chris Freeland <[log in to unmask]> wrote:
>>
>>> Jonathan - No, I don't believe it's documented - at least not anywhere
>>> publicly.  If any IA/OCA folks are lurking, here's an opportunity to
>>> make a bunch of techies happy...
>>>
>>> Chris
>>>
>>> -----Original Message-----
>>> From: Code for Libraries [mailto:[log in to unmask]] On Behalf Of
>>> Jonathan Rochkind
>>> Sent: Monday, February 25, 2008 2:48 PM
>>> To: [log in to unmask]
>>> Subject: Re: [CODE4LIB] oca api?
>>>
>>> I hadn't known this "custom query interface" existed! This is welcome
>>> news. Is this documented anywhere?
>>>
>>> Jonathan
>>>
>>>
>>>>>> Chris Freeland <[log in to unmask]> 02/25/08 2:51 PM >>>
>>> Steve & Tim,
>>>
>>> I'm the tech director for the Biodiversity Heritage Library (BHL), which
>>> is a consortium of 10 natural history libraries who have partnered with
>>> Internet Archive (IA)/OCA for scanning our collections.  We've just
>>> launched our revamped portal, complete with more than 7,500 books & 2.8
>>> million pages scanned by IA & other digitization partners, at:
>>> http://www.biodiversitylibrary.org
>>>
>>> To build this portal we ingest metadata from IA.  We found their OAI
>>> interface to pull scanned items inconsistently based on date of
>>> scanning, so we switched to using their custom query interface.  Here's
>>> an example of a query we fire off:
>>>
>>>
>>> http://www.archive.org/services/search.php?query=collection:(biodiversity)+AND+updatedate:%5b2007-10-31+TO+2007-11-30%5d+AND+-contributor:(MBLWHOI%20Library)&limit=10&submit=submit
>>>
>>> This is returning scanned items from the "biodiversity" collection,
>>> updated between 10/31/2007 - 11/30/2007, restricted to one of our
>>> contributing libraries (MBLWHOI Library), and limited to 10 results.
>>>
>>> The results are styled in the browser; view source to see the good
>>> stuff.  We use this list to grab the identifiers we've yet to ingest.
>>>
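A small sketch of how such a query URL could be assembled in Python. The function name and parameters are assumptions for illustration; only the endpoint, the `query`, `limit`, and `submit` parameters, and the `collection:` and `updatedate:` fields come from the example above. The cap of 100 is one reading of the limit IA asked for elsewhere in this thread. No request is made here, so the sketch says nothing about what the service returns.

```python
from urllib.parse import urlencode

SEARCH_URL = "http://www.archive.org/services/search.php"
MAX_RECORDS = 100  # assumption: the per-request ceiling IA reportedly asked for


def build_search_url(collection, start, end, limit=10):
    """Build a search.php URL for items in a collection updated between two dates.

    start and end are YYYY-MM-DD strings, as in the example query.
    """
    if not 1 <= limit <= MAX_RECORDS:
        raise ValueError(f"limit must be between 1 and {MAX_RECORDS}")
    query = f"collection:({collection}) AND updatedate:[{start} TO {end}]"
    # urlencode percent-escapes the colons, brackets and parentheses and
    # turns spaces into '+', which decodes to the same query string.
    return SEARCH_URL + "?" + urlencode(
        {"query": query, "limit": limit, "submit": "submit"}
    )
```

For example, `build_search_url("biodiversity", "2007-10-31", "2007-11-30")` gives a URL equivalent to the one above without the contributor clause.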
>>> Some background: When a book is scanned through IA/OCA scanning, they
>>> create their own unique identifier (like "annalesacademiae21univ") and
>>> grab a MARC record from the contributing library's catalog.  All of the
>>> scanned files, derivatives, and metadata files are stored on IA's
>>> clusters in a directory named with the identifier.
>>>
>>> Steve mentioned using their /details/ directive, then sniffing the page
>>> to get the cluster location and the files for downloading.  An easier
>>> method is to use their /download/ directive, as in:
>>>
>>> http://www.archive.org/download/$ID, or in the example above:
>>> http://www.archive.org/download/annalesacademiae21univ
>>>
>>> That automatically does a lookup on the cluster, which means you don't
>>> have to scrape info off pages.  You can also address any files within
>>> that directory, as in:
>>>
>>> http://www.archive.org/download/annalesacademiae21univ/annalesacademiae21univ_marc.xml
>>>
>>> The only way to get standard identifiers (ISBN, ISSN, OCLC, LCCN) for
>>> these scanned books is to grab them out of the MARC record.  So the
>>> long-winded answer to your question, Tim, is no, there's no simple way
>>> to crossref what IA has scanned with your catalog - THAT I KNOW OF.  Big
>>> caveat on that last part.
>>>
>>> Happy to help with any other questions I can,
>>>
>>> Chris Freeland
>>>
>>>
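A hedged sketch of pulling those identifiers out of a MARCXML record with Python's standard library. The function names are made up for illustration. The field choices follow standard MARC 21 usage (010 $a for LCCN, 020 $a for ISBN, 022 $a for ISSN, 035 $a with an "(OCoLC)" prefix for OCLC numbers); whether a given IA record actually carries each of these depends on the contributing library's catalog, and the layout of the `_marc.xml` file is assumed to be ordinary MARCXML.

```python
import xml.etree.ElementTree as ET

# MARC 21 fields whose subfield $a holds a standard identifier.
FIELD_TYPES = {"010": "lccn", "020": "isbn", "022": "issn"}
OCLC_PREFIX = "(OCoLC)"


def marc_url(oca_id):
    """URL of an item's MARCXML file, following the /download/ pattern above."""
    return f"http://www.archive.org/download/{oca_id}/{oca_id}_marc.xml"


def _local(tag):
    """Strip any XML namespace so records with or without one both match."""
    return tag.rsplit("}", 1)[-1]


def extract_identifiers(marcxml):
    """Return (id_type, value) pairs found in a MARCXML string, in record order."""
    root = ET.fromstring(marcxml)
    found = []
    for field in root.iter():
        if _local(field.tag) != "datafield":
            continue
        tag = field.get("tag")
        for sub in field:
            if _local(sub.tag) != "subfield" or sub.get("code") != "a":
                continue
            value = (sub.text or "").strip()
            if not value:
                continue
            if tag in FIELD_TYPES:
                found.append((FIELD_TYPES[tag], value))
            elif tag == "035" and value.startswith(OCLC_PREFIX):
                found.append(("oclc", value[len(OCLC_PREFIX):].strip()))
    return found
```

Values come back as they appear in the record, so an ISBN with a trailing qualifier such as "(pbk.)" would still need normalizing before matching against a catalog.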
>>> -----Original Message-----
>>> From: Code for Libraries [mailto:[log in to unmask]] On Behalf Of
>>> Steve Toub
>>> Sent: Sunday, February 24, 2008 11:20 PM
>>> To: [log in to unmask]
>>> Subject: Re: [CODE4LIB] oca api?
>>>
>>> --- Tim Shearer <[log in to unmask]> wrote:
>>>
>>>> Hi Folks,
>>>>
>>>> I'm looking into tapping the texts in the Open Content Alliance.
>>>>
>>>> A few questions...
>>>>
>>>> As near as I can tell, they don't expose (perhaps even store?) any
>>>> common unique identifiers (oclc number, issn, isbn, loc number).
>>>
>>> I poked around in this world a few months ago in my previous job at
>>> California Digital Library, also an OCA partner.
>>>
>>> The unique key seems to be a text string identifier (one that seems to
>>> be completely different from the text string identifier in Open
>>> Library). Apparently there was talk at the last partner meeting about
>>> moving to ISBNs:
>>>
>>> http://dilettantes.code4lib.org/2007/10/22/tales-from-the-open-content-alliance/
>>>
>>> To obtain identifiers in bulk, I think the recommended approach is the
>>> OAI-PMH interface, which seems more reliable in recent months:
>>>
>>> http://www.archive.org/services/oai.php?verb=Identify
>>>
>>>
>>> http://www.archive.org/services/oai.php?verb=ListIdentifiers&metadataPrefix=oai_dc&set=collection:cdl
>>>
>>> etc.
>>>
>>>
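A rough sketch of the harvesting side in Python: building ListIdentifiers URLs and reading identifiers and the resumption token out of a response. The function names are illustrative assumptions. The response layout and namespace follow the OAI-PMH 2.0 specification; how IA's endpoint actually pages, or what its identifiers look like, is not checked here, and no request is made.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

OAI_URL = "http://www.archive.org/services/oai.php"
NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def list_identifiers_url(set_spec=None, token=None):
    """Build a ListIdentifiers URL.

    OAI-PMH makes resumptionToken an exclusive argument, so a follow-up
    request carries only the verb and the token.
    """
    if token:
        params = {"verb": "ListIdentifiers", "resumptionToken": token}
    else:
        params = {"verb": "ListIdentifiers", "metadataPrefix": "oai_dc"}
        if set_spec:
            params["set"] = set_spec
    return OAI_URL + "?" + urlencode(params)


def parse_list_identifiers(xml_text):
    """Return (identifiers, resumption_token) from a ListIdentifiers response.

    The token is None when the response has no token or an empty one,
    which in OAI-PMH marks the end of the list.
    """
    root = ET.fromstring(xml_text)
    identifiers = [
        el.text.strip()
        for el in root.findall(".//oai:header/oai:identifier", NS)
        if el.text
    ]
    token_el = root.find(".//oai:resumptionToken", NS)
    token = None
    if token_el is not None and token_el.text and token_el.text.strip():
        token = token_el.text.strip()
    return identifiers, token
```

A harvester would fetch `list_identifiers_url("collection:cdl")`, parse the response, and keep requesting `list_identifiers_url(token=token)` until the token comes back as `None`.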
>>> Additional instructions if you want to grab the content files.
>>>
>>> From any book's metadata page (e.g.,
>>> http://www.archive.org/details/chemicallecturee00newtrich)
>>> click through on the "Usage Rights: See Terms" link; the rights are on a
>>> pane on the left-hand side.
>>>
>>> Once you know the identifier, you can grab the content files, using this
>>> syntax:
>>>     http://www.archive.org/details/$ID
>>> Like so:
>>>     http://www.archive.org/details/chemicallecturee00newtrich
>>>
>>> And then sniff the page to find the FTP link:
>>>     ftp://ia340915.us.archive.org/2/items/chemicallecturee00newtrich
>>>
>>> But I think they prefer to use HTTP for these, not the FTP, so switch
>>> this to:
>>>     http://ia340915.us.archive.org/2/items/chemicallecturee00newtrich
>>>
>>> Hope this helps!
>>>
>>>   --SET
>>>
>>>
>>>> We're a contributor so I can use curl to grab our records via http (and
>>>> regexp my way to our local catalog identifiers, which they do
>>>> store/expose).
>>>>
>>>> I've played a bit with the z39.50 interface at indexdata
>>>> (http://www.indexdata.dk/opencontent/), but I'm not confident about the
>>>> content behind it.  I get very limited results, for instance I can't
>>>> find any UNC records and we're fairly new to the game.
>>>>
>>>> Again, I'm looking for unique identifiers in what I can get back and
>>>> it's slim pickings.
>>>>
>>>> Anyone cracked this nut?  Got any life lessons for me?
>>>>
>>>> Thanks!
>>>> Tim
>>>>
>>>> +++++++++++++++++++++++++++++++++++++++++++
>>>> Tim Shearer
>>>>
>>>> Web Development Coordinator
>>>> The University Library
>>>> University of North Carolina at Chapel Hill
>>>> [log in to unmask]
>>>> 919-962-1288
>>>> +++++++++++++++++++++++++++++++++++++++++++
>>>>
>>>
>>
>

--
Jonathan Rochkind
Digital Services Software Engineer
The Sheridan Libraries
Johns Hopkins University
410.516.8886
rochkind (at) jhu.edu