Is there anybody here who works for Internet Archive Scholar, or can somebody tell me how I might be able to download an archived file from the Wayback Machine?

A couple of weeks ago I learned about Internet Archive Scholar. [1] This is an index of scholarly content harvested from the 'Net. It is possible to query Scholar and get back JSON, and the JSON is full of cool and interesting bibliographic data. Here is a snippet of the JSON, and it describes the full text of an item:

  "fulltext": {
    "file_mimetype": "application/pdf",
    "access_type": "wayback",
    "file_sha1": "c3f8851bcae9fdfb4ee97d2b1960010ce8b3281d",
    "size_bytes": 1467536,
    "file_ident": "sddxyle4qzaz5lq3hexyw2h4my",
    "access_url": "https://web.archive.org/web/20190516143829/https://aibstudi.aib.it/article/download/11501/10805",
    "release_ident": "tu5d2xp53jg3pmwa4lyjbtj45m",
    "thumbnail_url": "https://blobs.fatcat.wiki/thumbnail/pdf/c3/f8/c3f8851bcae9fdfb4ee97d2b1960010ce8b3281d.180px.jpg"
  },

I can parse the value of access_url to get the original URL, but because of the nature of the 'Net, those URLs are broken about 33% of the time (anecdotally speaking). Yes, I can use the full access_url, but this returns an HTML page with something inside an iframe, I think. I want the actual thing, not a splash/landing/metadata page.

Is there a way to programmatically reverse engineer the value of access_url (sans screen scraping) and get back a URL pointing to the item? 
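
One thing that might be worth trying, offered as a sketch rather than an answer: the Wayback Machine is commonly described as accepting an "id_" modifier appended to the timestamp in a capture URL, which asks for the originally archived bytes instead of the wrapped replay page. Assuming that holds for these captures, the rewrite is a small string operation on access_url. The function names below (raw_wayback_url, sha1_matches, download) are made up for illustration, and comparing the download against file_sha1 is only a guess at a reasonable sanity check.

```python
import hashlib
import re
import urllib.request

# Wayback capture URLs look like /web/<14-digit timestamp>/<original URL>
WAYBACK = re.compile(r"^(https?://web\.archive\.org/web/)(\d{14})/(.+)$")


def raw_wayback_url(access_url):
    """Append 'id_' to the timestamp; return None if the URL is not a plain capture URL."""
    match = WAYBACK.match(access_url)
    if match is None:
        return None
    prefix, timestamp, original = match.groups()
    return f"{prefix}{timestamp}id_/{original}"


def sha1_matches(data, file_sha1):
    """Compare the SHA-1 of the downloaded bytes with the file_sha1 value from the JSON."""
    return hashlib.sha1(data).hexdigest() == file_sha1.lower()


def download(access_url, timeout=60):
    """Fetch the (assumed) raw capture and return its bytes."""
    url = raw_wayback_url(access_url)
    if url is None:
        raise ValueError(f"not a Wayback capture URL: {access_url}")
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()
```

With the snippet above, download(fulltext["access_url"]) followed by sha1_matches(data, fulltext["file_sha1"]) would show whether the bytes that came back are the PDF that Scholar indexed or just another HTML wrapper.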

By the way, the Internet Archive Scholar is pretty nifty. You can query the index, get back a bucket o' JSON, parse the JSON and pour it in a database, query the database, harvest the full text of items, and then send the result off to my Reader. This morning I used the query "Henry David Thoreau", downloaded almost 1,600 journal articles, and proceeded to "read" them. The whole process -- from beginning to end -- took about twenty minutes. There is no way one can search for, download, and "read" 1,600 articles from a vended index.
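
The "parse the JSON and pour it in a database" step can be sketched roughly as follows. This is only an illustration: the table name, the choice of columns, and the function name store_fulltext are assumptions, and the only fields used are the ones visible in the "fulltext" snippet above.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS fulltext (
    release_ident TEXT PRIMARY KEY,
    file_sha1     TEXT,
    file_mimetype TEXT,
    size_bytes    INTEGER,
    access_url    TEXT
)
"""


def store_fulltext(connection, record):
    """Save the 'fulltext' stanza of one search hit; return False if there is none."""
    connection.execute(SCHEMA)
    fulltext = record.get("fulltext")
    if not fulltext:
        return False
    connection.execute(
        "INSERT OR REPLACE INTO fulltext VALUES (?, ?, ?, ?, ?)",
        (
            fulltext.get("release_ident"),
            fulltext.get("file_sha1"),
            fulltext.get("file_mimetype"),
            fulltext.get("size_bytes"),
            fulltext.get("access_url"),
        ),
    )
    connection.commit()
    return True
```

Once the rows are in place, a query such as SELECT access_url FROM fulltext WHERE file_mimetype = 'application/pdf' gives the list of things to harvest.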

Again, the value of access_url returns an HTML page, but what I really want is the thing in-and-of itself. Is there a way to do this?


[1] Internet Archive Scholar - https://scholar.archive.org/about

-- 
Eric Lease Morgan
Digital Initiatives Librarian, Navari Family Center for Digital Scholarship
Hesburgh Libraries

University of Notre Dame
250E Hesburgh Library
Notre Dame, IN 46556
o: 574-631-8604
e: [log in to unmask]
w: cds.library.nd.edu