I probably don't understand the question ...

(Disclaimer: this is on Windows, YMMV)

Using the complete access URL (starting with https://web.archive.org/...), you can get the PDF document using curl:

curl access_url --output something.pdf

You probably want to see the MIME type before you download the file so you can set the file suffix correctly. You can get that using:

curl --head access_url

Also, it's straightforward in Python because the requests library pretty much handles everything automagically:

---
import requests
import sys

# Usage: python script.py <url> <output file>
if len(sys.argv) == 3:
    # Fetch the URL; requests follows redirects by default
    response = requests.get(sys.argv[1])
    print("Status code is:", response.status_code)
    print("Content type is:", response.headers.get("content-type", "(no content type specified)"))
    # Write the raw bytes of the response body to the output file
    with open(sys.argv[2], "wb") as out:
        out.write(response.content)
else:
    print("Two arguments: url and output file")
---

Does that help?

(btw, the browser probably gets an HTML page from the access URL because it ASKS for an HTML page.)

Graeme Williams
Las Vegas, NV
github.com/lagbolt

On Mon, Mar 22, 2021 at 10:04 AM Eric Lease Morgan <[log in to unmask]> wrote:

> Is there anybody here who works for Internet Archive Scholar, or can
> somebody tell me how I might be able to download an archived file from the
> Wayback Machine?
>
> A couple of weeks ago I learned about Internet Archive Scholar. [1] This
> is an index of scholarly content harvested from the 'Net. It is possible to
> query Scholar and get back JSON, and the JSON is full of cool and
> interesting bibliographic data.
> Here is a snippet of the JSON, and it
> describes the full text of an item:
>
>   "fulltext": {
>     "file_mimetype": "application/pdf",
>     "access_type": "wayback",
>     "file_sha1": "c3f8851bcae9fdfb4ee97d2b1960010ce8b3281d",
>     "size_bytes": 1467536,
>     "file_ident": "sddxyle4qzaz5lq3hexyw2h4my",
>     "access_url": "https://web.archive.org/web/20190516143829/https://aibstudi.aib.it/article/download/11501/10805",
>     "release_ident": "tu5d2xp53jg3pmwa4lyjbtj45m",
>     "thumbnail_url": "https://blobs.fatcat.wiki/thumbnail/pdf/c3/f8/c3f8851bcae9fdfb4ee97d2b1960010ce8b3281d.180px.jpg"
>   },
>
> I can parse the value of access_url to get a URL, but because of the
> nature of the 'Net, the URLs are broken about 33% of the time (anecdotally
> speaking). Yes, I can use the full access_url, but this returns an HTML
> page with something inside an iframe, I think. I want the actual thing,
> not a splash/landing/metadata page.
>
> Is there a way to programmatically reverse engineer the value of
> access_url (sans screen scraping) and get back a URL pointing to the item?
>
> By the way, the Internet Archive Scholar is pretty nifty. You can query
> the index, get back a bucket o' JSON, parse the JSON and pour it in a
> database, query the database, harvest the full text of items, and then send
> the result off to my Reader. This morning I used the query "Henry David
> Thoreau", downloaded almost 1,600 journal articles, and proceeded to "read"
> them. The whole process -- from beginning to end -- took about twenty
> minutes. There is no way one can search for, download, and "read" 1,600
> articles from a vended index.
>
> Again, the value of access_url returns an HTML page, but what I really
> want is the thing in-and-of itself. Is there a way to do this?
>
>
> [1] Internet Archive Scholar - https://scholar.archive.org/about
>
> --
> Eric Lease Morgan
> Digital Initiatives Librarian, Navari Family Center for Digital Scholarship
> Hesburgh Libraries
>
> University of Notre Dame
> 250E Hesburgh Library
> Notre Dame, IN 46556
> o: 574-631-8604
> e: [log in to unmask]
> w: cds.library.nd.edu
>
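
P.S. The two curl steps above (look at the content type, then download) can be folded into one routine that picks the file suffix for you. What follows is only a rough sketch, not a tested harvester. The helper names clean_access_url, suffix_for and download are made up for illustration, the ".bin" fallback is an arbitrary choice, and it uses urllib from the standard library rather than requests so that it runs without installing anything. It also assumes the server answers a HEAD request with a usable Content-Type header, which may not always hold. The whitespace cleanup is there in case the access_url value picked up stray spaces or line breaks on its way through mail or a copy and paste.

```python
import mimetypes
import urllib.request


def clean_access_url(raw: str) -> str:
    """Drop all whitespace (spaces, line breaks) that crept into the URL."""
    return "".join(raw.split())


def suffix_for(content_type: str, default: str = ".bin") -> str:
    """Map a Content-Type header value to a file suffix, ignoring parameters."""
    # "application/pdf; charset=binary" -> "application/pdf"
    mime = content_type.split(";", 1)[0].strip().lower()
    return mimetypes.guess_extension(mime) or default


def download(access_url: str, stem: str) -> str:
    """HEAD the URL to learn its type, then save the body as stem + suffix."""
    url = clean_access_url(access_url)
    head = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(head) as response:
        content_type = response.headers.get("Content-Type", "")
    path = stem + suffix_for(content_type)
    with urllib.request.urlopen(url) as response, open(path, "wb") as out:
        out.write(response.read())
    return path
```

Calling download(access_url, "article") with the access_url value from the JSON should write something like article.pdf when the server reports application/pdf. If the reported type is text/html instead, that is a hint that a landing page came back rather than the document itself.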