I probably don't understand the question ...

(Disclaimer:  this is on Windows, YMMV)

Using the complete access url (starting with https://web.archive.org/...),
you can get the PDF document using curl:

     curl access_url --output something.pdf

You probably want to see the mime type before you download the file so you
can set the file suffix correctly.  You can get that using:

     curl --head access_url

Also, it's straightforward in Python because the requests library pretty
much handles everything automagically:

---
import requests
import sys

if len(sys.argv) == 3:
    response = requests.get(sys.argv[1])
    print("Status code is:", response.status_code)
    print("Content type is:",
          response.headers.get("content-type", "(no content type specified)"))
    with open(sys.argv[2], "wb") as out:
        out.write(response.content)
else:
    print("Two arguments:  url and output file")
---
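If you want the script to pick the file suffix from the content-type automatically, the standard library's mimetypes module can map one to the other. Here's a minimal sketch (suffix_for is a hypothetical helper name, not part of the script above):

```python
import mimetypes

def suffix_for(content_type):
    # Content-Type may carry parameters, e.g. "application/pdf; charset=binary";
    # keep only the mime type itself.
    mime = content_type.split(";")[0].strip().lower()
    # Fall back to ".bin" for types mimetypes doesn't know.
    return mimetypes.guess_extension(mime) or ".bin"

print(suffix_for("application/pdf"))  # prints: .pdf
```

You could then write to sys.argv[2] + suffix_for(...) instead of trusting the caller to supply the right extension.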

Does that help?

(btw, the browser probably gets an HTML page from the access url because it
ASKS for an HTML page.)
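To make that point concrete: a script can say what it wants via the Accept request header. This sketch (standard library only, using the access_url from the JSON quoted below) just builds the request and shows the header it would carry; it doesn't hit the network, and whether Wayback honors the header is a separate question:

```python
import urllib.request

# The access_url from the quoted JSON snippet
url = ("https://web.archive.org/web/20190516143829/"
       "https://aibstudi.aib.it/article/download/11501/10805")

# A browser sends "Accept: text/html,..." by default; a script
# can ask for the PDF explicitly instead.
req = urllib.request.Request(url, headers={"Accept": "application/pdf"})
print(req.get_header("Accept"))  # prints: application/pdf
```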

Graeme Williams
Las Vegas, NV
github.com/lagbolt

On Mon, Mar 22, 2021 at 10:04 AM Eric Lease Morgan <[log in to unmask]> wrote:

> Is there anybody here who works for Internet Archive Scholar, or can
> somebody tell me how I might be able to download an archived file from the
> Wayback Machine?
>
> A couple of weeks ago I learned about Internet Archive Scholar. [1] This
> is an index of scholarly content harvested from the 'Net. It is possible to
> query Scholar and get back JSON, and the JSON is full of cool and
> interesting bibliographic data. Here is a snippet of the JSON, and it
> describes the full text of an item:
>
>   "fulltext": {
>         "file_mimetype": "application/pdf",
>         "access_type": "wayback",
>         "file_sha1": "c3f8851bcae9fdfb4ee97d2b1960010ce8b3281d",
>         "size_bytes": 1467536,
>         "file_ident": "sddxyle4qzaz5lq3hexyw2h4my",
>         "access_url": "https://web.archive.org/web/20190516143829/https://aibstudi.aib.it/article/download/11501/10805",
>         "release_ident": "tu5d2xp53jg3pmwa4lyjbtj45m",
>         "thumbnail_url": "https://blobs.fatcat.wiki/thumbnail/pdf/c3/f8/c3f8851bcae9fdfb4ee97d2b1960010ce8b3281d.180px.jpg"
>   },
>
> I can parse the value of access_url to get a URL, but because of the
> nature of the 'Net, the URLs are broken about 33% of the time (anecdotally
> speaking). Yes, I can use the full access_url, but this returns an HTML
> page with the thing inside an iframe, I think. I want the actual thing,
> not a splash/landing/metadata page.
>
> Is there a way to programmatically reverse engineer the value of
> access_url (sans screen scraping) and get back a URL pointing to the item?
>
> By the way, the Internet Archive Scholar is pretty nifty. You can query
> the index, get back a bucket o' JSON, parse the JSON and pour it in a
> database, query the database, harvest the full text of items, and then send
> the result off to my Reader. This morning I used the query "Henry David
> Thoreau", downloaded almost 1,600 journal articles, and proceeded to "read"
> them. The whole process -- from beginning to end -- took about twenty
> minutes. There's no way one can search for, download, and "read" 1,600
> articles from a vended index.
>
> Again, the value of access_url returns an HTML page, but what I really
> want is the thing in-and-of itself. Is there a way to do this?
>
>
> [1] Internet Archive Scholar - https://scholar.archive.org/about
>
> --
> Eric Lease Morgan
> Digital Initiatives Librarian, Navari Family Center for Digital Scholarship
> Hesburgh Libraries
>
> University of Notre Dame
> 250E Hesburgh Library
> Notre Dame, IN 46556
> o: 574-631-8604
> e: [log in to unmask]
> w: cds.library.nd.edu
>