If you don't mind having your data in WARC format, you could use:
* The Webrecorder web service (https://webrecorder.io/), which records pages you browse into an archive. It works well if you only have a small number of pages to archive, and has the advantage that it can capture whatever you can access via your browser. Just make sure to set the collection to private, and/or download and delete it once you're done.
* The Heritrix archival crawler, which supports HTTP authentication (https://webarchive.jira.com/wiki/display/Heritrix/Credentials), much like HTTrack or wget, with the added advantage of storing the files in WARC.
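Incidentally, recent versions of wget can also write WARC output directly via its --warc-file option, which may be the lightest-weight route to WARC. A minimal sketch, built as a dry run (the URL and file name are placeholders, not from the thread):

```shell
# Build the wget command as a string and echo it (dry run).
# --warc-file is a real wget option (wget >= 1.14); it writes a
# .warc.gz alongside the normal download. URL is a placeholder.
URL="https://example.edu/page/"
CMD="wget --warc-file=archive --page-requisites $URL"
echo "$CMD"   # pipe to sh to actually run it
```

This produces archive.warc.gz in the current directory in addition to the mirrored files.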
From: Alex Armstrong [mailto:[log in to unmask]]
Sent: Tuesday, January 17, 2017 7:09 AM
Subject: Re: How to archive selected pages from a site requiring authentication
Hi Mike & Tom,
I didn’t clarify in my original question that I’m looking to access a site that uses form-based authentication.
You’re both pointing me to the same approach, which is to provide cookies to a CLI tool. You suggest wget, I began by looking at httrack, and someone off-list suggested curl. All of these should work :)
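For the curl route, a hedged sketch of form-based authentication, written as a dry run: POST the login form once to capture the session cookie, then reuse the cookie jar for the protected page. The URLs and the form field names (username/password) are assumptions; inspect the site's actual login form to find the real ones.

```shell
# Step 1: POST the login form, saving the session cookie (-c writes
# a cookie jar). Step 2: fetch a protected page with the saved
# cookies (-b reads the jar). All URLs/field names are placeholders.
LOGIN_URL="https://example.edu/login"
PAGE_URL="https://example.edu/protected/page"
LOGIN_CMD="curl -c cookies.txt -d username=me -d password=secret $LOGIN_URL"
FETCH_CMD="curl -b cookies.txt -o page.html $PAGE_URL"
echo "$LOGIN_CMD"   # dry run; pipe to sh to execute
echo "$FETCH_CMD"
```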
I’ve been too swamped by other work to try this yet, but my next steps are surer now. Thanks, folks!
On 15 January 2017 at 01:49:20, Hagedon, Mike - (mhagedon) ([log in to unmask]) wrote:
It might really depend on the kind of authentication used, but a number of years ago I had to do something similar for a site protected by university (CAS) authn. If I recall correctly, I logged into the site with Firefox, and then told wget to use Firefox's cookies. More or less like the "easy" version of the accepted answer here:
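The approach Mike describes can be sketched roughly as follows, as a dry run: log in with Firefox, export the session cookies in Netscape "cookies.txt" format (e.g. via a cookies.txt browser extension), then hand the file to wget. --load-cookies is a real wget option; the URL and file paths are placeholders.

```shell
# Reuse a browser login session with wget: point --load-cookies at a
# Netscape-format cookies.txt exported from the logged-in browser.
COOKIES="cookies.txt"
URL="https://protected.example.edu/page/"
CMD="wget --load-cookies $COOKIES --page-requisites --convert-links $URL"
echo "$CMD"   # dry run; pipe to sh to actually fetch
```

Note that session cookies expire, so export them shortly before crawling.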
Mike Hagedon | Team Lead for Software & Web Development (Dev) | Technology Strategy & Services | University of Arizona Libraries
From: Code for Libraries [mailto:[log in to unmask]] On Behalf Of Alex Armstrong
Sent: Friday, January 13, 2017 12:42 AM
To: [log in to unmask]
Subject: [CODE4LIB] How to archive selected pages from a site requiring authentication
Has anyone had to archive selected pages from a login-protected site? How did you do it?
I've used the CLI tool httrack in the past for archiving sites. But in this case, accessing the pages requires logging in. There's some vague documentation about how to do this with httrack, but I haven't cracked it yet. (The instructions are better for the Windows version of the application, but I only have ready access to a Mac.)
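[For the archive: the httrack route that eventually emerged from the thread's cookie suggestions can be sketched as follows. HTTrack can reportedly reuse cookies from a Netscape-format cookies.txt placed in the project (output) directory; the path and URL below are placeholders, and you would export cookies.txt from a logged-in browser first.]

```shell
# Hedged sketch: place an exported cookies.txt in the HTTrack project
# directory, then mirror with -O pointing at that directory. Uses a
# temp dir as a placeholder project path.
PROJECT_DIR="$(mktemp -d)"
: > "$PROJECT_DIR/cookies.txt"   # replace with your exported cookies
CMD="httrack https://protected.example.edu/ -O $PROJECT_DIR"
echo "$CMD"   # dry run; pipe to sh to run the mirror
```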
Before I go on a wild goose chase, any help would be much appreciated.
Web Developer & Digital Strategist, AMICAL Consortium [log in to unmask]