LISTSERV 16.5 - CODE4LIB Archives

Hi Alex,

If you don't mind having your data in WARC format, you could use:
* The Webrecorder web service (https://webrecorder.io/), which records the pages you browse to an archive. It works well if you only have a small number of pages to archive, and it has the advantage that it can archive whatever you can access via your browser. Just make sure to set the collection to private and/or download and delete it once completed.
* The Heritrix archival crawler supports HTTP authentication (https://webarchive.jira.com/wiki/display/Heritrix/Credentials), much like HTTrack or wget, with the added advantage of storing the files in WARC.

~Nicholas

-----Original Message-----
From: Alex Armstrong [mailto:[log in to unmask]] 
Sent: Tuesday, January 17, 2017 7:09 AM
Subject: Re: How to archive selected pages from a site requiring authentication

Hi Mike & Tom,

I didn’t clarify in my original question that I’m looking to access a site that uses form-based authentication.

You’re both pointing me to the same approach, which is to provide cookies to a CLI tool. You suggest wget, I began by looking at httrack, and someone off-list suggested curl. All of these should work :)
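
For a form-based login, the same idea can be sketched without a browser: submit the login form once, keep whatever session cookie comes back, and reuse it for the pages to be archived. The Python below is a minimal sketch using only the standard library. The login URL and the field names ("username", "password") are hypothetical; a real site's form may use different names or require extra hidden fields such as a CSRF token, so inspect the actual form first.

```python
import http.cookiejar
import urllib.parse
import urllib.request


def make_session():
    """Create an opener that remembers cookies between requests."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def build_login_request(login_url, fields):
    """Build the POST request a browser would send on submitting the form."""
    body = urllib.parse.urlencode(fields).encode("utf-8")
    return urllib.request.Request(login_url, data=body, method="POST")


def log_in(opener, login_url, fields):
    """Submit the form; any cookie set by the response is stored in the jar."""
    with opener.open(build_login_request(login_url, fields)) as response:
        return response.status


# Hypothetical usage (URL and field names are placeholders):
# opener, jar = make_session()
# log_in(opener, "https://example.org/login",
#        {"username": "alex", "password": "..."})
# html = opener.open("https://example.org/members/page1").read()
```

Later requests made through the same opener send the stored cookies automatically, which is the piece that wget, curl and httrack each need to be handed explicitly.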

I’ve been too swamped by other work to try this, but my next steps are surer now. Thanks folks!

Alex

On 15 January 2017 at 01:49:20, Hagedon, Mike - (mhagedon) ([log in to unmask]) wrote:

Hi Alex,  
It might really depend on the kind of authentication used, but a number of years ago I had to do something similar for a site protected by university (CAS) authn. If I recall correctly, I logged into the site with Firefox, and then told wget to use Firefox cookies. More or less like the "easy" version of the accepted answer here:  

http://askubuntu.com/questions/161778/how-do-i-use-wget-curl-to-download-from-a-site-i-am-logged-into  
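
The browser-cookie approach can also be sketched in Python rather than wget. This assumes the browser's cookies have been exported to a Netscape-format cookies.txt file (the format wget's --load-cookies option reads); the file name, URLs and function names below are placeholders for illustration, not anything from the linked answer.

```python
import http.cookiejar
import urllib.request


def build_opener_from_cookie_file(cookie_path):
    """Return an opener that sends the cookies stored in cookie_path."""
    jar = http.cookiejar.MozillaCookieJar(cookie_path)
    # Keep session and already-expired entries too; login sessions are
    # often stored as session cookies.
    jar.load(ignore_discard=True, ignore_expires=True)
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def save_page(opener, url, out_path):
    """Fetch url through the opener and write the raw bytes to out_path."""
    with opener.open(url) as response:
        data = response.read()
    with open(out_path, "wb") as fh:
        fh.write(data)
    return len(data)


# Hypothetical usage (file name and URL are placeholders):
# opener, jar = build_opener_from_cookie_file("cookies.txt")
# save_page(opener, "https://example.org/members/page1", "page1.html")
```

This only saves the raw response for each URL; unlike wget or httrack it does not follow links or fetch page requisites such as images and stylesheets.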

Mike  

Mike Hagedon | Team Lead for Software & Web Development (Dev) | Technology Strategy & Services | University of Arizona Libraries  


-----Original Message-----  
From: Code for Libraries [mailto:[log in to unmask]] On Behalf Of Alex Armstrong  
Sent: Friday, January 13, 2017 12:42 AM  
To: [log in to unmask]  
Subject: [CODE4LIB] How to archive selected pages from a site requiring authentication  

Has anyone had to archive selected pages from a login-protected site? How did you do it?  

I've used the CLI tool httrack in the past for archiving sites. But in this case, accessing the pages requires logging in. There's some vague documentation about how to do this with httrack, but I haven't cracked it yet. (The instructions are better for the Windows version of the application, but I only have ready access to a Mac.)  

Before I go on a wild goose chase, any help would be much appreciated.  

Alex  

--  
Alex Armstrong  
Web Developer & Digital Strategist, AMICAL Consortium [log in to unmask]