https://archive-it.org/, the subscription service of https://archive.org/,
handles login-protected sites.

We've found them very helpful and the software just works, but we've
never done any password-protected sites ourselves.

cheers
stuart

--
...let us be heard from red core to black sky

On Thu, Jan 19, 2017 at 5:54 AM, Nicholas Taylor <[log in to unmask]> wrote:

> Hi Alex,
>
> If you don't mind having your data in WARC format, you could use:
> * The Webrecorder web service (https://webrecorder.io/), which records the
> pages you browse into an archive. It works well if you only have a small
> number of pages to archive, and has the advantage that it can capture
> whatever you can access via your browser. Just make sure to set the
> collection to private and/or download and delete it once completed.
> * The Heritrix archival crawler supports HTTP authentication (
> https://webarchive.jira.com/wiki/display/Heritrix/Credentials), much like
> HTTrack or wget, with the added advantage of storing the files in WARC. A
> rough wget equivalent is sketched below.
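>
> Since Heritrix's credentials live in its crawl configuration rather than on
> the command line, the comparable wget invocation for plain HTTP auth looks
> roughly like this, off the top of my head (untested; example.org, USER, and
> PASS are placeholders):
>
>   # Sketch: recursively fetch a site behind HTTP Basic auth with wget.
>   # example.org, USER, and PASS are placeholders; tune depth as needed.
>   wget --user=USER --password=PASS \
>        --recursive --level=2 --page-requisites --convert-links \
>        https://example.org/protected/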
>
> ~Nicholas
>
> -----Original Message-----
> From: Alex Armstrong [mailto:[log in to unmask]]
> Sent: Tuesday, January 17, 2017 7:09 AM
> Subject: Re: How to archive selected pages from a site requiring
> authentication
>
> Hi Mike & Tom,
>
> I didn’t clarify in my original question that I’m looking to access a site
> that uses form-based authentication.
>
> You’re both pointing me to the same approach, which is to provide cookies
> to a CLI tool. You suggest wget, I began by looking at httrack, and someone
> off-list suggested curl. All of these should work :)
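>
> From the docs, the plan looks roughly like this, though I haven’t tested it
> yet (the login URL and the "user"/"pass" field names are placeholders for
> whatever the site actually uses):
>
>   # Untested sketch: log in once, saving the session cookie...
>   wget --save-cookies cookies.txt --keep-session-cookies \
>        --post-data 'user=USERNAME&pass=PASSWORD' \
>        https://example.org/login
>   # ...then reuse it to fetch the protected pages.
>   wget --load-cookies cookies.txt --page-requisites --convert-links \
>        https://example.org/protected/page.html
>   # curl can read the same Netscape-format cookie file:
>   curl -b cookies.txt -o page.html https://example.org/protected/page.html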
>
> I’ve been too swamped by other work to try this yet, but my next steps are
> surer now. Thanks, folks!
>
> Alex
>
> On 15 January 2017 at 01:49:20, Hagedon, Mike - (mhagedon) (
> [log in to unmask]) wrote:
>
> Hi Alex,
> It might really depend on the kind of authentication used, but a number of
> years ago I had to do something similar for a site protected by university
> (CAS) authn. If I recall correctly, I logged into the site with Firefox,
> and then told wget to use Firefox's cookies. More or less like the
> "easy" version of the accepted answer here:
>
> http://askubuntu.com/questions/161778/how-do-i-use-wget-curl-to-download-from-a-site-i-am-logged-into
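>
> In outline, as best I remember it (paths and URLs are placeholders): log in
> with Firefox, export the cookies to a Netscape-format cookies.txt (a
> browser extension can do this), then point wget at that file:
>
>   # Sketch from memory: cookies.txt was exported from Firefox after login.
>   wget --load-cookies cookies.txt \
>        --mirror --page-requisites --convert-links \
>        https://example.org/protected/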
>
> Mike
>
> Mike Hagedon | Team Lead for Software & Web Development (Dev) | Technology
> Strategy & Services | University of Arizona Libraries
>
>
> -----Original Message-----
> From: Code for Libraries [mailto:[log in to unmask]] On Behalf Of
> Alex Armstrong
> Sent: Friday, January 13, 2017 12:42 AM
> To: [log in to unmask]
> Subject: [CODE4LIB] How to archive selected pages from a site requiring
> authentication
>
> Has anyone had to archive selected pages from a login-protected site? How
> did you do it?
>
> I've used the CLI tool httrack in the past for archiving sites. But in
> this case, accessing the pages requires logging in. There's some vague
> documentation about how to do this with httrack, but I haven't cracked it
> yet. (The instructions are better for the Windows version of the
> application, but I only have ready access to a Mac.)
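>
> My best guess so far, untested (example.org and the paths are
> placeholders), is that httrack will pick up a Netscape-format cookies.txt
> dropped into the project folder:
>
>   # Untested guess based on the httrack docs: it reportedly reads a
>   # Netscape-format cookies.txt from the project directory.
>   mkdir -p ./mysite
>   cp ~/cookies.txt ./mysite/cookies.txt
>   httrack "https://example.org/protected/page.html" -O ./mysite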
>
> Before I go on a wild goose chase, any help would be much appreciated.
>
> Alex
>
> --
> Alex Armstrong
> Web Developer & Digital Strategist, AMICAL Consortium
> [log in to unmask]
>