I was dealing with a lot of sites that would shunt the user around based on
their user agent (e.g. very old sites that had completely different pages
for Netscape and IE), so I needed something neutral that wouldn't get
caught in a browser-specific branch.  Suffice it to say, nothing ever checks
for Amiga browsers :)
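
For reference, the swap itself is just the --user-agent flag; a minimal
invocation looks something like this (the URL is a placeholder):

wget --user-agent="AmigaVoyager/3.2 (AmigaOS/MC680x0)" --mirror --page-requisites <URL>

Any string works, as long as it doesn't match whatever browser-sniffing
patterns the site is checking for.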

On Mon, Oct 6, 2014 at 12:08 PM, Little, James Clarence IV <
[log in to unmask]> wrote:

> I love that user agent.
>
> This is the wget command I've used to back up sites that have pretty URLs:
>
> wget -v --mirror -p --html-extension -e robots=off --base=./ -k -P ./ <URL>
>
>
> – Jamie
> ________________________________________
> From: Code for Libraries <[log in to unmask]> on behalf of
> Alexander Duryee <[log in to unmask]>
> Sent: Monday, October 06, 2014 11:51 AM
> To: [log in to unmask]
> Subject: Re: [CODE4LIB] wget archiving for dummies
>
> I've used wget extensively for web preservation.  It's a remarkably
> powerful tool, but there are some notable features/caveats to be aware of:
>
> 1) You absolutely should use the --warc-file=<NAME> and
> --warc-header=<STRING> options.  These will create a WARC file alongside
> the usual wget filedump, which captures essential information (process
> provenance, server requests/responses, raw data before wget adjusts it) for
> preservation.  The warc-header option includes user-added metadata, such as
> the name, purpose, etc. of the capture.  It's likely that you won't use the
> WARC for access, but keeping it as a preservation copy of the site is
> invaluable.
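>
> In practice, that looks something like the following (the filename and
> header string are placeholders to fill in):
>
> wget --mirror --page-requisites --warc-file=<NAME> --warc-header="operator: <YOUR NAME>" <URL>
>
> wget will tack .warc.gz onto whatever you pass to --warc-file, so give it
> a basename rather than a full filename.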
>
> 2) JavaScript, AJAX queries, links in rich media, and such are completely
> opaque to wget.  As such, you'll need to QC aggressively to ensure that you
> captured everything you intended to.  My method was to run a generic wget
> capture[1], QC it, and manually download missing objects.  I'd then pass
> everything back into wget to create a complete WARC file containing the
> full capture.  It's janky, but gets the job done.
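>
> (If it helps: assuming the missing URLs are collected into a plain text
> file, say missing-urls.txt, one way to pull them into WARC form is
> something like
>
> wget --input-file=missing-urls.txt --page-requisites --warc-file=<NAME>-supplement
>
> which gives you a supplementary WARC to keep alongside the main capture.)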
>
> 3) Do be careful of commenting options, which often turn into spider
> traps.  The latest versions of wget have regex support, so you can
> blacklist certain URLs that you know will trap the crawler.
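>
> The relevant option is --reject-regex; for example, to skip a
> WordPress-style comment-reply trap (the pattern is just an illustration):
>
> wget --mirror --reject-regex ".*replytocom=.*" <URL>
>
> There's also --accept-regex if whitelisting is easier for a given site.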
>
> If the site is proving stubborn, I can take a look off-list.
>
> Best of luck,
> Alex
>
> [1] I've used the following successfully:
>
> wget --user-agent="AmigaVoyager/3.2 (AmigaOS/MC680x0)" \
>   --warc-file=<FILENAME> --warc-header="<STRING>" \
>   --page-requisites -e robots=off --random-wait --wait=5 \
>   --recursive --level=0 --no-parent --convert-links <URL>
>