I was dealing with a lot of sites that would shunt the user around based
on their user agent (e.g. very old sites that had completely different
pages for Netscape and IE), so I needed something neutral that wouldn't
get caught in a browser-specific branch. Suffice it to say, nothing ever
checks for Amiga browsers :)

On Mon, Oct 6, 2014 at 12:08 PM, Little, James Clarence IV <
[log in to unmask]> wrote:

> I love that user agent.
>
> This is the wget command I've used to back up sites that have pretty URLs:
>
> wget -v --mirror -p --html-extension -e robots=off --base=./ -k -P ./ <URL>
>
> – Jamie
> ________________________________________
> From: Code for Libraries <[log in to unmask]> on behalf of
> Alexander Duryee <[log in to unmask]>
> Sent: Monday, October 06, 2014 11:51 AM
> To: [log in to unmask]
> Subject: Re: [CODE4LIB] wget archiving for dummies
>
> I've used wget extensively for web preservation. It's a remarkably
> powerful tool, but there are some notable features/caveats to be
> aware of:
>
> 1) You absolutely should use the --warc-file=<NAME> and
> --warc-header=<STRING> options. These create a WARC file alongside the
> usual wget filedump; the WARC captures information essential for
> preservation (process provenance, server requests and responses, raw
> data before wget adjusts it). The --warc-header option lets you record
> user-added metadata, such as the name and purpose of the capture. You
> likely won't use the WARC for access, but keeping it as a preservation
> copy of the site is invaluable.
>
> 2) JavaScript, AJAX queries, links in rich media, and the like are
> completely opaque to wget, so you'll need to QC aggressively to ensure
> that you captured everything you intended to. My method was to run a
> generic wget capture[1], QC it, and manually download any missing
> objects. I'd then pass everything back through wget to create a
> complete WARC file containing the full capture. It's janky, but it
> gets the job done.
>
> 3) Do be careful of comment features on sites, which often turn into
> spider traps. The latest versions of wget have regex support, so you
> can blacklist URLs that you know will trap the crawler.
>
> If the site is proving stubborn, I can take a look off-list.
>
> Best of luck,
> Alex
>
> [1] I've used the following successfully:
>
> wget --user-agent="AmigaVoyager/3.2 (AmigaOS/MC680x0)"
> --warc-file=<FILENAME> --warc-header="<STRING>" --page-requisites
> -e robots=off --random-wait --wait=5 --recursive --level=0
> --no-parent --convert-links <URL>
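
To make point 1 in the quoted message concrete, here's a minimal sketch
of a WARC-producing capture. The filename and header values are
placeholders of my own, not anything prescribed by wget; --warc-header
can be repeated, and each "field: value" string is passed through into
the warcinfo record:

wget --mirror --page-requisites \
     --warc-file=example-capture \
     --warc-header="operator: Your Name" \
     --warc-header="description: preservation capture of example.org" \
     http://example.org/

(--warc-file takes the name without an extension; wget writes
example-capture.warc.gz next to the usual mirror directory.)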
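
One way to run the patch-up pass from point 2, assuming you've collected
the URLs that QC turned up into a plain-text list (one URL per line), is
to feed the list back through wget and write a supplementary WARC
alongside the first; the list filename here is hypothetical:

wget --input-file=missing-urls.txt \
     --page-requisites -e robots=off --wait=5 \
     --warc-file=example-capture-patch \
     --warc-header="description: objects missed by the initial crawl"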
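
For point 3, the regex support in question is --reject-regex (available
since wget 1.14, POSIX extended regex by default). The pattern below is
a hypothetical example of the comment/sort/session URLs that tend to
trap crawlers:

wget --recursive --level=0 --no-parent --convert-links \
     --reject-regex='(comment|reply-to|sort=|sessionid=)' \
     --warc-file=example-capture <URL>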