I love that user agent. This is the wget command I've used to back up sites that have pretty URLs:

wget -v --mirror -p --html-extension -e robots=off --base=./ -k -P ./ <URL>

– Jamie

________________________________________
From: Code for Libraries <[log in to unmask]> on behalf of Alexander Duryee <[log in to unmask]>
Sent: Monday, October 06, 2014 11:51 AM
To: [log in to unmask]
Subject: Re: [CODE4LIB] wget archiving for dummies

I've used wget extensively for web preservation. It's a remarkably powerful tool, but there are some notable features/caveats to be aware of:

1) You absolutely should use the --warc-file=<NAME> and --warc-header=<STRING> options. These create a WARC file alongside the usual wget file dump, which captures information essential for preservation: process provenance, server requests/responses, and the raw data before wget adjusts it. The --warc-header option records user-added metadata, such as the name, purpose, etc. of the capture. You likely won't use the WARC for access, but keeping it as a preservation copy of the site is invaluable.

2) Javascript, AJAX queries, links in rich media, and the like are completely opaque to wget. As such, you'll need to QC aggressively to ensure that you captured everything you intended to. My method was to run a generic wget capture [1], QC it, and manually download any missing objects. I'd then pass everything back into wget to create a complete WARC file containing the full capture. It's janky, but it gets the job done.

3) Do be careful of commenting options, which often turn into spider traps. The latest versions of wget have regex support, so you can blacklist URLs that you know will trap the crawler.

If the site is proving stubborn, I can take a look off-list.
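[Ed.: the regex blacklisting described in 3) can be sketched as follows. This is a hedged example, not part of the original message: the URL patterns below are common WordPress-style comment traps chosen for illustration, and --reject-regex assumes GNU wget 1.14 or later.]

```shell
# Sketch: skip per-post comment/reply links that often act as spider traps.
# The pattern is illustrative; adjust it to the trap URLs you actually observe.
# --reject-regex requires GNU wget >= 1.14.
wget --mirror --page-requisites --convert-links \
     --reject-regex '(replytocom=|/comment-page-|[?&]share=)' \
     --warc-file=capture \
     http://example.org/
```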
Best of luck,
Alex

[1] I've used the following successfully:

wget --user-agent="AmigaVoyager/3.2 (AmigaOS/MC680x0)" --warc-file=<FILENAME> --warc-header="<STRING>" --page-requisites -e robots=off --random-wait --wait=5 --recursive --level=0 --no-parent --convert-links <URL>
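[Ed.: a flag-by-flag gloss of the footnote command, based on the GNU wget manual. The glosses are the editor's reading, not part of the original message; <FILENAME>, <STRING>, and <URL> remain placeholders to fill in.]

```shell
# --user-agent        : identify as an obscure browser; some servers block "Wget"
# --warc-file/-header : write a preservation WARC alongside the file dump,
#                       with user-supplied metadata in the header
# --page-requisites   : also fetch the CSS/JS/images needed to render each page
# -e robots=off       : ignore robots.txt (use responsibly)
# --random-wait       : vary the delay between 0.5x and 1.5x of --wait
# --wait=5            : base politeness delay of 5 seconds between requests
# --recursive --level=0 : follow links with no depth limit (0 means infinite)
# --no-parent         : never ascend above the starting directory
# --convert-links     : rewrite links for local browsing (the WARC keeps raw data)
wget --user-agent="AmigaVoyager/3.2 (AmigaOS/MC680x0)" \
     --warc-file=<FILENAME> --warc-header="<STRING>" \
     --page-requisites -e robots=off --random-wait --wait=5 \
     --recursive --level=0 --no-parent --convert-links <URL>
```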