A number of others have suggested other approaches, but since you started with wget, here are the two wget commands I recently used to archive a WordPress-behind-EZproxy site. The first logs into EZproxy and saves the login as a cookie. The second uses the cookie to access the site through EZproxy.
wget --no-check-certificate --keep-session-cookies --save-cookies cookies.txt --post-data 'user=yeatesst&pass=PASSWORD&auth=d1&url' https://login.EZPROXYMACHINE/login
wget --restrict-file-names=windows --default-page=index.php -e robots=off --mirror --user-agent="" --ignore-length --keep-session-cookies --save-cookies cookies.txt --load-cookies cookies.txt --recursive --page-requisites --convert-links --backup-converted "http://WORDPRESSMACHINE.EZPROXYMACHINE/BLOGNAME"
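If it's easier to keep the two steps together, here's a rough wrapper sketch of the same thing as a shell script. Treat it as a starting point only: the uppercase names, USERNAME/PASSWORD and the EZproxy form fields (user, pass, auth, url) are placeholders that will vary by installation.

#!/bin/sh
# Sketch only: mirror a WordPress blog sitting behind EZproxy.
# All uppercase values below are placeholders, not real hosts or credentials.
EZPROXY="login.EZPROXYMACHINE"
TARGET="http://WORDPRESSMACHINE.EZPROXYMACHINE/BLOGNAME"
COOKIES="cookies.txt"

# Step 1: log in to EZproxy and keep the session cookie.
wget --no-check-certificate --keep-session-cookies \
  --save-cookies "$COOKIES" \
  --post-data 'user=USERNAME&pass=PASSWORD&auth=d1&url' \
  "https://$EZPROXY/login"

# Step 2: mirror the blog through EZproxy, reusing that cookie.
wget --restrict-file-names=windows --default-page=index.php -e robots=off \
  --mirror --user-agent="" --ignore-length \
  --keep-session-cookies --save-cookies "$COOKIES" --load-cookies "$COOKIES" \
  --recursive --page-requisites --convert-links --backup-converted \
  "$TARGET"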
cheers
stuart
-----Original Message-----
From: Code for Libraries [mailto:[log in to unmask]] On Behalf Of Eric Phetteplace
Sent: Monday, 6 October 2014 7:44 p.m.
To: [log in to unmask]
Subject: [CODE4LIB] wget archiving for dummies
Hey C4L,
If I wanted to archive a WordPress site, how would I do so?
More elaborate: our library recently got a "donation" of a remote site, sitting one directory below the root of a domain, and I can tell from a cursory look that it's a WordPress site. We've never archived a website before and I don't need to do anything fancy, just download a workable copy as it presently exists. I've heard this can be as simple as:
wget -m $PATH_TO_SITE_ROOT
but that's not working as planned. Wget's --convert-links feature doesn't seem to be quite so simple; if I download the site, disable my network connection, then host it locally, some 20 resources aren't available, mostly images under the same directory, possibly loaded via AJAX. Advice?
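As a quick sanity check on a mirror, something along these lines lists the references that still point back at the live site, i.e. the ones that break offline. It's a rough sketch: EXAMPLE.ORG stands in for the real domain, and it assumes wget's default behaviour of saving into a directory named after the host.

# list absolute URLs in the mirrored files that still point at the live host
grep -rhoE "https?://EXAMPLE\.ORG[^\"' )<>]*" EXAMPLE.ORG/ | sort -u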
(Anticipated) pertinent advice: I shouldn't be doing this at all; we should outsource to Archive-It or similar, who actually know what they're doing.
Yes/no?
Best,
Eric