Others have suggested different approaches, but since you started with wget, here are the two wget commands I recently used to archive a WordPress-behind-EZproxy site. The first logs into EZproxy and saves the login as a cookie; the second uses that cookie to access the site through EZproxy:

wget --no-check-certificate --keep-session-cookies --save-cookies cookies.txt --post-data 'user=yeatesst&pass=PASSWORD&auth=d1&url' https://login.EZPROXYMACHINE/login

wget --restrict-file-names=windows --default-page=index.php -e robots=off --mirror --user-agent="" --ignore-length --keep-session-cookies --save-cookies cookies.txt --load-cookies cookies.txt --recursive --page-requisites --convert-links --backup-converted "http://WORDPRESSMACHINE.EZPROXYMACHINE/BLOGNAME"
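
If the EZproxy step isn't needed, the mirroring options in that second command are the ones that matter for the missing-resources problem in the question below. A minimal sketch along those lines, reusing $PATH_TO_SITE_ROOT from the original message and adding --adjust-extension (not in the command above) so saved pages get .html extensions for local browsing:

# mirror the site plus the images/CSS/JS each page needs, for offline viewing
wget --mirror --page-requisites --convert-links --adjust-extension \
     -e robots=off --user-agent="" "$PATH_TO_SITE_ROOT"

--page-requisites pulls in the images, stylesheets and scripts each page references, and --convert-links rewrites those references so the local copy works offline; anything fetched at runtime via AJAX still won't be captured, since wget doesn't execute JavaScript.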

cheers
stuart


-----Original Message-----
From: Code for Libraries [mailto:[log in to unmask]] On Behalf Of Eric Phetteplace
Sent: Monday, 6 October 2014 7:44 p.m.
To: [log in to unmask]
Subject: [CODE4LIB] wget archiving for dummies

Hey C4L,

If I wanted to archive a WordPress site, how would I do so?

More elaborate: our library recently got a "donation" of a remote WordPress site, sitting one directory below the root of a domain. I can tell from a cursory look that it's a WordPress site. We've never archived a website before, and I don't need to do anything fancy, just download a workable copy as it presently exists. I've heard this can be as simple as:

wget -m $PATH_TO_SITE_ROOT

but that's not working as planned. Wget's convert-links feature doesn't seem to be quite so simple; if I download the site, disable my network connection, and then host it locally, some 20 resources aren't available, mostly images under the same directory, possibly loaded via AJAX. Advice?

(Anticipated) pertinent advice: I shouldn't be doing this at all; we should outsource to Archive-It or similar, who actually know what they're doing. Yes/no?

Best,
Eric