Thanks for the advice, all. I'm trying httrack now, but the other wget
options are good to know about, especially Alex's point about saving a WARC
file.

One clarification: I definitely don't want to deal with the database, nor
can I. We don't have admin or server access. Even if we did, I don't think
preserving the db would be wise or necessary.

Best,
Eric

On Mon, Oct 6, 2014 at 9:24 AM, Alexander Duryee <[log in to unmask]>
wrote:

> I was dealing with a lot of sites that would shunt the user around based on
> their user agent (e.g. very old sites that had completely different pages
> for Netscape and IE), so I needed something neutral that wouldn't get
> caught in a browser-specific branch.  Suffice it to say, nothing ever checks
> for Amiga browsers :)
>
> On Mon, Oct 6, 2014 at 12:08 PM, Little, James Clarence IV <
> [log in to unmask]> wrote:
>
> > I love that user agent.
> >
> > This is the wget command I've used to back up sites that have pretty URLs:
> >
> > wget -v --mirror -p --html-extension -e robots=off --base=./ -k -P ./ <URL>
> >
> >
> > – Jamie
> > ________________________________________
> > From: Code for Libraries <[log in to unmask]> on behalf of
> > Alexander Duryee <[log in to unmask]>
> > Sent: Monday, October 06, 2014 11:51 AM
> > To: [log in to unmask]
> > Subject: Re: [CODE4LIB] wget archiving for dummies
> >
> > I've used wget extensively for web preservation.  It's a remarkably
> > powerful tool, but there are some notable features/caveats to be aware of:
> >
> > 1) You absolutely should use the --warc-file=<NAME> and
> > --warc-header=<STRING> options.  These will create a WARC file alongside
> > the usual wget filedump, which captures essential information (process
> > provenance, server requests/responses, raw data before wget adjusts it)
> > for preservation.  The warc-header option includes user-added metadata,
> > such as the name, purpose, etc. of the capture.  It's likely that you
> > won't use the WARC for access, but keeping it as a preservation copy of
> > the site is invaluable.
> >
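> > For example, a minimal sketch of such a capture (the --warc-file name and
> > the header strings are only placeholders to adapt; wget appends .warc.gz
> > to the name you give it):
> >
> > # recursive capture that also writes a WARC with descriptive metadata
> > wget --recursive --level=inf --no-parent --page-requisites \
> >      --warc-file=mysite-2014-10-06 \
> >      --warc-header="operator: Your Name" \
> >      --warc-header="description: capture of mysite, October 2014" \
> >      <URL>
> >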
> > 2) JavaScript, AJAX queries, links in rich media, and such are completely
> > opaque to wget.  As such, you'll need to QC aggressively to ensure that
> > you captured everything you intended to.  My method was to run a generic
> > wget capture[1], QC it, and manually download missing objects.  I'd then
> > pass everything back into wget to create a complete WARC file containing
> > the full capture.  It's janky, but it gets the job done.
> >
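> > In practice that last pass can look something like this (urls.txt is just
> > a stand-in for a list holding the seed URL plus whatever QC turned up;
> > reuse whatever options the original crawl had):
> >
> > # re-run the capture over the full URL list to end up with one complete WARC
> > wget --input-file=urls.txt --recursive --level=inf --no-parent \
> >      --page-requisites --warc-file=<FILENAME> \
> >      -e robots=off --wait=5 --random-wait
> >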
> > 3) Do be careful of commenting features, which often turn into spider
> > traps.  The latest versions of wget have regex support, so you can
> > blacklist certain URLs that you know will trap the crawler.
> >
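> > With a reasonably recent wget (1.14 or later, if I remember right) that
> > blacklisting can be done with --reject-regex; the pattern below is only an
> > illustration of the reply/sort parameters that tend to cause trouble:
> >
> > # skip URLs matching the pattern so the crawler stays out of comment traps
> > wget --recursive --level=inf --no-parent --page-requisites \
> >      --reject-regex='[?&](replytocom|share|action|sort)=' \
> >      --warc-file=<FILENAME> -e robots=off --wait=5 <URL>
> >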
> > If the site is proving stubborn, I can take a look off-list.
> >
> > Best of luck,
> > Alex
> >
> > [1] I've used the following successfully:
> >
> > wget --user-agent="AmigaVoyager/3.2 (AmigaOS/MC680x0)" \
> >      --warc-file=<FILENAME> --warc-header="<STRING>" \
> >      --page-requisites -e robots=off --random-wait --wait=5 \
> >      --recursive --level=0 --no-parent --convert-links <URL>
> >
>