At Wed, 23 Nov 2011 18:30:02 -0500,
Edward M. Corrado wrote:
> 
> Hello All,
> 
> I need to harvest a few Web sites in order to preserve them. I'd
> really like to preserve them using the WARC file format [1] since it
> is a standard for digital preservation. I looked at Web
> Curator Tool (WCT) and Heritrix and they seem to be good at what they
> do but are built to work on a much larger scale than what I'd like to
> do -- and that comes with a cost of increased complexity. Tools like
> wget are simple to use and can easily be scripted to accomplish my
> limited task, except the standard wget and similar tools I am familiar
> with do not support WARC. Also, I haven't been able to find a tool
> that can convert zipped files created with wget to WARC.
> 
> I did find a version of wget with warc support built in [1] from the
> Archive Team so that may be my solution, but compiling software with
> "dirty" written into the name of the zip file is maybe not the best
> long-term solution. Does anyone know of any other simple tools to
> create a WARC file (either from harvesting or converting a wget or
> similar mirror/archive)?

Hi Edward,

The WCT uses Heritrix behind the scenes. Basically Heritrix or
wget+warc are your only two solutions, unless you convert to WARC from
something else. And I have never seen another crawler that gathers the
information that needs to go into the WARC file.
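
To illustrate the "convert from something else" route, here is a rough
sketch of wrapping one already-mirrored file in a WARC/1.0 "resource"
record using only the Python standard library. The function name
warc_resource_record and the example URI are made up for illustration,
and this is not how Heritrix or the patched wget do it. Note what it
cannot do: a plain mirror no longer has the HTTP request and response
headers, so the result is a much thinner record than a crawler writes.

```python
import uuid
from datetime import datetime, timezone


def warc_resource_record(target_uri, payload, content_type="application/octet-stream"):
    """Build a single WARC/1.0 'resource' record as bytes (hypothetical helper).

    target_uri: the URL the payload was originally fetched from.
    payload: the raw bytes of the mirrored file.
    """
    now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    headers = [
        "WARC/1.0",
        "WARC-Type: resource",
        "WARC-Record-ID: <urn:uuid:%s>" % uuid.uuid4(),
        "WARC-Date: %s" % now,
        "WARC-Target-URI: %s" % target_uri,
        "Content-Type: %s" % content_type,
        "Content-Length: %d" % len(payload),
    ]
    # Header lines end with CRLF; a blank line separates headers from the block.
    head = "\r\n".join(headers).encode("utf-8") + b"\r\n\r\n"
    # Every record is terminated by two CRLFs.
    return head + payload + b"\r\n\r\n"


if __name__ == "__main__":
    # Assumed example values; substitute a real file and its original URL.
    record = warc_resource_record("http://example.org/", b"<html></html>", "text/html")
    with open("example.warc", "wb") as out:
        out.write(record)
```

Records built this way can simply be appended one after another into
the same file. It only shows the record layout, though; it does not
replace a crawler that captures the full HTTP exchange.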

Heritrix isn’t that bad to get up & running. The more tricky issue is
what to do with the WARC files once you have them.

best, Erik