Hi Edward,

We're currently using the warc-tools library for WARC creation. It's written in Python, but the package also ships with a few pre-built utilities that might suit your needs.

http://code.hanzoarchives.com/warc-tools
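
For a rough idea of what converting an existing wget mirror involves, here is a minimal sketch that uses only the Python standard library. It does not use the warc-tools API, and the function names (`warc_resource_record`, `mirror_to_warc`) and the `base_uri` parameter are made up for illustration. It wraps each mirrored file in a WARC/1.0 `resource` record, since a plain mirror no longer has the original HTTP headers that a `response` record would carry, and gzips each record separately, as is conventional for .warc.gz files.

```python
import gzip
import mimetypes
import uuid
from datetime import datetime, timezone
from pathlib import Path


def warc_resource_record(target_uri, payload, content_type="application/octet-stream"):
    """Build one WARC/1.0 'resource' record as raw bytes (hypothetical helper)."""
    now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    headers = [
        "WARC/1.0",
        "WARC-Type: resource",
        "WARC-Record-ID: <urn:uuid:%s>" % uuid.uuid4(),
        "WARC-Date: %s" % now,
        "WARC-Target-URI: %s" % target_uri,
        "Content-Type: %s" % content_type,
        "Content-Length: %d" % len(payload),
    ]
    # Header block, blank line, payload, then two CRLFs to end the record.
    head = "\r\n".join(headers).encode("utf-8") + b"\r\n\r\n"
    return head + payload + b"\r\n\r\n"


def mirror_to_warc(mirror_dir, warc_path, base_uri):
    """Write every file under mirror_dir to warc_path; return the record count.

    Assumes warc_path lies outside mirror_dir and that base_uri is the URL
    the mirror was harvested from (both are assumptions of this sketch).
    """
    root = Path(mirror_dir)
    count = 0
    with open(warc_path, "wb") as out:
        for path in sorted(root.rglob("*")):
            if not path.is_file():
                continue
            uri = base_uri.rstrip("/") + "/" + path.relative_to(root).as_posix()
            ctype = mimetypes.guess_type(path.name)[0] or "application/octet-stream"
            record = warc_resource_record(uri, path.read_bytes(), ctype)
            # One gzip member per record.
            out.write(gzip.compress(record))
            count += 1
    return count
```

Something like `mirror_to_warc("example.org", "example.warc.gz", "http://example.org")` would then produce one file per site. A real tool would also write a `warcinfo` record and payload digests, so treat this only as an outline of the format.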

-Kurt
________________________________________
From: Code for Libraries [[log in to unmask]] on behalf of Edward M. Corrado [[log in to unmask]]
Sent: Wednesday, November 23, 2011 5:30 PM
To: [log in to unmask]
Subject: [CODE4LIB] Web archiving and WARC

Hello All,

I need to harvest a few Web sites in order to preserve them. I'd
really like to preserve them using the WARC file format [1] since it
is a standard for digital preservation. I looked at Web
Curator Tool (WCT) and Heritrix and they seem to be good at what they
do but are built to work on a much larger scale than what I'd like to
do -- and that comes with a cost of increased complexity. Tools like
wget are simple to use and can easily be scripted to accomplish my
limited task, except the standard wget and similar tools I am familiar
with do not support WARC. Also, I haven't been able to find a tool
that can convert zipped files created with wget to WARC.

I did find a version of wget with warc support built in [1] from the
Archive Team so that may be my solution, but compiling software with
"dirty" written into the name of the zip file is maybe not the best
long-term solution. Does anyone know of any other simple tools to
create a WARC file (either from harvesting or converting a wget or
similar mirror/archive)?

Edward

[1] http://archiveteam.org/index.php?title=Wget_with_WARC_output