Mike,

Thanks for the information about your WarcManager tool. I will check it out.

Edward

On Thu, Nov 24, 2011 at 8:08 AM, Mike Smorul wrote:
> Hi,
> We've been working on a tool to help manage WARC files after you have
> piles of them. It supports basic searching and content browsing. We've done
> some testing up to ~10 TB of WARC files and it's still fairly responsive.
>
> https://wiki.umiacs.umd.edu/adapt/index.php/WarcManager
>
> -Mike
>
> On Wed, Nov 23, 2011 at 11:46 PM, Erik Hetzner wrote:
>
>> At Wed, 23 Nov 2011 18:30:02 -0500,
>> Edward M. Corrado wrote:
>> >
>> > Hello All,
>> >
>> > I need to harvest a few Web sites in order to preserve them. I'd
>> > really like to preserve them using the WARC file format [1] since it
>> > is a standard for digital preservation. I looked at Web
>> > Curator Tool (WCT) and Heritrix and they seem to be good at what they
>> > do but are built to work on a much larger scale than what I'd like to
>> > do -- and that comes with a cost of increased complexity. Tools like
>> > wget are simple to use and can easily be scripted to accomplish my
>> > limited task, except the standard wget and similar tools I am familiar
>> > with do not support WARC. Also, I haven't been able to find a tool
>> > that can convert zipped files created with wget to WARC.
>> >
>> > I did find a version of wget with WARC support built in [1] from the
>> > Archive Team so that may be my solution, but compiling software with
>> > "dirty" written into the name of the zip file is maybe not the best
>> > long-term solution. Does anyone know of any other simple tools to
>> > create a WARC file (either from harvesting or converting a wget or
>> > similar mirror/archive)?
>>
>> Hi Edward,
>>
>> The WCT uses Heritrix behind the scenes. Basically Heritrix or
>> wget+warc are your only two solutions, unless you convert to WARC from
>> something else.
>> And I have never seen another crawler that gathers the
>> information that needs to go into the WARC file.
>>
>> Heritrix isn’t that bad to get up & running. The more tricky issue is
>> what to do with the WARC files once you have them.
>>
>> best, Erik
>>
>> Sent from my free software system <http://fsf.org/>.