I don't know if this will help, but it might be an option.
If the site is still reachable through a browser, HTTrack (https://www.httrack.com/) will capture essentially everything it can crawl and make it available for offline viewing.
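If the live site is already gone, another rough option is to unpack the WARC directly. Below is a minimal sketch, not a tested tool. It assumes the third-party warcio package is installed, and the function names (url_to_path, unpack_warc) and the file-naming scheme are made up for illustration. It only writes HTTP 200 responses, and it relies on the WARC having been written in crawl order so that later captures overwrite earlier ones.

```python
import os
from urllib.parse import urlsplit, unquote


def url_to_path(url, root):
    """Map a captured URL to a file path under root (hypothetical scheme)."""
    parts = urlsplit(url)
    path = unquote(parts.path)
    # Drop empty, "." and ".." segments so nothing can escape root.
    segments = [s for s in path.split("/") if s not in ("", ".", "..")]
    # Directory-style URLs (or an empty path) become index.html.
    if not segments or path.endswith("/"):
        segments.append("index.html")
    # Fold the query string into the file name so that distinct
    # dynamic pages (page.php?id=1, page.php?id=2) do not collide.
    if parts.query:
        segments[-1] += "_" + parts.query.replace("/", "_")
    return os.path.join(root, parts.netloc, *segments)


def unpack_warc(warc_path, root):
    """Write each HTTP 200 response in the WARC to a file under root."""
    # Assumption: warcio is available (pip install warcio).
    from warcio.archiveiterator import ArchiveIterator

    written = 0
    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue
            headers = record.http_headers
            if headers is None or headers.get_statuscode() != "200":
                continue
            url = record.rec_headers.get_header("WARC-Target-URI")
            target = url_to_path(url, root)
            try:
                os.makedirs(os.path.dirname(target), exist_ok=True)
                with open(target, "wb") as out:
                    # content_stream() yields the decoded response body.
                    out.write(record.content_stream().read())
            except OSError:
                # e.g. /a was saved as a file and /a/b now needs a directory.
                continue
            written += 1
    return written
```

Calling unpack_warc("site.warc.gz", "static") (both names are placeholders) would leave a directory tree that any web server can serve. Two caveats: links inside the HTML are not rewritten, so absolute links will still point at the old host, and files named from query strings will not carry an .html extension.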
________________________________________
From: Code for Libraries [[log in to unmask]] on behalf of Demian Katz [[log in to unmask]]
Sent: Wednesday, March 04, 2020 10:37 AM
To: [log in to unmask]
Subject: [CODE4LIB] WARC --> static HTML?
Hello, everyone –
I’ve been struggling with a use case that feels like it can’t be unique to my situation. Wondering if anyone else has solved this!
We’ve decommissioned an old dynamic site, and we still want to make the content available in a static form. It was a large and complex site with a lot of pages, and after trying a variety of solutions, we ended up harvesting it all into a WARC file. This is great for archival purposes, but we’re struggling with presentation.
The problem with serving content from a WARC is that it seems to be unbearably slow in every solution we try. (And when I say unbearably, I mean “40 minutes to load one page using pywb” – not kidding).
I assume that this slowness has to do with dynamically navigating around in a multi-gigabyte file to retrieve things… but really all we want to do is serve up static content.
Is there some tool that can simply unpack a WARC into a directory of static files that can be navigated quickly? It seems like this should be possible, but I’m coming up empty in searching.
And just to be clear: I understand that unpacking a WARC probably won’t retain all of the richness of detail that dynamic retrieval from the WARC can provide, and I certainly don’t plan to throw away the WARC… but for people who just want to quickly navigate content from the most recently-crawled version of the old site, I want a solution that will perform acceptably, and I haven’t found it yet.
Thanks for any and all advice! 😊
- Demian