Thank you, Stuart, and to everyone else who answered both on- and off-list. I now have a few different ideas I can try! It may take me a little while to find time to try them all, but I'll report back with a solution once I've found something that meets my needs, in case it's helpful to others in future. I greatly appreciate all of your support. 😊
- Demian
-----Original Message-----
From: Code for Libraries <[log in to unmask]> On Behalf Of Stuart A. Yeates
Sent: Wednesday, March 4, 2020 4:36 PM
To: [log in to unmask]
Subject: [EXTERNAL] Re: [CODE4LIB] WARC --> static HTML?
WARC is not an access format.
WARC is entirely optimised for crawling, and it is the gold standard for archiving because it's close to the 'on the wire' web experience.
BUT
There is no file index: you access every file using a linear search from the start of the archive.
There is no guarantee that related files are stored together: an HTML page and its CSS, images and embedded streaming video may be scattered across the archive.
There is no guarantee that related pages are stored together.
If you're using WARC for access, you need something that overcomes these limitations, and the obvious choice is CDX indexes. For an explanation of how CDX files index WARC files, see the diagram on
https://support.archive-it.org/hc/en-us/articles/115001790023-Access-Archive-It-s-Wayback-index-with-the-CDX-C-API
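
To illustrate why an index matters, here is a minimal sketch in Python. It is a toy, not real WARC or CDX handling: the names build_archive and fetch are made up for this example, and the "archive" is just a run of gzip members, one per record. The idea it borrows from CDX is only this: record each entry's byte offset and length once, then seek straight to it instead of scanning from the start.

```python
import gzip
import io


def build_archive(records):
    """Concatenate one gzip member per (url, body) pair.

    Returns the archive bytes plus an index mapping each URL to the
    (offset, length) of its compressed member. This is a toy layout,
    not the real WARC record format.
    """
    buf = io.BytesIO()
    index = {}
    for url, body in records:
        offset = buf.tell()
        member = gzip.compress(body)
        buf.write(member)
        index[url] = (offset, len(member))
    return buf.getvalue(), index


def fetch(archive, index, url):
    """Jump directly to one record using the index; no linear scan."""
    offset, length = index[url]
    with io.BytesIO(archive) as fh:
        fh.seek(offset)
        return gzip.decompress(fh.read(length))


# Hypothetical sample data for illustration only.
archive, index = build_archive([
    ("http://example.org/", b"<html>home</html>"),
    ("http://example.org/style.css", b"body { margin: 0 }"),
    ("http://example.org/about", b"<html>about</html>"),
])
page = fetch(archive, index, "http://example.org/about")
```

Without the index, finding the last record would mean decompressing everything before it, which is the cost that grows with a multi-gigabyte file.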
---
Alternatively, use wget with the --convert-links option over your WARC / pywb solution. This should be faster than 40 mins per page on average, since CSS and branding images should only have to be retrieved once (assuming sane site design).
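
One reading of this suggestion is to point wget at the running pywb replay and let it write a static copy. The sketch below only assembles the command line; build_wget_command, the replay URL and the output directory are assumptions for illustration, and the flags beyond --convert-links are ones commonly paired with it for mirroring rather than anything specified above.

```python
def build_wget_command(start_url, out_dir):
    """Return an argv list for mirroring a site into static files.

    start_url and out_dir are placeholders; adjust them to wherever
    the replay is actually served and where the copy should land.
    """
    return [
        "wget",
        "--mirror",            # recursive download with timestamping
        "--convert-links",     # rewrite links to point at the local copies
        "--page-requisites",   # also fetch CSS, images and similar assets
        "--adjust-extension",  # save HTML pages with an .html suffix
        "--no-parent",         # stay below the starting path
        "--directory-prefix", out_dir,
        start_url,
    ]


# Hypothetical replay address and output directory.
cmd = build_wget_command("http://localhost:8080/old-site/", "static-copy")

# To actually run it (requires wget on the PATH):
#   import subprocess
#   subprocess.run(cmd, check=True)
```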
cheers
stuart
--
...let us be heard from red core to black sky
On Thu, 5 Mar 2020 at 04:37, Demian Katz <[log in to unmask]> wrote:
> Hello, everyone –
>
> I’ve been struggling with a use case that feels like it can’t be
> unique to my situation. Wondering if anyone else has solved this!
>
> We’ve decommissioned an old dynamic site, and we still want to make
> the content available in a static form. It was a large and complex
> site with a lot of pages, and after trying a variety of solutions, we
> ended up harvesting it all into a WARC file. This is great for
> archival purposes, but we’re struggling with presentation.
>
> The problem with serving content from a WARC is that it seems to be
> unbearably slow in every solution we try. (And when I say unbearably,
> I mean “40 minutes to load one page using pywb” – not kidding).
>
> I assume that this slowness has to do with dynamically navigating
> around in a multi-gigabyte file to retrieve things… but really all we
> want to do is serve up static content.
>
> Is there some tool that can simply unpack a WARC into a directory of
> static files that can be navigated quickly? It seems like this should
> be possible, but I’m coming up empty in searching.
>
> And just to be clear: I understand that unpacking a WARC probably
> won’t retain all of the richness of detail that dynamic retrieval from
> the WARC can provide, and I certainly don’t plan to throw away the
> WARC… but for people who just want to quickly navigate content from
> the most recently-crawled version of the old site, I want a solution
> that will perform acceptably, and I haven’t found it yet.
>
> Thanks for any and all advice! 😊
>
> - Demian
>