WARC is not an access format.
WARC is optimised entirely for crawling, and it is the gold standard for
archiving because it stays close to the 'on the wire' web experience.
BUT
There is no file index: you access every file using a linear search from
the start of the archive.
There is no guarantee that related files are stored together: an HTML page
and its CSS, images and embedded streaming video may be scattered
throughout the archive.
There is no guarantee that related pages are stored together.
If you're using WARC for access, you need something that overcomes these
limitations, and the obvious choice is CDX indexes. For an explanation of
how CDX files index WARC files, see the diagram on
https://support.archive-it.org/hc/en-us/articles/115001790023-Access-Archive-It-s-Wayback-index-with-the-CDX-C-API
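To make the idea concrete, here is a rough sketch of what an offset index buys you. It is not the CDX format itself (real CDX files carry more fields per record), and it assumes an uncompressed .warc file, whereas most real archives are gzipped per record. The function name build_offset_index is made up for illustration. It makes one linear pass and remembers where each record starts, so later lookups can seek instead of scanning.

```python
import io


def build_offset_index(stream):
    """Map each WARC-Target-URI to the byte offset of its record.

    Assumes an uncompressed WARC opened in binary mode. If a URI occurs
    more than once, the last record wins.
    """
    index = {}
    while True:
        offset = stream.tell()
        line = stream.readline()
        if not line:
            break  # end of file
        if not line.strip():
            continue  # blank separator lines between records
        # 'line' is the version line (e.g. b"WARC/1.0"); now read the headers.
        headers = {}
        while True:
            header = stream.readline()
            if header in (b"\r\n", b"\n", b""):
                break  # an empty line ends the header block
            name, _, value = header.partition(b":")
            headers[name.strip().lower()] = value.strip()
        # Skip over the record body without reading it.
        length = int(headers.get(b"content-length", b"0"))
        stream.seek(length, io.SEEK_CUR)
        uri = headers.get(b"warc-target-uri")
        if uri is not None:
            index[uri.decode("utf-8", "replace")] = offset
    return index
```

With the index in hand, fetching a page is a seek to index[url] followed by a read of one record, rather than a scan from the start of the file.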
---
Alternatively, run wget with the --convert-links option against your WARC /
pywb solution to produce a static copy. This should be faster than 40 mins
per page on average, since CSS and branding images should only have to be
retrieved once (assuming sane site design).
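A rough sketch of what that wget run might look like, wrapped in Python so the options are spelled out. The replay URL (a local pywb instance on port 8080 with a collection called my-collection) and the output directory are placeholders, not anything from your setup, and the extra flags beyond --convert-links are my guess at what a static mirror needs. The helper name build_wget_command is made up; it only builds the argument list and does not run anything.

```python
import shlex


def build_wget_command(replay_url, output_dir):
    """Return a wget argument list that mirrors a replayed site to static files."""
    return [
        "wget",
        "--mirror",            # recursive download with timestamping
        "--page-requisites",   # also fetch the CSS, images and scripts each page needs
        "--convert-links",     # rewrite links so the local copy can be browsed offline
        "--adjust-extension",  # save HTML pages with an .html suffix
        "--no-parent",         # stay below the starting URL
        "--directory-prefix", output_dir,
        replay_url,
    ]


if __name__ == "__main__":
    # Hypothetical pywb replay URL and output directory.
    cmd = build_wget_command(
        "http://localhost:8080/my-collection/https://example.org/",
        "static-site",
    )
    print(shlex.join(cmd))
```

The printed command can be pasted into a shell, or the list can be handed to subprocess.run once the URL points at a real replay endpoint.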
cheers
stuart
--
...let us be heard from red core to black sky
On Thu, 5 Mar 2020 at 04:37, Demian Katz <[log in to unmask]> wrote:
> Hello, everyone –
>
> I’ve been struggling with a use case that feels like it can’t be unique to
> my situation. Wondering if anyone else has solved this!
>
> We’ve decommissioned an old dynamic site, and we still want to make the
> content available in a static form. It was a large and complex site with a
> lot of pages, and after trying a variety of solutions, we ended up
> harvesting it all into a WARC file. This is great for archival purposes,
> but we’re struggling with presentation.
>
> The problem with serving content from a WARC is that it seems to be
> unbearably slow in every solution we try. (And when I say unbearably, I
> mean “40 minutes to load one page using pywb” – not kidding).
>
> I assume that this slowness has to do with dynamically navigating around
> in a multi-gigabyte file to retrieve things… but really all we want to do
> is serve up static content.
>
> Is there some tool that can simply unpack a WARC into a directory of
> static files that can be navigated quickly? It seems like this should be
> possible, but I’m coming up empty in searching.
>
> And just to be clear: I understand that unpacking a WARC probably won’t
> retain all of the richness of detail that dynamic retrieval from the WARC
> can provide, and I certainly don’t plan to throw away the WARC… but for
> people who just want to quickly navigate content from the most
> recently-crawled version of the old site, I want a solution that will
> perform acceptably, and I haven’t found it yet.
>
> Thanks for any and all advice! 😊
>
> - Demian
>