Thanks, Carol. It was an old Drupal site, but unfortunately, at this point, the site is already gone, so the WARC is all I have -- but yes, I'm essentially looking for a way to unpack a WARC into the equivalent of the output of a giant wget operation. (We opted not to do things that way in the first place because some things needed to be changed incrementally as we went, and the WARC approach offered more flexibility in harvesting... but at the time, I had assumed that serving up a WARC would be much simpler than it has so far turned out to be.)
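To make "unpack a WARC into the equivalent of a giant wget" concrete, here is a rough, untested-against-this-archive sketch of what that could look like. It assumes the third-party warcio Python library is installed; the function names (url_to_path, unpack_warc), the "site" output directory, and the way query strings are folded into file names are all hypothetical choices for illustration, not an existing tool's behavior.

```python
import os
from urllib.parse import urlsplit, unquote


def url_to_path(url, root="site"):
    """Map a URL to a wget-like local path under `root` (hypothetical layout)."""
    parts = urlsplit(url)
    path = unquote(parts.path)
    # Directory-style URLs get an index.html, similar to what wget does.
    if not path or path.endswith("/"):
        path += "index.html"
    # Drop empty, "." and ".." segments so nothing can escape `root`.
    segments = [s for s in path.split("/") if s not in ("", ".", "..")]
    if not segments:
        segments = ["index.html"]
    # Assumption: keep query-string variants apart by folding them into the name.
    if parts.query:
        segments[-1] += "_" + parts.query.replace("/", "_")
    return os.path.join(root, parts.netloc, *segments)


def unpack_warc(warc_path, root="site"):
    """Write the payload of every HTTP 200 response record to a file."""
    # Third-party dependency, imported here so url_to_path works without it.
    from warcio.archiveiterator import ArchiveIterator

    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response" or record.http_headers is None:
                continue
            if record.http_headers.get_statuscode() != "200":
                continue
            url = record.rec_headers.get_header("WARC-Target-URI")
            target = url_to_path(url, root)
            try:
                os.makedirs(os.path.dirname(target), exist_ok=True)
                # Later records overwrite earlier ones for the same URL.
                with open(target, "wb") as out:
                    out.write(record.content_stream().read())
            except OSError as err:
                # e.g. /node saved as a file, then /node/1 needs a directory
                print(f"skipped {url}: {err}")


if __name__ == "__main__":
    import sys
    unpack_warc(sys.argv[1])
```

A sketch like this only gets part of the way: extensionless Drupal-style URLs would still need a content type from the web server, links inside the HTML are not rewritten, and a URL that is both a page and a parent path (/node and /node/1) collides on disk and is simply skipped here.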
- Demian
-----Original Message-----
From: Code for Libraries <[log in to unmask]> On Behalf Of Carol Kassel
Sent: Wednesday, March 4, 2020 10:49 AM
To: [log in to unmask]
Subject: [EXTERNAL] Re: [CODE4LIB] WARC --> static HTML?
Hi Demian,
What kind of dynamic site is it? I've decommissioned old Drupal and WordPress sites by essentially doing a big wget and generating a bunch of HTML pages. It's not perfect, and it depends on the site in question, but it has worked well enough. I wonder if that could serve as a Plan B for you.
Best wishes,
Carol
On Wed, Mar 4, 2020 at 10:37 AM Demian Katz <[log in to unmask]> wrote:
> Hello, everyone –
>
> I’ve been struggling with a use case that feels like it can’t be
> unique to my situation. Wondering if anyone else has solved this!
>
> We’ve decommissioned an old dynamic site, and we still want to make
> the content available in a static form. It was a large and complex
> site with a lot of pages, and after trying a variety of solutions, we
> ended up harvesting it all into a WARC file. This is great for
> archival purposes, but we’re struggling with presentation.
>
> The problem with serving content from a WARC is that it seems to be
> unbearably slow in every solution we try. (And when I say unbearably,
> I mean “40 minutes to load one page using pywb” – not kidding).
>
> I assume that this slowness has to do with dynamically navigating
> around in a multi-gigabyte file to retrieve things… but really all we
> want to do is serve up static content.
>
> Is there some tool that can simply unpack a WARC into a directory of
> static files that can be navigated quickly? It seems like this should
> be possible, but I’m coming up empty in searching.
>
> And just to be clear: I understand that unpacking a WARC probably
> won’t retain all of the richness of detail that dynamic retrieval from
> the WARC can provide, and I certainly don’t plan to throw away the
> WARC… but for people who just want to quickly navigate content from
> the most recently crawled version of the old site, I want a solution
> that will perform acceptably, and I haven’t found it yet.
>
> Thanks for any and all advice! 😊
>
> - Demian
>
--
Carol Kassel
Senior Manager, Digital Library Infrastructure
NYU Digital Library Technology Services
she/her/hers
[log in to unmask]
(212) 992-9246
dlib.nyu.edu