+1 to Alex's suggestion to use WARC for the preservation master and
generate PDFs for access.
While I agree with Kyle that it's ultimately the "content" that's
important and that hypothetical researcher needs are inexhaustible, I do
think there's an advantage to preserving web content in a web-native
way. Aside from verisimilitude, there is Memento
(http://mementoweb.org/), a mechanism for adding temporal navigation to
the web through federated discovery of resources preserved in
distributed web archives. As Memento is implemented more widely, data
stored in WARC will ultimately be better integrated into the fabric of
the web than PDFs siloed in an individual institutional repository.
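
For anyone who hasn't looked at Memento yet, here is a rough sketch of
what the client side of its datetime negotiation looks like: you ask a
"TimeGate" for a URL as it existed at some date by sending an
Accept-Datetime header. The TimeGate address and the function name below
are made up for illustration, and the request is only built, not sent.

```python
from datetime import datetime, timezone
from email.utils import format_datetime
from urllib.request import Request


def build_timegate_request(timegate_base, target_url, when):
    """Build (but do not send) a Memento datetime-negotiation request.

    timegate_base is a hypothetical TimeGate prefix; real archives each
    publish their own.
    """
    # Accept-Datetime uses the HTTP date format, expressed in GMT.
    accept_datetime = format_datetime(when.astimezone(timezone.utc), usegmt=True)
    # Assumption: this TimeGate takes the original URL appended to its prefix.
    return Request(
        timegate_base + target_url,
        headers={"Accept-Datetime": accept_datetime},
    )


req = build_timegate_request(
    "http://timegate.example.org/timegate/",  # hypothetical TimeGate
    "http://example.com/",
    datetime(2012, 6, 1, tzinfo=timezone.utc),
)
# urllib normalizes header names, hence the lowercase "d" here.
print(req.full_url)
print(req.get_header("Accept-datetime"))
```

A Memento-aware archive would answer a request like this by pointing to
the capture closest to the requested date, which is the kind of
cross-archive lookup that a PDF in a repository can't take part in.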
I also wanted to mention (and encourage addition to!) the Wikipedia list
of web archiving initiatives:
http://en.wikipedia.org/wiki/List_of_Web_archiving_initiatives. It
provides a good overview of many web archiving institutions' programs,
data formats, technology stacks, and access provisions (including links
to their Wayback implementations).
~Nicholas
--
Nicholas Taylor
Web Archiving Service Manager
Stanford University Libraries