+1 to Alex's suggestion to use WARC for the preservation master and 
generate PDFs for access.

While I agree with Kyle that it's ultimately the "content" that's 
important and that hypothetical researcher needs are inexhaustible, I do 
think there's an advantage to preserving web content in a web-native 
way. Aside from verisimilitude, there is Memento 
(http://mementoweb.org/), a mechanism for adding temporal navigation to 
the web through federated discovery of resources preserved in 
distributed web archives. Once it is implemented, data stored in WARC 
will ultimately be better integrated into the fabric of the web than 
PDFs siloed in an individual institutional repository.
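
For anyone curious what that temporal navigation looks like on the wire, 
here is a minimal sketch, not a full client. In Memento, a client asks a 
TimeGate for the archived copy of a URL closest to a given date by 
sending an Accept-Datetime header. The TimeGate address and the helper 
name below are placeholders I made up for illustration; the sketch only 
builds the request and does not send it.

```python
from datetime import datetime, timezone
from email.utils import format_datetime

# Hypothetical TimeGate base URL; substitute the archive or aggregator you use.
TIMEGATE = "http://timegate.example.org/timegate/"


def build_timegate_request(original_url, when):
    """Return (url, headers) for a Memento TimeGate lookup of original_url near `when`.

    `when` must be a timezone-aware datetime.
    """
    # Accept-Datetime carries an HTTP-style date expressed in GMT.
    accept_datetime = format_datetime(when.astimezone(timezone.utc), usegmt=True)
    return TIMEGATE + original_url, {"Accept-Datetime": accept_datetime}


url, headers = build_timegate_request(
    "http://example.com/", datetime(2012, 6, 1, tzinfo=timezone.utc)
)
print(url)
print(headers)
```

A TimeGate that understands the header would typically redirect to the 
capture nearest that date. A Wayback instance replaying WARC files can 
sit behind such an endpoint, which a folder of PDFs cannot easily do.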

I also wanted to mention (and encourage addition to!) the Wikipedia list 
of web archiving initiatives: 
http://en.wikipedia.org/wiki/List_of_Web_archiving_initiatives. It 
provides a good overview of many web archiving institutions' programs, 
data formats, technology stacks, and access provisions (including links 
to their Wayback implementations).

~Nicholas
-- 
Nicholas Taylor
Web Archiving Service Manager
Stanford University Libraries