Bear in mind, too, that standards like this are written for people who 
will be writing low-level software tools to work with the WARC format.  
If WARC gets widely adopted, the crucial knowledge for users will be 
distilled and presented in more digestible forms.  (Personally, I'll 
wait for W3Schools to do a tutorial.)

There's also usually a significant delay between the publication of a 
standard and the wide availability of the tools needed for common uses 
and platforms.  (As an exercise, list all the browsers that fully 
implement your favorite standard.  I'll go with XSLT, which became a 
standard in 1999.)

Chris Gray
Library Systems
University of Waterloo

"The nice thing about standards is that you have so many to choose from."
-Andrew Tanenbaum



[log in to unmask] wrote:
> hi Karen,
>
> understood.
>
> the final draft of the spec is available here:
> http://www.scribd.com/doc/4303719/WARC-ISO-28500-final-draft-v018-Zentveld-080618 
>
>
> and other (similar) versions here:
> http://archive-access.sourceforge.net/warc/
>
>
> [log in to unmask]
>
>
>
> On 6/2/09 2:15 PM, Karen Coyle wrote:
>> Unfortunately, being an ISO standard, to obtain it costs 118 CHF 
>> (about $110 USD). Hard to follow a standard you can't afford to read. 
>> Is there an online version somewhere?
>>
>> kc
>>
>> [log in to unmask] wrote:
>>> hi code4lib,
>>>
>>> if you're archiving web content, please use the WARC format.
>>>
>>> thanks,
>>> [log in to unmask]
>>>
>>>
>>>
>>> WARC File Format Published as an International Standard
>>> http://netpreserve.org/press/pr20090601.php
>>>
>>> ISO 28500:2009 specifies the WARC file format:
>>>
>>> * to store both the payload content and control information from
>>>   mainstream Internet application layer protocols, such as the
>>>   Hypertext Transfer Protocol (HTTP), Domain Name System (DNS),
>>>   and File Transfer Protocol (FTP);
>>> * to store arbitrary metadata linked to other stored data
>>>   (e.g. subject classifier, discovered language, encoding);
>>> * to support data compression and maintain data record integrity;
>>> * to store all control information from the harvesting protocol
>>>   (e.g. request headers), not just response information;
>>> * to store the results of data transformations linked to other
>>>   stored data;
>>> * to store a duplicate detection event linked to other stored
>>>   data (to reduce storage in the presence of identical or
>>>   substantially similar resources);
>>> * to be extended without disruption to existing functionality;
>>> * to support handling of overly long records by truncation or
>>>   segmentation, where desired.
>>>
>>>
>>> more info here:
>>> http://www.digitalpreservation.gov/formats/fdd/fdd000236.shtml
>>>
>>>
>>
>>