Hi group,
Continuing the compression conversation, I have posted the JPEG 2000 research done for the FADGI group. While it is (obviously) focused on format compression rather than storage compression, scanning the resources reveals some overlap in terms of I/O and access issues. I also figured this would be a good resource to have on hand for future conversations, as there are lots of institutional policies and research documents referenced (these are in a merged PDF for now, but we hope to upload them individually at some point).
http://www.loc.gov/extranet/wiki/osi/ndiip/ndsa/index.php?title=Preservation_Storage_Topic_JPEG2000
Also, I broke out each topic onto its own page and updated the conversation on the wiki. The current compression conversation is at the following link, but please keep discussing it on this thread; I'll post Priscilla's questions below for reference.
http://www.loc.gov/extranet/wiki/osi/ndiip/ndsa/index.php?title=Preservation_Storage_Topic_2:_Compression
Thanks,
Jefferson
----
Topic 2: Compression
Data compression can decrease the cost of long-term preservation by reducing the amount of storage required. There are at least three types of compression to consider:
-- file compression, using a file compression algorithm suited to the file type
-- hardware compression, which usually means compression done by a tape drive as the data is written to tape
-- disk compression, which is performed by many new storage appliances and uses a combination of compression and de-duplication
If there are other kinds of compression, please add that into this discussion.
For each of these types of compression:
1. Are you currently using this type of compression in your own archival storage (in the OAIS sense of long-term preservation storage)?
2. How do you feel about using this type of compression for archival storage? Is this legitimate or something that Best Practice would discourage?
3. What are the particular risks of this type of compression, if any, for preservation?
4. Are there any advantages to using this type of compression beyond reducing storage costs?
5. How do you trade off cost vs. risk?
Compressed formats are in general much more sensitive to data corruption than uncompressed formats. Due to the 'amplification' effect that compression has on data corruption, the percentage saving in storage space is often much less than the percentage increase in the amount of information that is affected by data corruption.
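As a rough illustration of that amplification effect (my own sketch, not taken from any of the referenced documents): flip a single bit in an uncompressed copy of some data and in a gzip-compressed copy of the same data, then compare how much of the recovered content is affected.

import gzip

# Sample payload: repetitive, compressible content.
original = b"Lots of repetitive archival metadata. " * 2000
compressed = gzip.compress(original)

def flip_bit(buf: bytes, index: int) -> bytes:
    # Flip the low-order bit of one byte to simulate a small corruption event.
    damaged = bytearray(buf)
    damaged[index] ^= 0x01
    return bytes(damaged)

# Uncompressed copy: one flipped bit corrupts exactly one byte.
damaged_plain = flip_bit(original, len(original) // 2)
plain_bytes_changed = sum(a != b for a, b in zip(original, damaged_plain))

# Compressed copy: the same single-bit flip can corrupt a long run of the
# decompressed output, or make the stream unreadable altogether.
damaged_gz = flip_bit(compressed, len(compressed) // 2)
try:
    recovered = gzip.decompress(damaged_gz)
    gz_bytes_changed = sum(a != b for a, b in zip(original, recovered)) + abs(len(original) - len(recovered))
except Exception:
    # Decompression or CRC check failed: effectively a total loss of the file.
    gz_bytes_changed = len(original)

print(f"uncompressed copy: {plain_bytes_changed} byte(s) affected")
print(f"compressed copy:   {gz_bytes_changed} byte(s) affected (or stream unreadable)")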
-----Original Message-----
From: The NDSA infrastructure working group list [mailto:[log in to unmask]] On Behalf Of Goethals, Andrea
Sent: Wednesday, March 07, 2012 3:46 PM
To: [log in to unmask]
Subject: Re: [NDSA-INFRASTRUCTURE] compression in preservation storage
I'm not sure this question is easier than the encryption one ;)
I'll answer the easier part for us first. We don't use any compression on disk or tape. One of the reasons we chose the storage technology we're using now is that, under the hood, it aggregates the content into tar files, which we are relatively comfortable with. File compression is another story...
I think it's helpful to discuss the digitized content we're preserving separately from the born digital content. I'll start with the digitized content. Within my department we don't digitize any content ourselves, but we do provide format guidelines and some format requirements. It boils down to the fact that we will accept content in any format, but we tell our customers that we have greater confidence in preserving uncompressed formats (I'm referring only to preservation copies, not access copies). There is only one compressed digitized format that we receive a lot of (JPEG2000 JP2), even though we still recommend uncompressed TIFF over JP2. The collection managers that choose to create JP2 over TIFF do it largely to reduce their storage fee (because they can use the JP2 as both the preservation and the access format). We know that this is a risky area for us because there still isn't a lot of software support for JPEG2000, and because it's a fairly complicated format we have seen a lot of incorrect interpretations of the spec by software developers, leading to invalid files. That said, because the future of JP2 is still unknown at this point, there's a chance it could move toward wider acceptance and support. To lessen our risk, we are participating in an informal network of institutions and individuals who have a stake in seeing JPEG2000 succeed.
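As a very rough illustration of the kind of structural problems that turn up in malformed JP2s (my own sketch, not one of our production tools, and no substitute for a full validator that checks the codestream): a JP2 file has to begin with the 12-byte JPEG 2000 signature box, normally followed by a File Type ('ftyp') box whose brand is 'jp2 '.

import sys

# First 12 bytes of every JP2 file: the JPEG 2000 signature box.
JP2_SIGNATURE = bytes.fromhex("0000000C6A5020200D0A870A")

def looks_like_jp2(path: str) -> bool:
    """Quick structural sanity check only; does not validate the codestream."""
    with open(path, "rb") as fh:
        header = fh.read(24)
    if not header.startswith(JP2_SIGNATURE):
        return False
    # Expect a File Type box right after the signature, with brand 'jp2 '.
    return header[16:20] == b"ftyp" and header[20:24] == b"jp2 "

if __name__ == "__main__":
    for path in sys.argv[1:]:
        print(path, "looks like JP2" if looks_like_jp2(path) else "FAILED signature check")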
For born digital content we can't be as prescriptive as for digitized content (for obvious reasons). We're collecting Web content in any and all formats, many of which use compression. We are also collecting PDFs, which can contain compressed images. Our email archiving project will introduce email attachments in all kinds of (potentially compressed) formats. We're also getting images in compressed formats that came from digital cameras. We are more likely to have compressed formats for born digital content because the only copies are the access copies, which tend to be compressed.
Those are the cases where we receive compressed files. We also intentionally introduce lossless compression for some of our content. As is common in Web archiving, we store our Web content in our repository in compressed form (using Gzipped ARC files). In that case, we are doing what we think is the community standard for Web archives (except that the newer standard is compressed WARC files instead of compressed ARC files). The tools that have been built, or are being built, for these types of files support the compression. We also compress our books digitized for the Google book project (ZIP files contain all the images and text for a particular book). In both the Web archive and the Google book case we are compressing the content because it would take up too much storage space if we didn't. Both GZIP and ZIP are well-known, widely supported lossless formats, so we are relatively comfortable with them.
There is one more case where we are intentionally compressing content - "opaque containers". We give our customers the option to deposit content in any format zipped up into what we call opaque containers. This is an option for people who want to save storage fees, only want bit-level preservation at the moment, and are willing to accept the caveat that the content won't receive the same technical characterization and they won't have all the management functionality that they do for other content. We have not gotten a lot of use out of the opaque container option because, thankfully, our customers really do want all the preservation and management services, but that option is available.
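Part of why lossless containers like ZIP are comfortable is that they are easy to verify: every packaged file should decompress back to a bit-identical copy of the source. A sketch of that kind of fixity-style roundtrip check (the paths and layout here are hypothetical, not our actual verification code):

import hashlib
import zipfile
from pathlib import Path

def sha256(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def verify_book_zip(source_dir: Path, container: Path) -> bool:
    """Check that every member of the ZIP matches the corresponding source file."""
    ok = True
    with zipfile.ZipFile(container) as zf:
        for name in zf.namelist():
            original = (source_dir / name).read_bytes()
            packaged = zf.read(name)
            if sha256(original) != sha256(packaged):
                print(f"MISMATCH: {name}")
                ok = False
    return ok

# Example usage (hypothetical layout):
# verify_book_zip(Path("books/b12345/images"), Path("books/b12345.zip"))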
Having intentionally applied lossless compression in the three cases above to save space/storage fees, we have found some real disadvantages from a management perspective. If you're familiar with the object portion of the PREMIS data model (http://www.loc.gov/standards/premis/v2/premis-report-2-1.pdf), it includes objects, files, and bitstreams. Currently in our repository we only describe content at the object and file level, so at the most granular level we can only characterize these ARC/GZIP or ZIP files at the container level and not at the arguably more important level of the bitstreams within them. This is a real barrier to preservation planning and reporting. We do record a count of each MIME type within these containers, but this is a very crude description compared to what we have for the other content. We plan to add bitstream support to our repository next year so that we can fully describe and manage the bitstreams, but I know that there will always be a lag in tools and infrastructure for what we can do for this content. I would not recommend it as a strategy for all of your content, but I think it makes sense for certain categories of content.
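For anyone curious what that "count of each MIME type" style of characterization looks like in practice, here is a rough sketch of the idea (my own illustration, not our repository code) applied to a ZIP container; guessing types from file extensions is exactly why it is so much cruder than file- or bitstream-level characterization.

import mimetypes
import zipfile
from collections import Counter

def mime_counts(container_path: str) -> Counter:
    """Crude characterization: count members of a ZIP by guessed MIME type."""
    counts = Counter()
    with zipfile.ZipFile(container_path) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue
            mime, _ = mimetypes.guess_type(info.filename)
            counts[mime or "application/octet-stream"] += 1
    return counts

# Example usage (hypothetical path):
# print(mime_counts("deposits/opaque_0001.zip"))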
Andrea
Andrea Goethals
Digital Preservation and Repository Services Manager
Harvard Library Office for Information Systems
[log in to unmask]
(617) 495-3724
> -----Original Message-----
> From: The NDSA infrastructure working group list [mailto:NDSA-
> [log in to unmask]<mailto:[log in to unmask]>] On Behalf Of Priscilla
> Caplan
> Sent: Tuesday, March 06, 2012 4:50 PM
> To: [log in to unmask]<mailto:[log in to unmask]>
> Subject: [NDSA-INFRASTRUCTURE] compression in preservation storage
>
> Time to start our next issue-oriented conversation, this time about
> data compression.
>
>
> Data compression can decrease the cost of long-term preservation by
> reducing the amount of storage required. There are at least three
> types of compression to consider:
>
> -- file compression, using a file compression algorithm suited to the
> file type
> -- hardware compression, which usually means compression done by a
> tape drive as the data is written to tape
> -- disk compression, which is performed by many new storage
> appliances and uses a combination of compression and deduplication
>
> If there are other kinds of compression, please add that into this
> discussion.
>
> *For each of these types of compression:
> *
>
> 1. Are you currently using this type of compression in your own
> archival storage (in the OAIS sense of long-term preservation storage)?
>
> 2. How do you feel about using this type of compression for archival
> storage? Is this legitimate or something that Best Practice would
> discourage?
>
> 3. What are the particular risks of this type of compression, if any,
> for preservation?
>
> 4. Are there any advantages to using this type of compression beyond
> reducing storage costs?
>
> 5. How do you trade off cost vs. risk?
>
>
> I looked for best practices or other documents that addressed
> compression in the preservation context. Below are some snippets of
> what I found, addressing mainly file and hardware compression.
>
> *Case Western University Archives:*
>
> Compression adds complexity to long-term preservation. Some
> compression techniques shed "redundant" information. As an example,
> JPEG removes information to reduce file size. The image might look
> fine on your current monitor, but as monitors improve, the lower
> quality of the image will be more obvious.
>
> *Wright, Miller and Addis* in
> http://www.prestoprime.org/docs/training/Cost_of_risk_RW.pdf
>
> Not encoding, in particular not using compression, typically results
> in files that have minimal sensitivity to corruption. In this way,
> the choice not to use compression is a way to mitigate against loss.
>
> *TNA on Image Compression* in
> http://www.nationalarchives.gov.uk/documents/image_compression.pdf
>
> It is recommended that algorithms should only be used in the
> circumstances for which they are most efficient. It is also strongly
> recommended that archival master versions of images should only be
> created and stored using lossless algorithms. The Intellectual
> Property Rights status of a compression algorithm is primarily an
> issue for developers of format specifications, and software encoders/decoders.
> However, the use
> of open, non-proprietary compression techniques is recommended for the
> purposes of sustainability.
>
> *Howard Besser*, quoted in
> http://digitalpreservationstrategies.blogspot.com/
>
> Data is often compressed or "scrambled" to assist in its storage and/or
> protect its intellectual content. These compression and encryption
> algorithms are often developed by private organisations who will one
> day cease to support them. If this happens you're stuck between a rock
> and a hard place. If you don't want to get into legal trouble you are
> no longer able to read your data; and if you go ahead and "do the
> unwrapping yourself" it's quite possible you're breaking copyright law.
>
> *NINCH Guide to Good Practice*
> http://www.nyu.edu/its/humanities/ninchguide/XIV/
>
> A similar obsolescence problem will have to be addressed with the file
> formats and compression techniques you choose. Do not rely on
> proprietary file formats and compression techniques, which may not be
> supported in the future as the companies which produce them merge, go
> out of business or move on to new products. In the cultural heritage
> community, the de facto standard formats are uncompressed TIFF for
> images and PDF, ASCII (SGML/XML markup) and RTF for text. Migration to
> future versions of these formats is likely to be well-supported, since
> they are so widely used. Digital objects for preservation should not
> be stored in compressed or encrypted formats.
>
> *PRESTO Centre, Threats to Data Integrity from Large-Scale Management
> Environments*
> http://www.prestocentre.org/library/resources/threats-data-integrity-
> use-large-scale-management-environments
>
> Compressed formats are in general much more sensitive to data
> corruption than uncompressed formats. Due to the 'amplification'
> effect that compression has on data corruption, the percentage saving
> in storage space is often much less than the percentage increase in
> the amount of information that is affected by data corruption.
>
>
> Priscilla
>
>