

NDSA-INFRASTRUCTURE@LISTS.CLIR.ORG



NDSA-INFRASTRUCTURE  March 2012, Week 1


Subject: Re: compression in preservation storage
From: "Goethals, Andrea" <[log in to unmask]>
Reply-To: The NDSA infrastructure working group list <[log in to unmask]>
Date: Wed, 7 Mar 2012 20:45:55 +0000
Content-Type: text/plain


I'm not sure this question is easier than the encryption one ;)

I'll answer the easier part for us first. We don't use any compression on disk or tape. One of the reasons we chose the storage technologies we're using now is that under the hood they aggregate the content in tar files, which we are relatively comfortable with. File compression is another story...

I think it's helpful to discuss the digitized content we're preserving separately from the born digital content. I'll start with the digitized content. Within my department we don't digitize any content ourselves, but we do provide format guidelines and some format requirements. It boils down to the fact that we will accept content in any format, but we tell our customers that we have greater confidence in preserving uncompressed formats (I'm just referring to preservation copies, not access copies). There is only one compressed digitized format that we receive a lot of (JPEG2000 JP2), even though we still recommend uncompressed TIFF over JP2. The collection managers who choose to create JP2 over TIFF do it largely to reduce their storage fee (because they can use the JP2 as both the preservation and access format). We know that this is a risky area for us, for two reasons: there still isn't a lot of software support for JPEG2000, and because it's a fairly complicated format we have seen a lot of incorrect interpretations of the spec by software developers, leading to invalid files. But I think it's safe to say that because the future of JP2 is still unknown at this point, there's a chance that it could go in the direction of more acceptance/support. To lessen our risk there, we are participating in an informal network of institutions/individuals who have a stake in seeing JPEG2000 succeed.

For born digital content we can't be as prescriptive as for digitized content (for obvious reasons). We're collecting Web content in any and all formats, many of which use compression. We are also collecting PDFs, which can contain compressed images. And our email archiving project will introduce email attachments in all kinds of (potentially compressed) formats. We're also receiving images in compressed formats that came from digital cameras. We are more likely to have compressed formats for born digital content because the only copies are the access copies, which tend to be compressed.

Those are the cases where we receive compressed files. We also intentionally introduce lossless compression for some of our content. As is common in Web archiving, we store our Web content in our repository in compressed form (using Gzipped ARC files). In that case, we are doing what we think is the community standard for Web archives (except for the fact that the new standard is compressed WARC files instead of compressed ARC files). The tools that have been built or are being built for these types of files support the compression. We also compress our books digitized for the Google book project (ZIP files contain all the images and text for a particular book). In both the Web archive and the Google book case we are compressing them because they would take up too much storage space if we didn't. Both GZIP and ZIP are well-known, widely supported lossless formats, so we are relatively comfortable with them. There is one more case where we are intentionally compressing content: "opaque containers". We give our customers the option to deposit content in any format zipped up into what we call opaque containers. This is an option for people who want to save storage fees, only want bit-level preservation at the moment, and are willing to accept the caveat that the content won't receive the same technical characterization and that they won't have all the management functionality that they do for other content. The opaque container option has not gotten a lot of use because thankfully our customers really do want all the preservation and management services, but that option is available.
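
As a small illustration of what "lossless" means for the GZIP case described above, here is a minimal Python sketch that compresses a payload, decompresses it, and compares SHA-256 digests of the original and restored bytes. The sample payload and the function names are made up for the example; this is not the repository's actual tooling, and real ARC/WARC files would be read with tools built for those formats.

```python
import gzip
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of data as a hex string."""
    return hashlib.sha256(data).hexdigest()


def gzip_round_trip_is_lossless(payload: bytes) -> bool:
    """Compress and decompress payload, then compare digests."""
    compressed = gzip.compress(payload)
    restored = gzip.decompress(compressed)
    return sha256_hex(restored) == sha256_hex(payload)


# Hypothetical stand-in for a harvested Web page (repetitive, so it compresses well).
sample = b"<html><body>archived page</body></html>" * 100

compressed_size = len(gzip.compress(sample))
lossless = gzip_round_trip_is_lossless(sample)

print(f"original: {len(sample)} bytes, compressed: {compressed_size} bytes")
print(f"round trip lossless: {lossless}")
```

A fixity check of this kind only confirms that the bytes survive the round trip; it says nothing about whether the formats inside the container are themselves well supported.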

Having described the 3 cases above where we intentionally losslessly compress content to save space/storage fees, I should say that there are some real disadvantages to having done so from a management perspective. If you're familiar with the object portion of the PREMIS data model (http://www.loc.gov/standards/premis/v2/premis-report-2-1.pdf), it includes objects, files and bitstreams. Currently in our repository we only describe content at the object and file level, so at the most granular level we can only characterize these ARC/GZIP or ZIP files at that container level and not at the arguably more important level of the bitstreams within them. This is a real barrier to preservation planning and reporting. We do record a count of each MIME-type within these containers, but this is a very crude description compared to what we have for the other content. We plan to add bitstream support to our repository next year so that we can fully describe and manage the bitstreams, but I know that there will always be a lag in tools and infrastructure in what we can do for this content. I would not recommend this kind of compression as a strategy for all of your content, but I think it makes sense for certain categories of content.
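
To make the "count of each MIME-type within these containers" concrete, here is a rough Python sketch for a ZIP container. It is an assumption-laden illustration, not how the repository actually does it: the MIME type is guessed from the file extension alone using the standard library, the member names are invented, and anything unrecognized is counted as application/octet-stream.

```python
import io
import mimetypes
import zipfile
from collections import Counter


def count_mime_types(zip_bytes: bytes) -> Counter:
    """Count members of a ZIP container by MIME type guessed from the file name."""
    counts = Counter()
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # directory entries carry no content
            guessed, _encoding = mimetypes.guess_type(info.filename)
            counts[guessed or "application/octet-stream"] += 1
    return counts


def build_demo_container() -> bytes:
    """Build a small in-memory ZIP with hypothetical member names."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("book/page1.txt", "page one text")
        archive.writestr("book/page2.txt", "page two text")
        archive.writestr("book/index.html", "<html></html>")
        archive.writestr("book/unknown.zzzunknown", b"\x00\x01")
    return buffer.getvalue()


demo_counts = count_mime_types(build_demo_container())
for mime_type, count in sorted(demo_counts.items()):
    print(f"{mime_type}: {count}")
```

A tally like this shows how coarse the description is: it records how many members of each type a container holds, but nothing about the validity or technical properties of any individual bitstream.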

Andrea


Andrea Goethals
Digital Preservation and Repository Services Manager
Harvard Library Office for Information Systems
[log in to unmask]
(617) 495-3724





> -----Original Message-----
> From: The NDSA infrastructure working group list [mailto:NDSA-
> [log in to unmask]] On Behalf Of Priscilla Caplan
> Sent: Tuesday, March 06, 2012 4:50 PM
> To: [log in to unmask]
> Subject: [NDSA-INFRASTRUCTURE] compression in preservation storage
>
> Time to start our next issue-oriented conversation, this time about
> data compression.
>
>
> Data compression can decrease the cost of long-term preservation by
> reducing the amount of storage required. There are at least three
> types of compression to consider:
>
> -- file compression, using a file compression algorithm suited to the
> file type
> -- hardware compression, which usually means compression done by a
> tape drive as the data is written to tape
> -- disk compression, which is performed by many new storage appliances
> and uses a combination of compression and deduplication
>
> If there are other kinds of compression, please add them to this
> discussion.
>
> *For each of these types of compression:*
>
> 1. Are you currently using this type of compression in your own
> archival storage (in the OAIS sense of long-term preservation storage)?
>
> 2. How do you feel about using this type of compression for archival
> storage? Is this legitimate or something that Best Practice would
> discourage?
>
> 3. What are the particular risks of this type of compression, if any,
> for preservation?
>
> 4. Are there any advantages to using this type of compression beyond
> reducing storage costs?
>
> 5. How do you trade off cost vs. risk?
>
>
> I looked for best practices or other documents that addressed
> compression in the preservation context. Below are some snippets of
> what I found, addressing mainly file and hardware compression.
>
> *Case Western University Archives:*
>
> Compression adds complexity to long-term preservation. Some compression
> techniques shed "redundant" information. As an example, JPEG removes
> information to reduce file size. The image might look fine on your
> current monitor, but as monitors improve, the lower quality of the
> image will be more obvious.
>
> *Wright, Miller and Addis* in
> http://www.prestoprime.org/docs/training/Cost_of_risk_RW.pdf
>
> Not encoding, in particular not using compression, typically results in
> files that have minimal sensitivity to corruption. In this way, the
> choice not to use compression is a way to mitigate against loss.
>
> *TNA on Image Compression* in
> http://www.nationalarchives.gov.uk/documents/image_compression.pdf
>
> It is recommended that algorithms should only be used in the
> circumstances for which they are most efficient. It is also strongly
> recommended that archival master versions of images should only be
> created and stored using lossless algorithms. The Intellectual Property
> Rights status of a compression algorithm is primarily an issue for
> developers of format specifications, and software encoders/decoders.
> However, the use of open, non-proprietary compression techniques is
> recommended for the purposes of sustainability.
>
> *Howard Besser*, quoted in
> http://digitalpreservationstrategies.blogspot.com/
>
> Data is often compressed or "scrambled" to assist in its storage and/or
> protect its intellectual content. These compression and encryption
> algorithms are often developed by private organisations who will one
> day cease to support them. If this happens you're stuck between a rock
> and a hard place. If you don't want to get into legal trouble you are
> no longer able to read your data; and if you go ahead and "do the
> unwrapping yourself" it's quite possible you're breaking copyright law.
>
> *NINCH Guide to Good Practice*
> http://www.nyu.edu/its/humanities/ninchguide/XIV/
>
> A similar obsolescence problem will have to be addressed with the file
> formats and compression techniques you choose. Do not rely on
> proprietary file formats and compression techniques, which may not be
> supported in the future as the companies which produce them merge, go
> out of business or move on to new products. In the cultural heritage
> community, the de facto standard formats are uncompressed TIFF for
> images and PDF, ASCII (SGML/XML markup) and RTF for text. Migration to
> future versions of these formats is likely to be well-supported, since
> they are so widely used. Digital objects for preservation should not be
> stored in compressed or encrypted formats.
>
> *PRESTO Centre, Threats to Data Integrity from Large-Scale Management
> Environments*
> http://www.prestocentre.org/library/resources/threats-data-integrity-use-large-scale-management-environments
>
> Compressed formats are in general much more sensitive to data
> corruption than uncompressed formats. Due to the 'amplification' effect
> that compression has on data corruption, the percentage saving in
> storage space is often much less than the percentage increase in the
> amount of information that is affected by data corruption.
>
>
> Priscilla
>
>
> ############################
>
> To unsubscribe from the NDSA-INFRASTRUCTURE list:
> write to: mailto:NDSA-INFRASTRUCTURE-SIGNOFF-
> [log in to unmask]
> or click the following link:
> http://list.digitalpreservation.gov/SCRIPTS/WA-DIGITAL.EXE?SUBED1=NDSA-
> INFRASTRUCTURE&A=1
