Although my example doesn't apply to all file formats, it does illustrate
what you're looking for:
is a free tool developed by Federal Agency groups to allow for the
reading/writing of metadata into the BWF and RIFF (BEXT & INFO
respectively) text chunks of WAV audio files. The salient point here is
that this approach was designed with the ability to generate and embed a
checksum of the PCM audio stream within the WAV container so that as new
metadata are added to the container, the audio can be validated against its
specific checksum, not a checksum of the entire container. In this
practice, one can generate a checksum for the audio information (the
content) and for the entire file itself (the content and the metadata).
Take a read through that and maybe it will inspire some ideas.
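
A minimal Python sketch of that idea, not the tool's actual implementation:
it walks the RIFF chunks of a WAV file and hashes only the payload of the
'data' chunk, alongside a checksum of the whole file. The function name
wav_checksums is hypothetical, MD5 is assumed purely for illustration, and
the whole file is read into memory for simplicity.

```python
import hashlib
import struct


def wav_checksums(path):
    """Return (audio_md5, file_md5) for a RIFF/WAVE file.

    audio_md5 covers only the payload of the 'data' chunk;
    file_md5 covers every byte of the file, metadata included.
    """
    with open(path, "rb") as f:
        blob = f.read()
    if blob[:4] != b"RIFF" or blob[8:12] != b"WAVE":
        raise ValueError("not a RIFF/WAVE file")

    audio = hashlib.md5()
    found = False
    pos = 12  # first chunk starts after 'RIFF', size, 'WAVE'
    while pos + 8 <= len(blob):
        chunk_id, size = struct.unpack_from("<4sI", blob, pos)
        if chunk_id == b"data":
            audio.update(blob[pos + 8 : pos + 8 + size])
            found = True
        pos += 8 + size + (size & 1)  # chunks are padded to an even length
    if not found:
        raise ValueError("no 'data' chunk found")
    return audio.hexdigest(), hashlib.md5(blob).hexdigest()
```

Adding or editing a text chunk changes the second value but should leave the
first one untouched, which is the property described above.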
I know in the moving image field there is also much activity around frame
by frame checksums for moving image material so that when a file is found
to be corrupt, you can even pinpoint which frame has the corruption.
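
As a rough illustration of the frame-by-frame idea, assuming uncompressed
material where every frame occupies the same number of bytes (real video
containers need a demuxer to find frame boundaries), one could store one
checksum per frame and later compare the lists. The names frame_checksums
and corrupt_frames are hypothetical.

```python
import hashlib
from itertools import zip_longest


def frame_checksums(stream, frame_size):
    """Return one MD5 hex digest per fixed-size frame of a byte string."""
    return [
        hashlib.md5(stream[i : i + frame_size]).hexdigest()
        for i in range(0, len(stream), frame_size)
    ]


def corrupt_frames(stored, current):
    """Return the indexes of frames whose checksum differs from the stored one.

    Frames missing from either list are reported as mismatches too.
    """
    return [
        i for i, (a, b) in enumerate(zip_longest(stored, current)) if a != b
    ]
```

Comparing the stored list against a freshly computed one returns the
positions of the damaged frames rather than a single pass/fail answer for
the whole file.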
Bertram Lyons, CA
AVPreserve | www.avpreserve.com
American Folklife Center | www.loc.gov/folklife
International Association of Sound and Audiovisual Archives |
On Mon, Jan 26, 2015 at 6:21 AM, Scancella, John wrote:
> The Library of Congress has several tools for making and working with
> BagIt bags:
> a Java command line tool and library
> a Python command line tool and library
> or a standalone Java desktop application (GUI based)
> -----Original Message-----
> From: Code for Libraries On Behalf Of
> Joe Hourcle
> Sent: Saturday, January 24, 2015 10:07 PM
> To: [log in to unmask]
> Subject: Re: [CODE4LIB] Checksums for objects and not embedded metadata
> On Jan 23, 2015, at 5:35 PM, Kyle Banerjee wrote:
> > Howdy all,
> > I've been toying with the idea of embedding DOIs in all our digital
> > assets and possibly inserting/updating other metadata as well.
> > However, doing this would alter checksums created using normal methods.
> > Is there a practical/easy way to checksum only the objects themselves
> > without the metadata? If the metadata in a tiff or other kind of file
> > is modified, it does nothing to the actual object. Since providing
> > more complete metadata within objects makes them more
> > usable/identifiable and might simplify migrations down the road, it
> > seems like this wouldn't be a bad way to go.
> The only file format that I'm aware of that has a provision for this is
> FITS (Flexible Image Transport System), which has a concept of a 'CHECKSUM'
> and a 'DATASUM'. (The 'DATASUM' is the checksum for only the payload
> portion; the 'CHECKSUM' includes the metadata.) It's possible that
> there are others, but I suspect that most consumer file formats won't have
> specific provisions for this.
> The problem with 'metadata' in a lot of file formats is that it's just
> stored in arbitrary segments -- you'd have to have a program that knew
> which segments were considered 'headers' vs. not. It might be easier to
> have it compute a separate checksum for each segment, so that should the
> modifications change their order, they'd still be considered valid.
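
A rough sketch of the per-segment approach described above, assuming the
file has already been split into byte-string segments by some
format-specific parser (not shown here). The names segment_digests and
same_segments are hypothetical, and SHA-256 is an arbitrary choice.

```python
import hashlib
from collections import Counter


def segment_digests(segments):
    """Return a multiset of SHA-256 hex digests, one per segment."""
    return Counter(hashlib.sha256(s).hexdigest() for s in segments)


def same_segments(a, b):
    """True if both files hold the same segments, regardless of order."""
    return segment_digests(a) == segment_digests(b)
```

Because the digests are compared as a multiset, reordering segments does not
count as a change, while altering, adding, or dropping a segment does.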
> Of course, I personally don't like changing files if I can help it.
> If it were me, I'd keep the metadata outside the file; if you're using
> BagIt, you could easily add additional metadata outside of the data
> If you're just doing this internally, and don't need the DOI to be
> attached to the file when it's served, you could also look into file
> systems that support arbitrary metadata. Older Macs used to use this,
> where there was a 'data fork' and a 'resource fork', but you had to have a
> service that knew to only send the data fork.
> Other OSes support forks, but some also have 'extended file attributes',
> which allow you to attach a few key/value pairs to the file. (The exact
> limits depend on the OS.)
>  http://fits.gsfc.nasa.gov/registry/checksum.html
>  https://tools.ietf.org/html/draft-kunze-bagit