Kyle --

Although my example doesn't apply to all file formats, it does illustrate what you're looking for: BWF MetaEdit (http://www.digitizationguidelines.gov/guidelines/digitize-embedding.html) is a free tool developed by Federal Agency groups for reading and writing metadata in the BWF and RIFF text chunks (BEXT and INFO, respectively) of WAV audio files. The salient point here is that it was designed with the ability to generate and embed a checksum of the PCM audio stream within the WAV container, so that as new metadata are added to the container, the audio can be validated against its own checksum, not a checksum of the entire container. In this practice, one can generate a checksum for the audio information (the content) and another for the entire file (the content and the metadata).

Take a read through that and maybe it will inspire some ideas. I know in the moving image field there is also much activity around frame-by-frame checksums, so that when a file is found to be corrupt, you can even pinpoint which frame has the corruption.

Best --

Bert

Bertram Lyons, CA
AVPreserve | www.avpreserve.com
American Folklife Center | www.loc.gov/folklife
International Association of Sound and Audiovisual Archives | www.iasa-web.org

On Mon, Jan 26, 2015 at 6:21 AM, Scancella, John <[log in to unmask]> wrote:

> The library of congress has several tools for making and working with
> bagit bags.
>
> Java command line tool and library
> https://github.com/LibraryOfCongress/bagit-java
>
> a python command line tool and library
> https://github.com/LibraryOfCongress/bagit-python
>
> or a standalone java desktop application (GUI based)
> https://github.com/LibraryOfCongress/bagger
>
> -----Original Message-----
> From: Code for Libraries [mailto:[log in to unmask]] On Behalf Of
> Joe Hourcle
> Sent: Saturday, January 24, 2015 10:07 PM
> To: [log in to unmask]
> Subject: Re: [CODE4LIB] Checksums for objects and not embedded metadata
>
> On Jan 23, 2015, at 5:35 PM, Kyle Banerjee wrote:
>
> > Howdy all,
> >
> > I've been toying with the idea of embedding DOI's in all our digital
> > assets and possibly inserting/updating other metadata as well.
> > However, doing this would alter checksums created using normal methods.
> >
> > Is there a practical/easy way to checksum only the objects themselves
> > without the metadata? If the metadata in a tiff or other kind of file
> > is modified, it does nothing to the actual object. Since providing
> > more complete metadata within objects makes them more
> > usable/identifiable and might simplify migrations down the road, it
> > seems like this wouldn't be a bad way to go.
>
>
> The only file format that I'm aware of that has a provision for this is
> FITS (Flexible Image Transport System), which was a concept of a 'CHECKSUM'
> and a 'DATASUM'. (the 'DATASUM' is the checksum for only the payload
> portion, the 'CHECKSUM' includes the metadata)[1]. It's possible that
> there are others, but I suspect that most consumer file formats won't have
> specific provisions for this.
>
> The problems with 'metadata' in a lot of file formats is that they're just
> arbitrary segments -- you'd have to have a program that knew which segments
> were considered 'headers' vs. not.
> It might be easier to have it be able
> to compute a separate checksum for each segment, so that should the
> modifications change their order, they'd still be considered valid.
>
> Of course, I personally don't like changing files if I can help it.
> If it were me, I'd keep the metadata outside the file; if you're using
> BagIt, you could easily add additional metadata outside of the data
> directory.[2]
>
> If you're just doing this internally, and don't need the DOI to be
> attached to the file when it's served, you could also look into file
> systems that support arbitrary metadata. Older Macs used to use this,
> where there was a 'data fork' and a 'resource fork', but you had to have a
> service that knew to only send the data fork.
> Other OSes support forks, but some also have 'extended file attributes',
> which allows you to attach a few key/value pairs to the file. (exact
> limits are dependent upon the OS).
>
> -Joe
>
>
> [1] http://fits.gsfc.nasa.gov/registry/checksum.html
> [2] https://tools.ietf.org/html/draft-kunze-bagit ;
> http://en.wikipedia.org/wiki/BagIt
>
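
For anyone who wants to experiment with the "checksum the content separately from the container" idea discussed in this thread, here is a rough Python sketch for plain RIFF/WAVE files. It is only an illustration and not how BWF MetaEdit itself is implemented; the function name wav_checksums is made up for this example, MD5 is an arbitrary choice of hash, and the code assumes a well-formed file small enough to read into memory (no RF64 support).

```python
import hashlib
import struct


def wav_checksums(path):
    """Return (file_md5, data_md5) for a RIFF/WAVE file.

    file_md5 covers the whole container (audio plus metadata chunks);
    data_md5 covers only the payload of the 'data' chunk (the audio).
    Hypothetical helper for illustration, not part of any existing tool.
    """
    with open(path, "rb") as f:
        blob = f.read()
    if blob[:4] != b"RIFF" or blob[8:12] != b"WAVE":
        raise ValueError("not a RIFF/WAVE file")

    file_md5 = hashlib.md5(blob).hexdigest()

    # Walk the chunk list: each chunk is a 4-byte ID, a 4-byte
    # little-endian size, then the body (padded to an even length).
    pos = 12
    while pos + 8 <= len(blob):
        chunk_id, size = struct.unpack_from("<4sI", blob, pos)
        if chunk_id == b"data":
            body = blob[pos + 8 : pos + 8 + size]
            return file_md5, hashlib.md5(body).hexdigest()
        pos += 8 + size + (size & 1)

    raise ValueError("no data chunk found")
```

If a metadata chunk (say, a LIST/INFO chunk carrying a DOI) is added or edited later, the first value changes while the second should stay the same, which is the property Kyle was asking about. The same chunk walk could be extended to hash every chunk separately, along the lines Joe suggests.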