On Jan 23, 2015, at 5:35 PM, Kyle Banerjee wrote:

> Howdy all,
> 
> I've been toying with the idea of embedding DOI's in all our digital assets
> and possibly inserting/updating other metadata as well. However, doing this
> would alter checksums created using normal methods.
> 
> Is there a practical/easy way to checksum only the objects themselves
> without the metadata? If the metadata in a tiff or other kind of file is
> modified, it does nothing to the actual object. Since providing more
> complete metadata within objects makes them more usable/identifiable and
> might simplify migrations down the road, it seems like this wouldn't be a
> bad way to go.


The only file format that I'm aware of that has a provision for this
is FITS (Flexible Image Transport System), which has the concepts of
a 'CHECKSUM' and a 'DATASUM'.  ('DATASUM' is the checksum for only
the payload portion; 'CHECKSUM' also covers the metadata.)[1]  It's
possible that there are others, but I suspect that most consumer
file formats won't have specific provisions for this.

The problem with 'metadata' in a lot of file formats is that it's
just stored in arbitrary segments -- you'd have to have a program that
knew which segments were considered 'headers' vs. not.  It might be
easier to have it compute a separate checksum for each segment, so
that should the modifications change the segments' order, the
checksums would still be considered valid.
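
As a rough sketch of the per-segment idea, here is what it might look
like in Python.  PNG is used purely as an illustration because its
chunk layout is simple; it is not a format mentioned above.  The
function names and the choice of which chunk types count as 'payload'
are assumptions for the example, not an established convention.

```python
import hashlib
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def iter_chunks(data):
    """Yield (chunk_type, chunk_body) for each chunk in a PNG byte string."""
    if data[:8] != PNG_SIGNATURE:
        raise ValueError("not a PNG file")
    pos = 8
    while pos + 8 <= len(data):
        length, ctype = struct.unpack(">I4s", data[pos:pos + 8])
        yield ctype, data[pos + 8:pos + 8 + length]
        pos += 12 + length  # 4 length + 4 type + body + 4 CRC


def segment_digests(data):
    """One SHA-256 per chunk, sorted so chunk order does not matter."""
    return sorted(
        (ctype.decode("ascii"), hashlib.sha256(body).hexdigest())
        for ctype, body in iter_chunks(data)
    )


def payload_digest(data, payload_types=(b"IHDR", b"PLTE", b"IDAT")):
    """SHA-256 over the chunks assumed to be payload (an assumption here)."""
    h = hashlib.sha256()
    for ctype, body in iter_chunks(data):
        if ctype in payload_types:
            h.update(ctype)
            h.update(body)
    return h.hexdigest()


def make_chunk(ctype, body):
    """Helper for the demo: build one PNG chunk with its CRC."""
    crc = zlib.crc32(ctype + body) & 0xFFFFFFFF
    return struct.pack(">I", len(body)) + ctype + body + struct.pack(">I", crc)


# Demo: a 1x1 grayscale image, with and without an embedded text chunk.
ihdr = make_chunk(b"IHDR", struct.pack(">IIBBBBB", 1, 1, 8, 0, 0, 0, 0))
idat = make_chunk(b"IDAT", zlib.compress(b"\x00\x00"))
iend = make_chunk(b"IEND", b"")
text = make_chunk(b"tEXt", b"DOI\x0010.1234/example")  # made-up DOI

plain = PNG_SIGNATURE + ihdr + idat + iend
tagged = PNG_SIGNATURE + ihdr + text + idat + iend

print(payload_digest(plain) == payload_digest(tagged))
print(hashlib.sha256(plain).digest() == hashlib.sha256(tagged).digest())
```

The payload digest ignores the added text chunk, while a whole-file
checksum does not.  Note that the data chunks themselves still have to
be hashed in order, since reordering them would change the image.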

Of course, I personally don't like changing files if I can help it.
If it were me, I'd keep the metadata outside the file;  if you're
using BagIt, you could easily add additional metadata outside of
the data directory.[2]
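
A minimal sketch of what that could look like, assuming a bag whose
payload already sits under data/.  The function names, the use of
SHA-256, the version string, and the choice of the External-Identifier
field for the DOI are assumptions for illustration; a real deployment
would more likely use an existing BagIt library.

```python
import hashlib
from pathlib import Path


def sha256_file(path):
    """Hash a file in blocks so large assets need not fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(65536), b""):
            h.update(block)
    return h.hexdigest()


def write_bag(bag_dir, doi):
    """Write the tag files for a bag; the payload under data/ is not touched."""
    bag = Path(bag_dir)
    payload = sorted(p for p in (bag / "data").rglob("*") if p.is_file())

    # The payload manifest lists checksums for files under data/ only.
    lines = [
        f"{sha256_file(p)}  {p.relative_to(bag).as_posix()}\n" for p in payload
    ]
    (bag / "manifest-sha256.txt").write_text("".join(lines), encoding="utf-8")

    # Bag declaration (version string assumed to match the cited draft).
    (bag / "bagit.txt").write_text(
        "BagIt-Version: 0.97\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )

    # The DOI lives here, outside data/, so changing it later leaves
    # every payload checksum in the manifest untouched.
    (bag / "bag-info.txt").write_text(
        f"External-Identifier: doi:{doi}\n", encoding="utf-8"
    )
```

Calling write_bag() again with a different DOI rewrites bag-info.txt
but produces an identical manifest, which is the point of keeping the
metadata out of the data directory.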

If you're just doing this internally, and don't need the DOI to be
attached to the file when it's served, you could also look into
file systems that support arbitrary metadata.  Older Macs did
this, with a 'data fork' and a 'resource fork', but you had to
have a service that knew to only send the data fork.  Other OSes
support forks, but some also have 'extended file attributes',
which allow you to attach a few key/value pairs to the file.
(Exact limits are dependent upon the OS and file system.)
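
For the extended-attribute route, a hedged sketch in Python follows.
os.setxattr and os.getxattr are only available on Linux, and the file
system has to allow user attributes, so the helpers fall back quietly
when they can't be used.  The attribute name 'user.doi' and the
function names are made up for the example.

```python
import hashlib
import os


def tag_with_doi(path, doi, key="user.doi"):
    """Try to store the DOI as an extended attribute; True on success."""
    try:
        os.setxattr(path, key, doi.encode("utf-8"))
        return True
    except (AttributeError, OSError):
        # AttributeError: this platform has no os.setxattr.
        # OSError: the file system rejects or lacks user attributes.
        return False


def read_doi(path, key="user.doi"):
    """Return the stored DOI, or None if it is missing or unsupported."""
    try:
        return os.getxattr(path, key).decode("utf-8")
    except (AttributeError, OSError):
        return None


def file_digest(path):
    """SHA-256 of the file contents only; attributes are not included."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()
```

Because the attribute is stored alongside the file rather than in it,
file_digest() gives the same answer before and after tagging.  The
catch is that many copy, archive, and transfer tools drop extended
attributes unless told to preserve them.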

-Joe


[1] http://fits.gsfc.nasa.gov/registry/checksum.html
[2] https://tools.ietf.org/html/draft-kunze-bagit ; http://en.wikipedia.org/wiki/BagIt