The Library of Congress has several tools for making and working with BagIt bags.

A Java command-line tool and library:
https://github.com/LibraryOfCongress/bagit-java

A Python command-line tool and library:
https://github.com/LibraryOfCongress/bagit-python

And a standalone Java desktop application (GUI-based):
https://github.com/LibraryOfCongress/bagger

-----Original Message-----
From: Code for Libraries [mailto:[log in to unmask]] On Behalf Of Joe Hourcle
Sent: Saturday, January 24, 2015 10:07 PM
To: [log in to unmask]
Subject: Re: [CODE4LIB] Checksums for objects and not embedded metadata

On Jan 23, 2015, at 5:35 PM, Kyle Banerjee wrote:

> Howdy all,
> 
> I've been toying with the idea of embedding DOI's in all our digital 
> assets and possibly inserting/updating other metadata as well. 
> However, doing this would alter checksums created using normal methods.
> 
> Is there a practical/easy way to checksum only the objects themselves 
> without the metadata? If the metadata in a tiff or other kind of file 
> is modified, it does nothing to the actual object. Since providing 
> more complete metadata within objects makes them more 
> usable/identifiable and might simplify migrations down the road, it 
> seems like this wouldn't be a bad way to go.


The only file format that I'm aware of that has a provision for this is FITS (Flexible Image Transport System), which has a concept of a 'CHECKSUM' and a 'DATASUM'.  (The 'DATASUM' is the checksum for only the payload portion; the 'CHECKSUM' includes the metadata.)[1]  It's possible that there are others, but I suspect that most consumer file formats won't have specific provisions for this.

The problem with 'metadata' in a lot of file formats is that it's stored in arbitrary segments -- you'd have to have a program that knew which segments were considered 'headers' vs. not.  It might be easier to have that program compute a separate checksum for each segment, so that if the modifications change the order of the segments, the unchanged ones would still be considered valid.
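
A minimal sketch of the per-segment idea, assuming the file has already been split into named segments by some format-aware parser (not shown here). The segment names, the byte values, and the DOI are all hypothetical, and the sketch assumes segment names are unique within a file.

```python
import hashlib


def digest_segments(segments):
    """Map each segment name to the SHA-256 hex digest of its bytes."""
    return {name: hashlib.sha256(data).hexdigest() for name, data in segments}


# Hypothetical segments of one file, as (name, bytes) pairs.
original = [
    ("header", b"title=untitled"),
    ("pixels", b"\x00\x01\x02\x03"),
]

# The same file after a DOI is embedded and the segments are written
# back in a different order.
modified = [
    ("pixels", b"\x00\x01\x02\x03"),
    ("header", b"title=untitled;doi=10.1234/example"),
]

before = digest_segments(original)
after = digest_segments(modified)

# The payload digest survives both the metadata edit and the reordering.
payload_unchanged = before["pixels"] == after["pixels"]
header_changed = before["header"] != after["header"]
```

Keying the digests by segment name is what makes the comparison independent of segment order; a single checksum over the whole file would change in both cases.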

Of course, I personally don't like changing files if I can help it.
If it were me, I'd keep the metadata outside the file;  if you're using BagIt, you could easily add additional metadata outside of the data directory.[2]
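
A rough sketch of why this leaves the checksums alone, using only the Python standard library rather than any of the BagIt tools. It computes manifest-style lines for files under data/ and then adds a tag file beside that directory. The file names and the DOI are hypothetical, and this is not a complete bag: a real one also needs bagit.txt and a manifest file.

```python
import hashlib
import tempfile
from pathlib import Path


def payload_manifest(bag_dir):
    """Return 'checksum  path' lines for the files under data/ only."""
    bag = Path(bag_dir)
    lines = []
    for path in sorted((bag / "data").rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            lines.append(f"{digest}  {path.relative_to(bag).as_posix()}")
    return lines


with tempfile.TemporaryDirectory() as tmp:
    bag = Path(tmp)
    (bag / "data").mkdir()
    # Stand-in payload file; the bytes are made up.
    (bag / "data" / "image.tif").write_bytes(b"fake image bytes")

    before = payload_manifest(bag)

    # Record the identifier in a tag file next to data/, not in the payload.
    (bag / "bag-info.txt").write_text(
        "External-Identifier: doi:10.1234/example\n", encoding="utf-8"
    )

    after = payload_manifest(bag)

# Adding metadata outside data/ does not touch the payload checksums.
manifest_unchanged = before == after
```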

If you're just doing this internally, and don't need the DOI to be attached to the file when it's served, you could also look into file systems that support arbitrary metadata.  Older Macs used to do this, with a 'data fork' and a 'resource fork', but you had to have a service that knew to only send the data fork.
Other OSes support forks, but some also have 'extended file attributes', which allow you to attach a few key/value pairs to the file.  (Exact limits depend on the OS.)
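
A small sketch of the extended-attribute approach in Python, assuming Linux, where os.setxattr is available. The attribute name "user.doi" and the DOI value are hypothetical, and the call can fail on file systems that don't allow user attributes, so the sketch treats the attribute as best-effort.

```python
import hashlib
import os
import tempfile


def sha256_of(path):
    """Return the SHA-256 hex digest of the file's contents."""
    with open(path, "rb") as fh:
        return hashlib.sha256(fh.read()).hexdigest()


def try_set_doi(path, doi):
    """Attach the DOI as an extended attribute; return False if unsupported."""
    if not hasattr(os, "setxattr"):  # os.setxattr only exists on Linux
        return False
    try:
        os.setxattr(path, "user.doi", doi.encode("utf-8"))
        return True
    except OSError:  # e.g. the file system does not allow user attributes
        return False


with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"fake image bytes")  # stand-in file contents
    path = tmp.name

try:
    before = sha256_of(path)
    attached = try_set_doi(path, "10.1234/example")
    after = sha256_of(path)
finally:
    os.unlink(path)

# The attribute lives outside the file's contents, so the checksum is the same.
checksum_unchanged = before == after
```

Whether the attribute survives copying, archiving, or serving the file depends on the tools involved, which is the same caveat as with forks.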

-Joe


[1] http://fits.gsfc.nasa.gov/registry/checksum.html
[2] https://tools.ietf.org/html/draft-kunze-bagit ; http://en.wikipedia.org/wiki/BagIt