Also just stumbled across this on Stack Overflow.
http://stackoverflow.com/questions/12115824/compute-the-hash-of-only-the-core-image-data-of-a-tiff

On Wed, Jan 28, 2015 at 10:32 AM, Ronald Houk wrote:

> Hello,
>
> I like Danielle's idea. I wonder if it wouldn't be a good idea to
> decouple the metadata from the data permanently. ExifTool allows you to
> export the metadata in lots of different formats like JSON. You could
> export the metadata into JSON, run the checksums and then store the photo
> and the JSON file in a single tarball. From there you could use a JSON
> editor to modify/add metadata.
>
> It would be simple to reintroduce the metadata into the file when needed.
>
> On Mon, Jan 26, 2015 at 10:27 AM, danielle plumer wrote:
>
>> Kyle,
>>
>> It's a bit of a hack, but you could write a script to delete all the
>> metadata from images with ExifTool and then run checksums on the
>> resulting image (see
>> http://u88.n24.queensu.ca/exiftool/forum/index.php?topic=4902.0).
>> exiv2 might also work. I don't think you'd want to do that every time
>> you audited the files, though; generating new checksums is a faster
>> approach.
>>
>> I haven't tried this, but I know that there's a program called ssdeep
>> developed for the digital forensics community that can do piecewise
>> hashing -- it hashes chunks of content and then compares the hashes for
>> the different chunks to find matches, in theory. It might be able to
>> match files with embedded metadata vs. files without; the use cases
>> described on the forensics wiki are finding altered (truncated) files,
>> or reuse of source code. http://www.forensicswiki.org/wiki/Ssdeep
>>
>> Danielle Cunniff Plumer
>>
>> On Sun, Jan 25, 2015 at 9:44 AM, Kyle Banerjee wrote:
>>
>> > On Sat, Jan 24, 2015 at 11:07 AM, Rosalyn Metz wrote:
>> >
>> > >
>> > > - How is your content packaged?
>> > > - Are you talking about the SIPs or the AIPs or both?
>> > > - Is your content in an instance of Fedora, a unix file structure,
>> > >   or something else?
>> > > - Are you generating checksums on the whole package, parts of it,
>> > >   both?
>> > >
>> >
>> > The quick answer to this is that this is a low-tech operation. We're
>> > currently on regular filesystems where we are limited to feeding md5
>> > checksums into a list. I'm looking for a low-tech way that makes it
>> > easier to keep track of resources across a variety of platforms in a
>> > decentralized environment and which will easily adapt to future
>> > technology transitions. For example, we have a bunch of stuff in
>> > Bepress and Omeka. Neither of those is good for preservation, so
>> > authoritative files live elsewhere as do a huge number of resources
>> > that aren't in these platforms. Filenames are terrible identifiers and
>> > things get moved around even if people don't mess with the files.
>> >
>> > We also are trying to come up with something that deals with different
>> > kinds of datasets (we're focusing on bioimaging at the moment) and
>> > fits in the workflow of campus units, each of which needs to manage
>> > tens of thousands of files with very little metadata on regular
>> > filesystems. Some of the resources are enormous in terms of size or
>> > number of members.
>> >
>> > Simply embedding an identifier in the file is a really easy way to
>> > tell which files have metadata and which metadata is there. In the
>> > case at hand, I could just do that and generate new checksums. But I
>> > think the generic problem of making better use of embedded metadata is
>> > an interesting one as it can make objects more usable and
>> > understandable once they're removed. For example, just this past
>> > Friday I received a request to use an image someone downloaded for a
>> > book.
>> > Unfortunately, he just emailed me a copy of the image, described what
>> > he wanted to do, and asked for permission but he couldn't replicate
>> > how he found it. An identifier would have been handy as would have
>> > been embedded rights info as this is not the same for all of our
>> > images. The reason we're using DOIs is that they work well for
>> > anything and can easily be recognized by syntax wherever they may
>> > appear.
>> >
>> > On Sat, Jan 24, 2015 at 7:06 PM, Joe Hourcle wrote:
>> >
>> > >
>> > > The problem with 'metadata' in a lot of file formats is that they're
>> > > just arbitrary segments -- you'd have to have a program that knew
>> > > which segments were considered 'headers' vs. not. It might be easier
>> > > to have it be able to compute a separate checksum for each segment,
>> > > so that should the modifications change their order, they'd still
>> > > be considered valid.
>> > >
>> >
>> > This is what I seemed to be bumping up against so I was hoping there
>> > was an easy workaround. But this is helpful information. Thanks,
>> >
>> > kyle
>> >
>
> --
> Ronald Houk
> Assistant Director
> Ottumwa Public Library
> 102 W. Fourth Street
> Ottumwa, IA 52501
> (641)682-7563x203

--
Ronald Houk
Assistant Director
Ottumwa Public Library
102 W. Fourth Street
Ottumwa, IA 52501
(641)682-7563x203
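
A rough sketch of the per-segment checksum idea Joe describes in the quoted thread. It uses PNG only because its chunk layout is easy to walk with the Python standard library; TIFF and JPEG would need their own parsers. The function names (iter_png_chunks, chunk_digests, core_digest, build_chunk) are made up for illustration, and treating every ancillary chunk as "metadata" is an assumption: some ancillary chunks, such as tRNS or gAMA, affect how the image renders.

```python
import hashlib
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def iter_png_chunks(data):
    """Yield (chunk_type, chunk_data) pairs from the bytes of a PNG file."""
    if not data.startswith(PNG_SIGNATURE):
        raise ValueError("not a PNG file")
    pos = len(PNG_SIGNATURE)
    while pos < len(data):
        # Each chunk is: 4-byte big-endian length, 4-byte type, data, 4-byte CRC.
        length, ctype = struct.unpack(">I4s", data[pos:pos + 8])
        yield ctype, data[pos + 8:pos + 8 + length]
        pos += 12 + length


def chunk_digests(data):
    """Return [(chunk_type, md5_hex)] with one checksum per chunk."""
    return [(ctype.decode("ascii"), hashlib.md5(body).hexdigest())
            for ctype, body in iter_png_chunks(data)]


def core_digest(data):
    """MD5 over critical chunks only (type begins with an uppercase letter).

    Assumption: ancillary chunks (lowercase first letter, e.g. tEXt, eXIf)
    are treated as metadata and skipped.
    """
    h = hashlib.md5()
    for ctype, body in iter_png_chunks(data):
        if ctype[:1].isupper():
            h.update(ctype)
            h.update(body)
    return h.hexdigest()


def build_chunk(ctype, body):
    """Hypothetical helper: serialize one chunk, used to build test files."""
    crc = zlib.crc32(ctype + body) & 0xFFFFFFFF
    return struct.pack(">I", len(body)) + ctype + body + struct.pack(">I", crc)
```

With this, a file that gains a tEXt chunk carrying an identifier keeps the same core_digest while its whole-file MD5 changes, and the per-chunk list from chunk_digests shows which segment was added.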