Kyle,

It's a bit of a hack, but you could write a script to delete all the
metadata from images with ExifTool and then run checksums on the resulting
images (see http://u88.n24.queensu.ca/exiftool/forum/index.php?topic=4902.0).
exiv2 might also work. I don't think you'd want to do that every time you
audited the files, though; generating new checksums is a faster approach.
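
If it helps, here's the kind of thing I was picturing: a rough, untested
Python sketch that copies each image, strips all metadata from the copy
with exiftool (assumed to be on your PATH), and checksums what's left. The
helper and file names are just illustrative.

import hashlib
import os
import shutil
import subprocess
import sys
import tempfile

def content_md5(path):
    # MD5 of the file with every metadata tag stripped; the original is
    # never touched because we work on a temporary copy.
    fd, tmp_path = tempfile.mkstemp(suffix=os.path.splitext(path)[1])
    os.close(fd)
    try:
        shutil.copyfile(path, tmp_path)
        # -all= deletes all metadata; -overwrite_original skips the backup file
        subprocess.check_call(
            ['exiftool', '-all=', '-overwrite_original', tmp_path],
            stdout=subprocess.DEVNULL)
        with open(tmp_path, 'rb') as f:
            return hashlib.md5(f.read()).hexdigest()
    finally:
        os.remove(tmp_path)

if __name__ == '__main__':
    for name in sys.argv[1:]:
        print(content_md5(name), name)

Working on copies keeps the masters untouched; the slow part is that you'd
be running exiftool over everything on every audit.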

I haven't tried this, but I know that there's a program called ssdeep,
developed for the digital forensics community, that can do piecewise hashing
-- it hashes chunks of content and then compares the hashes for the
different chunks to find matches. In theory, it might be able to match
files with embedded metadata against files without; the use cases described
on the forensics wiki are finding altered (truncated) files and reuse of
source code.  http://www.forensicswiki.org/wiki/Ssdeep
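
If you want to experiment, a minimal sketch -- untested, and assuming the
Python ssdeep bindings are installed (pip install ssdeep) -- would be
something like this, comparing a file that has embedded metadata against a
stripped copy:

import sys
import ssdeep  # ssdeep fuzzy-hashing bindings

def match_score(file_a, file_b):
    # Compare the two files' fuzzy hashes; 0 means no match at all,
    # 100 means the content is effectively identical.
    hash_a = ssdeep.hash_from_file(file_a)
    hash_b = ssdeep.hash_from_file(file_b)
    return ssdeep.compare(hash_a, hash_b)

if __name__ == '__main__':
    with_metadata, stripped = sys.argv[1], sys.argv[2]
    print(match_score(with_metadata, stripped))

A high score between the two would suggest the underlying content hasn't
changed even though the metadata has, which sounds like the kind of match
you're after -- but again, I haven't tried it myself.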

Danielle Cunniff Plumer

On Sun, Jan 25, 2015 at 9:44 AM, Kyle Banerjee <[log in to unmask]>
wrote:

> On Sat, Jan 24, 2015 at 11:07 AM, Rosalyn Metz <[log in to unmask]>
> wrote:
>
> >
> >    - How is your content packaged?
> >    - Are you talking about the SIPs or the AIPs or both?
> >    - Is your content in an instance of Fedora, a unix file structure, or
> >    something else?
> >    - Are you generating checksums on the whole package, parts of it, or
> >    both?
> >
>
> The quick answer is that this is a low-tech operation. We're currently on
> regular filesystems, where we are limited to feeding md5 checksums into a
> list. I'm looking for a low-tech way that makes it easier to keep track of
> resources across a variety of platforms in a decentralized environment and
> which will easily adapt to future technology transitions. For example, we
> have a bunch of stuff in Bepress and Omeka. Neither of those is good for
> preservation, so authoritative files live elsewhere, as do a huge number of
> resources that aren't in these platforms. Filenames are terrible
> identifiers, and things get moved around even if people don't mess with the
> files.
>
> We are also trying to come up with something that deals with different
> kinds of datasets (we're focusing on bioimaging at the moment) and fits
> into the workflow of campus units, each of which needs to manage tens of
> thousands of files with very little metadata on regular filesystems. Some
> of the resources are enormous in terms of size or number of members.
>
> Simply embedding an identifier in the file is a really easy way to tell
> which files have metadata and what metadata is there. In the case at hand,
> I could just do that and generate new checksums. But I think the generic
> problem of making better use of embedded metadata is an interesting one, as
> it can make objects more usable and understandable once they're removed
> from their original context. For example, just this past Friday I received
> a request to use an image someone downloaded for a book. Unfortunately, he
> just emailed me a copy of the image, described what he wanted to do, and
> asked for permission, but he couldn't replicate how he found it. An
> identifier would have been handy, as would embedded rights info, since
> rights are not the same for all of our images. The reason we're using DOIs
> is that they work well for anything and can easily be recognized by syntax
> wherever they may appear.
>
> On Sat, Jan 24, 2015 at 7:06 PM, Joe Hourcle <[log in to unmask]> wrote:
>
> >
> > The problem with 'metadata' in a lot of file formats is that it's
> > just arbitrary segments -- you'd have to have a program that knew
> > which segments were considered 'headers' vs. not.  It might be easier
> > to have it compute a separate checksum for each segment, so that
> > should the modifications change their order, they'd still be
> > considered valid.
> >
>
> This is what I seemed to be bumping up against, so I was hoping there was
> an easy workaround. But this is helpful information. Thanks,
>
> kyle
>