Hello,
I like Danielle's idea. I wonder whether it would be a good idea to decouple
the metadata from the data permanently. ExifTool lets you export the
metadata in lots of different formats, including JSON. You could export the
metadata to JSON, run the checksums, and then store the photo and the JSON
file in a single tarball. From there you could use a JSON editor to
modify/add metadata.
It would be simple to reintroduce the metadata into the file when needed.
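
Something along these lines ought to work (an untested sketch; assumes
exiftool is on your PATH, and "photo.jpg" is just a placeholder name):

    # Export metadata to JSON, strip it from the image, checksum the
    # stripped image, then bundle image + metadata into one tarball.
    import hashlib
    import subprocess
    import tarfile

    src = "photo.jpg"

    # Dump all metadata as JSON (exiftool -json).
    meta = subprocess.run(["exiftool", "-json", src],
                          capture_output=True, text=True,
                          check=True).stdout
    with open(src + ".json", "w") as f:
        f.write(meta)

    # Strip the metadata in place; -overwrite_original skips the
    # "_original" backup copy exiftool would otherwise leave behind.
    subprocess.run(["exiftool", "-all=", "-overwrite_original", src],
                   check=True)

    # Checksum the stripped image.
    print(src, hashlib.md5(open(src, "rb").read()).hexdigest())

    # Bundle the stripped image and its JSON metadata together.
    with tarfile.open(src + ".tar", "w") as tar:
        tar.add(src)
        tar.add(src + ".json")

Putting the metadata back should be a one-liner, since exiftool can
import from the same JSON it exports: exiftool -json=photo.jpg.json
photo.jpg
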
On Mon, Jan 26, 2015 at 10:27 AM, Danielle Plumer <[log in to unmask]>
wrote:
> Kyle,
>
> It's a bit of a hack, but you could write a script to delete all the
> metadata from images with ExifTool and then run checksums on the resulting
> image (see http://u88.n24.queensu.ca/exiftool/forum/index.php?topic=4902.0
> ).
> exiv2 might also work. I don't think you'd want to do that every time you
> audited the files, though; generating new checksums is a faster approach.
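>
> A minimal sketch of that approach (untested; assumes exiftool is
> installed, and the filenames are made up) -- strip to a new file so the
> original is untouched, then hash the stripped copy:
>
>     # -all= removes every writable tag; -o writes the stripped
>     # result to a new file instead of modifying the original.
>     import hashlib, subprocess
>
>     subprocess.run(["exiftool", "-all=", "-o", "stripped.jpg",
>                     "photo.jpg"], check=True)
>     print(hashlib.md5(open("stripped.jpg", "rb").read()).hexdigest())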
>
> I haven't tried this, but I know that there's a program called ssdeep,
> developed for the digital forensics community, that can do piecewise
> hashing -- it hashes chunks of content and then compares the hashes for
> the different chunks to find matches. In theory, it might be able to
> match files with embedded metadata against files without; the use cases
> described on the forensics wiki are finding altered (truncated) files
> and reuse of source code. http://www.forensicswiki.org/wiki/Ssdeep
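>
> If you want to experiment, the Python binding exposes the two
> operations you'd need (a sketch; assumes "pip install ssdeep" and
> hypothetical filenames):
>
>     # Fuzzy-hash a file and its metadata-stripped twin, then compare.
>     # compare() returns a 0-100 similarity score, not a yes/no match.
>     import ssdeep
>
>     h1 = ssdeep.hash_from_file("photo.jpg")
>     h2 = ssdeep.hash_from_file("stripped.jpg")
>     print(ssdeep.compare(h1, h2))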
>
> Danielle Cunniff Plumer
>
> On Sun, Jan 25, 2015 at 9:44 AM, Kyle Banerjee <[log in to unmask]>
> wrote:
>
> > On Sat, Jan 24, 2015 at 11:07 AM, Rosalyn Metz <[log in to unmask]>
> > wrote:
> >
> > >
> > > - How is your content packaged?
> > > - Are you talking about the SIPs or the AIPs or both?
> > > - Is your content in an instance of Fedora, a unix file structure,
> > > or something else?
> > > - Are you generating checksums on the whole package, parts of it,
> > > or both?
> > >
> >
> > The quick answer to this is that this is a low-tech operation. We're
> > currently on regular filesystems where we are limited to feeding md5
> > checksums into a list. I'm looking for a low-tech way that makes it
> > easier to keep track of resources across a variety of platforms in a
> > decentralized environment and which will easily adapt to future
> > technology transitions. For example, we have a bunch of stuff in
> > Bepress and Omeka. Neither of those is good for preservation, so
> > authoritative files live elsewhere, as do a huge number of resources
> > that aren't in these platforms. Filenames are terrible identifiers,
> > and things get moved around even if people don't mess with the files.
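> >
> > To give a concrete picture, the list is nothing fancier than what a
> > short script like this produces (a sketch; "archive" is a made-up
> > directory name):
> >
> >     # Walk a directory tree and write an md5 manifest, one
> >     # "<hash>  <path>" line per file (the same layout md5sum uses).
> >     import hashlib, os
> >
> >     with open("manifest.md5", "w") as out:
> >         for root, dirs, files in os.walk("archive"):
> >             for name in files:
> >                 path = os.path.join(root, name)
> >                 digest = hashlib.md5(open(path, "rb").read()).hexdigest()
> >                 out.write(digest + "  " + path + "\n")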
> >
> > We are also trying to come up with something that deals with
> > different kinds of datasets (we're focusing on bioimaging at the
> > moment) and fits in the workflow of campus units, each of which needs
> > to manage tens of thousands of files with very little metadata on
> > regular filesystems. Some of the resources are enormous in terms of
> > size or number of members.
> >
> > Simply embedding an identifier in the file is a really easy way to
> > tell which files have metadata and what metadata is there. In the
> > case at hand, I could just do that and generate new checksums. But I
> > think the generic problem of making better use of embedded metadata
> > is an interesting one, as it can make objects more usable and
> > understandable once they're removed from their original context. For
> > example, just this past Friday I received a request to use an image
> > someone downloaded for a book. Unfortunately, he just emailed me a
> > copy of the image, described what he wanted to do, and asked for
> > permission, but he couldn't replicate how he found it. An identifier
> > would have been handy, as would embedded rights info, since this is
> > not the same for all of our images. The reason we're using DOIs is
> > that they work well for anything and can easily be recognized by
> > syntax wherever they may appear.
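> >
> > The embedding itself looks easy enough (a sketch; assumes exiftool,
> > and both the DOI and the filename are made up):
> >
> >     # Write a DOI into the XMP dc:identifier tag, then read it back.
> >     import subprocess
> >
> >     subprocess.run(["exiftool",
> >                     "-XMP-dc:Identifier=doi:10.1234/example",
> >                     "photo.jpg"], check=True)
> >     subprocess.run(["exiftool", "-XMP-dc:Identifier", "photo.jpg"],
> >                    check=True)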
> >
> > On Sat, Jan 24, 2015 at 7:06 PM, Joe Hourcle <
> > [log in to unmask]>
> > wrote:
> >
> > >
> > > The problem with 'metadata' in a lot of file formats is that it's
> > > stored in arbitrary segments -- you'd have to have a program that
> > > knew which segments were considered 'headers' vs. not. It might be
> > > easier to have it compute a separate checksum for each segment, so
> > > that should the modifications change their order, they'd still
> > > be considered valid.
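> > >
> > > As a rough illustration for JPEGs (an untested sketch; it only
> > > handles the simple marker layout before the image data, and the
> > > filename is a placeholder):
> > >
> > >     # Hash each marker segment before the image data separately,
> > >     # so edited or reordered metadata segments can be spotted
> > >     # without invalidating the checksums of the other segments.
> > >     import hashlib
> > >     import struct
> > >
> > >     data = open("photo.jpg", "rb").read()
> > >     pos = 2  # skip the SOI marker (FF D8)
> > >     while pos < len(data) and data[pos] == 0xFF:
> > >         marker = data[pos + 1]
> > >         if marker == 0xDA:  # SOS: entropy-coded data follows
> > >             break
> > >         # Segment length includes the two length bytes themselves.
> > >         length = struct.unpack(">H", data[pos + 2:pos + 4])[0]
> > >         segment = data[pos:pos + 2 + length]
> > >         print("FF%02X" % marker, hashlib.md5(segment).hexdigest())
> > >         pos += 2 + length
> > >
> > >     # Everything from SOS on is hashed as one "image data" chunk.
> > >     print("data", hashlib.md5(data[pos:]).hexdigest())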
> > >
> >
> > This is what I seemed to be bumping up against, so I was hoping there
> > was an easy workaround. But this is helpful information. Thanks,
> >
> > kyle
> >
>
--
Ronald Houk
Assistant Director
Ottumwa Public Library
102 W. Fourth Street
Ottumwa, IA 52501
(641)682-7563x203
[log in to unmask]