Also just stumbled across this on stackoverflow.
On Wed, Jan 28, 2015 at 10:32 AM, Ronald Houk <[log in to unmask]> wrote:
> I like Danielle's idea. I wonder if it wouldn't be a good idea to
> decouple the metadata from the data permanently. Exiftool allows you to
> export the metadata in lots of different formats like JSON. You could
> export the metadata into JSON, run the checksums and then store the photo
> and the JSON file in a single tar-ball. From there you could use a JSON
> editor to modify/add metadata.
> It would be simple to reintroduce the metadata into the file when needed.
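A rough sketch of Ronald's export-and-tarball idea, assuming exiftool is
on the PATH ("photo.jpg" and the output names are hypothetical):

#!/usr/bin/env python
# Export embedded metadata as JSON, checksum the photo, and bundle
# everything into a single tar-ball.
import hashlib
import subprocess
import tarfile

src = "photo.jpg"  # hypothetical filename

# Dump all embedded metadata as JSON (exiftool's -json option).
with open(src + ".json", "wb") as f:
    f.write(subprocess.check_output(["exiftool", "-json", src]))

# Record an md5 checksum of the photo alongside the metadata.
digest = hashlib.md5(open(src, "rb").read()).hexdigest()
with open(src + ".md5", "w") as f:
    f.write("%s  %s\n" % (digest, src))

# Store the photo, the JSON, and the checksum in one tar-ball.
with tarfile.open(src + ".tar", "w") as tar:
    for name in (src, src + ".json", src + ".md5"):
        tar.add(name)

# Reintroducing the metadata later is a one-liner:
#   exiftool -json=photo.jpg.json photo.jpg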
> On Mon, Jan 26, 2015 at 10:27 AM, danielle plumer <[log in to unmask]> wrote:
>> It's a bit of a hack, but you could write a script to delete all the
>> metadata from images with ExifTool and then run checksums on the resulting
>> image (see
>> exiv2 might also work. I don't think you'd want to do that every time you
>> audited the files, though; generating new checksums is a faster approach.
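Untested, but a strip-then-checksum pass along the lines Danielle
describes might look like this (exiftool on the PATH, filename
hypothetical):

import hashlib
import shutil
import subprocess
import tempfile

def payload_checksum(path):
    # Work on a temporary copy so the original file is left untouched.
    with tempfile.NamedTemporaryFile(suffix=".jpg") as tmp:
        shutil.copyfile(path, tmp.name)
        # "-all=" deletes every writable tag; -overwrite_original keeps
        # exiftool from leaving a _original backup file behind.
        subprocess.check_call(
            ["exiftool", "-all=", "-overwrite_original", tmp.name])
        with open(tmp.name, "rb") as f:
            return hashlib.md5(f.read()).hexdigest()

print(payload_checksum("photo.jpg"))  # hypothetical filename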
>> I haven't tried this, but I know that there's a program called ssdeep
>> developed for the digital forensics community that can do piecewise
>> hashing -- it hashes chunks of content and then compares the hashes for
>> the different chunks to find matches, in theory. It might be able to match
>> files with embedded metadata vs. files without; the use cases described on
>> the forensics wiki are finding altered (truncated) files and reuse of
>> code. http://www.forensicswiki.org/wiki/Ssdeep
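For anyone who wants to experiment, the Python binding for ssdeep (pip
install ssdeep) makes the comparison easy to try; both filenames here
are hypothetical:

import ssdeep

# Piecewise (fuzzy) hashes of an image with and without its metadata.
h1 = ssdeep.hash_from_file("image_with_metadata.jpg")
h2 = ssdeep.hash_from_file("image_stripped.jpg")

# compare() returns 0-100; a higher score means the two piecewise
# hashes share more chunks, i.e. the files are likely variants.
print(ssdeep.compare(h1, h2))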
>> Danielle Cunniff Plumer
>> On Sun, Jan 25, 2015 at 9:44 AM, Kyle Banerjee <[log in to unmask]> wrote:
>> > On Sat, Jan 24, 2015 at 11:07 AM, Rosalyn Metz <[log in to unmask]> wrote:
>> > >
>> > > - How is your content packaged?
>> > > - Are you talking about the SIPs or the AIPs or both?
>> > > - Is your content in an instance of Fedora, a unix file structure,
>> > > something else?
>> > > - Are you generating checksums on the whole package, parts of it, or both?
>> > >
>> > The quick answer to this is that this is a low tech operation. We're
>> > currently on regular filesystems where we are limited to feeding md5
>> > checksums into a list. I'm looking for a low tech way that makes it
>> > easy to keep track of resources across a variety of platforms and
>> > environments and which will easily adapt to future technology.
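A minimal sketch of the checksums-into-a-list approach, using only the
Python standard library (the "archive" directory is hypothetical):

import hashlib
import os

def md5_of(path):
    h = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(65536), b""):
            h.update(block)
    return h.hexdigest()

# Walk a tree and write an md5sum-compatible manifest: "<digest>  <path>".
with open("checksums.md5", "w") as out:
    for dirpath, _, names in os.walk("archive"):
        for name in names:
            p = os.path.join(dirpath, name)
            out.write("%s  %s\n" % (md5_of(p), p))

The list can then be re-verified later with md5sum -c checksums.md5.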
>> > For example, we have a bunch of stuff in Bepress and Omeka. Neither of
>> > those is good for preservation, so authoritative files live elsewhere,
>> > as do a huge number of resources that aren't in these platforms.
>> > Filenames are terrible identifiers, and things get moved around even
>> > if people don't mess with the files.
>> > We are also trying to come up with something that deals with different
>> > kinds of datasets (we're focusing on bioimaging at the moment) and fits
>> > the workflow of campus units, each of which needs to manage tens of
>> > thousands of files with very little metadata on regular filesystems.
>> > Some of the resources are enormous in terms of size or number of members.
>> > Simply embedding an identifier in the file is a really easy way to tell
>> > which files have metadata and which metadata is there. In the case at
>> > hand, I could just do that and generate new checksums. But I think the
>> > generic problem of making better use of embedded metadata is an
>> > interesting one because it can make objects more usable and
>> > understandable once they're removed from their original context.
>> > For example, just this past Friday I received a request to use an image
>> > someone downloaded for a book. Unfortunately, he just emailed me a copy
>> > of the image, described what he wanted to do, and asked for permission,
>> > but couldn't replicate how he found it. An identifier would have been
>> > handy, as would embedded rights info, since rights are not the same for
>> > all of our images. The reason we're using DOIs is that they work well
>> > for anything and can easily be recognized by syntax wherever they may
>> > appear.
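A sketch of the embedded-identifier approach. Kyle doesn't say which
tag he writes to; XMP-dc:Identifier is one plausible choice, the DOI
below is made up, and exiftool is assumed to be on the PATH:

import subprocess

src = "photo.jpg"            # hypothetical filename
doi = "doi:10.0000/example"  # hypothetical DOI

# Write the identifier into the file's XMP Dublin Core block.
subprocess.check_call(
    ["exiftool", "-overwrite_original", "-XMP-dc:Identifier=" + doi, src])

# Read it back; -s3 prints the bare value with no tag name.
print(subprocess.check_output(
    ["exiftool", "-s3", "-XMP-dc:Identifier", src]).decode().strip())

Because DOIs have a distinctive syntax, a regex along the lines of
10\.\d{4,}/\S+ can pull them back out of whatever field they land in.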
>> > On Sat, Jan 24, 2015 at 7:06 PM, Joe Hourcle <[log in to unmask]> wrote:
>> > >
>> > > The problem with 'metadata' in a lot of file formats is that it's
>> > > just arbitrary segments -- you'd have to have a program that knew
>> > > which segments were considered 'headers' vs. not. It might be easier
>> > > to have it compute a separate checksum for each segment, so that
>> > > should modifications change their order, the segments would still
>> > > be considered valid.
>> > >
>> > This is what I seemed to be bumping up against, so I was hoping there
>> > was an easy workaround. But this is helpful information. Thanks,
>> > kyle
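For the record, a sketch of Joe's per-segment idea for JPEG
specifically; other formats would need their own segment walkers, and
the filename is hypothetical:

import hashlib
import struct

def segment_checksums(path):
    # md5 of each JPEG marker segment, so metadata segments can be
    # reordered or edited without invalidating the image-data checksum.
    with open(path, "rb") as f:
        data = f.read()
    sums = []
    i = 2  # skip the SOI marker (FF D8)
    while i < len(data) - 1 and data[i] == 0xFF:
        marker = data[i + 1]
        if marker == 0xDA:  # SOS: the rest is entropy-coded image data
            sums.append(("scan", hashlib.md5(data[i:]).hexdigest()))
            break
        # Other segments carry a 2-byte big-endian length that counts
        # the length bytes themselves.
        (length,) = struct.unpack(">H", data[i + 2:i + 4])
        seg = data[i:i + 2 + length]
        sums.append(("FF%02X" % marker, hashlib.md5(seg).hexdigest()))
        i += 2 + length
    return sums

for marker, digest in segment_checksums("photo.jpg"):
    print(marker, digest)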
> Ronald Houk
> Assistant Director
> Ottumwa Public Library
> 102 W. Fourth Street
> Ottumwa, IA 52501
> [log in to unmask]