I also just stumbled across this on Stack Overflow:
http://stackoverflow.com/questions/12115824/compute-the-hash-of-only-the-core-image-data-of-a-tiff
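
For what it's worth, here is a rough sketch of that idea in Python. It is
not the code from the linked question, just my own illustration under some
assumptions: a classic (non-BigTIFF), strip-based TIFF, and only the first
image in the file. The name tiff_pixel_hash is made up. It hashes the strip
bytes as stored, so recompressing the image would still change the result,
but editing tags such as ImageDescription would not.

```python
import hashlib
import struct

# TIFF tag numbers that describe where the pixel strips live.
STRIP_OFFSETS = 273
STRIP_BYTE_COUNTS = 279
# Field types these two tags may use: 3 = SHORT, 4 = LONG.
TYPE_FORMATS = {3: "H", 4: "I"}


def _read_values(data, endian, field_type, count, value_pos):
    """Return the integer values of one IFD entry."""
    fmt = TYPE_FORMATS[field_type]
    size = struct.calcsize(endian + fmt) * count
    if size <= 4:
        start = value_pos  # values fit inline in the entry itself
    else:
        (start,) = struct.unpack_from(endian + "I", data, value_pos)
    return struct.unpack_from(endian + fmt * count, data, start)


def tiff_pixel_hash(data):
    """SHA-256 over the strips of the first image, ignoring all other tags."""
    endian = {b"II": "<", b"MM": ">"}[data[:2]]
    magic, ifd_pos = struct.unpack_from(endian + "HI", data, 2)
    if magic != 42:
        raise ValueError("not a classic TIFF file")
    (entry_count,) = struct.unpack_from(endian + "H", data, ifd_pos)
    tags = {}
    for i in range(entry_count):
        pos = ifd_pos + 2 + 12 * i  # each IFD entry is 12 bytes
        tag, field_type, count = struct.unpack_from(endian + "HHI", data, pos)
        if tag in (STRIP_OFFSETS, STRIP_BYTE_COUNTS):
            tags[tag] = _read_values(data, endian, field_type, count, pos + 8)
    digest = hashlib.sha256()
    for offset, length in zip(tags[STRIP_OFFSETS], tags[STRIP_BYTE_COUNTS]):
        digest.update(data[offset:offset + length])
    return digest.hexdigest()
```

You would call it with the raw bytes of a file, e.g.
tiff_pixel_hash(open("scan.tif", "rb").read()), where scan.tif is a
placeholder name. Tiled TIFFs and multi-page files would need more work.
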
On Wed, Jan 28, 2015 at 10:32 AM, Ronald Houk <[log in to unmask]> wrote:
> Hello,
>
> I like Danielle's idea. I wonder if it wouldn't be a good idea to
> decouple the metadata from the data permanently. ExifTool allows you to
> export the metadata in lots of different formats, such as JSON. You could
> export the metadata into JSON, run the checksums, and then store the photo
> and the JSON file in a single tarball. From there you could use a JSON
> editor to modify/add metadata.
>
> It would be simple to reintroduce the metadata into the file when needed.
>
> On Mon, Jan 26, 2015 at 10:27 AM, danielle plumer <[log in to unmask]>
> wrote:
>
>> Kyle,
>>
>> It's a bit of a hack, but you could write a script to delete all the
>> metadata from images with ExifTool and then run checksums on the resulting
>> image (see
>> http://u88.n24.queensu.ca/exiftool/forum/index.php?topic=4902.0).
>> exiv2 might also work. I don't think you'd want to do that every time you
>> audited the files, though; generating new checksums is a faster approach.
>>
>> I haven't tried this, but I know that there's a program called ssdeep,
>> developed for the digital forensics community, that can do piecewise
>> hashing -- it hashes chunks of content and then compares the hashes for
>> the different chunks to find matches, in theory. It might be able to match
>> files with embedded metadata vs. files without; the use cases described on
>> the forensics wiki are finding altered (truncated) files and reuse of
>> source code. http://www.forensicswiki.org/wiki/Ssdeep
>>
>> Danielle Cunniff Plumer
>>
>> On Sun, Jan 25, 2015 at 9:44 AM, Kyle Banerjee <[log in to unmask]>
>> wrote:
>>
>> > On Sat, Jan 24, 2015 at 11:07 AM, Rosalyn Metz <[log in to unmask]>
>> > wrote:
>> >
>> > >
>> > > - How is your content packaged?
>> > > - Are you talking about the SIPs or the AIPs or both?
>> > > - Is your content in an instance of Fedora, a unix file structure, or
>> > >   something else?
>> > > - Are you generating checksums on the whole package, parts of it, both?
>> > >
>> >
>> > The quick answer to this is that this is a low tech operation. We're
>> > currently on regular filesystems where we are limited to feeding md5
>> > checksums into a list. I'm looking for a low tech way that makes it
>> > easier to keep track of resources across a variety of platforms in a
>> > decentralized environment and which will easily adapt to future
>> > technology transitions. For example, we have a bunch of stuff in Bepress
>> > and Omeka. Neither of those is good for preservation, so authoritative
>> > files live elsewhere, as do a huge number of resources that aren't in
>> > these platforms. Filenames are terrible identifiers and things get moved
>> > around even if people don't mess with the files.
>> >
>> > We also are trying to come up with something that deals with different
>> > kinds of datasets (we're focusing on bioimaging at the moment) and fits
>> > in the workflow of campus units, each of which needs to manage tens of
>> > thousands of files with very little metadata on regular filesystems.
>> > Some of the resources are enormous in terms of size or number of members.
>> >
>> > Simply embedding an identifier in the file is a really easy way to tell
>> > which files have metadata and which metadata is there. In the case at
>> > hand, I could just do that and generate new checksums. But I think the
>> > generic problem of making better use of embedded metadata is an
>> > interesting one, as it can make objects more usable and understandable
>> > once they're removed from their original context. For example, just this
>> > past Friday I received a request to use an image someone downloaded for
>> > a book. Unfortunately, he just emailed me a copy of the image, described
>> > what he wanted to do, and asked for permission, but he couldn't
>> > replicate how he found it. An identifier would have been handy, as would
>> > embedded rights info, since this is not the same for all of our images.
>> > The reason we're using DOIs is that they work well for anything and can
>> > easily be recognized by syntax wherever they may appear.
>> >
>> > On Sat, Jan 24, 2015 at 7:06 PM, Joe Hourcle <[log in to unmask]>
>> > wrote:
>> >
>> > >
>> > > The problem with 'metadata' in a lot of file formats is that it's
>> > > just stored in arbitrary segments -- you'd have to have a program
>> > > that knew which segments were considered 'headers' vs. not. It might
>> > > be easier to have it compute a separate checksum for each segment,
>> > > so that should the modifications change their order, they'd still
>> > > be considered valid.
>> > >
>> >
>> > This is what I seemed to be bumping up against, so I was hoping there
>> > was an easy workaround. But this is helpful information. Thanks,
>> >
>> > kyle
>> >
>>
>
>
>
> --
> Ronald Houk
> Assistant Director
> Ottumwa Public Library
> 102 W. Fourth Street
> Ottumwa, IA 52501
> (641)682-7563x203
> [log in to unmask]
>
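
On Joe's point in the quoted thread about a separate checksum for each
segment: below is a minimal sketch of what that could look like. I used PNG
only because its chunk layout is easy to walk with the Python standard
library; the function names and the list of chunk types treated as metadata
are my own assumptions, not anything from the thread. The chunk CRCs are
skipped, not verified.

```python
import hashlib
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
# Chunk types treated as metadata here (an assumption; adjust to taste).
METADATA_TYPES = ("tEXt", "zTXt", "iTXt", "eXIf", "tIME")


def png_chunk_digests(data):
    """Return (chunk_type, sha256_hex) pairs for every chunk, in file order."""
    if data[:8] != PNG_SIGNATURE:
        raise ValueError("not a PNG file")
    digests = []
    pos = 8
    while pos < len(data):
        # Each chunk is: 4-byte length, 4-byte type, body, 4-byte CRC.
        length, kind = struct.unpack_from(">I4s", data, pos)
        body = data[pos + 8:pos + 8 + length]
        digests.append((kind.decode("ascii"), hashlib.sha256(body).hexdigest()))
        pos += 12 + length
    return digests


def same_image_segments(a, b, metadata_types=METADATA_TYPES):
    """True if two files match once the metadata chunks are set aside."""
    def kept(data):
        return [pair for pair in png_chunk_digests(data)
                if pair[0] not in metadata_types]
    return kept(a) == kept(b)
```

Because the metadata chunks are dropped before comparing, it does not matter
whether an editor rewrote them or moved them elsewhere in the file; the
remaining chunks still have to match in order. Storing the full list from
png_chunk_digests would also show which segment changed during an audit.
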
--
Ronald Houk
Assistant Director
Ottumwa Public Library
102 W. Fourth Street
Ottumwa, IA 52501
(641)682-7563x203
[log in to unmask]