On Sat, Jan 24, 2015 at 11:07 AM, Rosalyn Metz wrote:
>
> - How is your content packaged?
> - Are you talking about the SIPs or the AIPs or both?
> - Is your content in an instance of Fedora, a unix file structure, or
> something else?
> - Are you generating checksums on the whole package, parts of it, both?
>
The quick answer is that this is a low-tech operation. We're
currently on regular filesystems where we are limited to feeding md5
checksums into a list. I'm looking for a low-tech way that makes it easier
to keep track of resources across a variety of platforms in a decentralized
environment and that will easily adapt to future technology transitions.
For example, we have a bunch of stuff in Bepress and Omeka. Neither of
those is good for preservation, so authoritative files live elsewhere as do
a huge number of resources that aren't in these platforms. Filenames are
terrible identifiers and things get moved around even if people don't mess
with the files.
We also are trying to come up with something that deals with different
kinds of datasets (we're focusing on bioimaging at the moment) and fits in
the workflow of campus units, each of which needs to manage tens of
thousands of files with very little metadata on regular filesystems. Some
of the resources are enormous in terms of size or number of members.
Simply embedding an identifier in the file is a really easy way to tell
which files have metadata and which metadata is there. In the case at hand,
I could just do that and generate new checksums. But I think the generic
problem of making better use of embedded metadata is an interesting one, as
it can make objects more usable and understandable once they're removed
from their original context.
For example, just this past Friday I received a request to use an image
someone downloaded for a book. Unfortunately, he just emailed me a copy of
the image, described what he wanted to do, and asked for permission, but he
couldn't replicate how he found it. An identifier would have been handy, as
would embedded rights info, since rights are not the same for all of our
images. The reason we're using DOIs is that they work well for anything
and can easily be recognized by syntax wherever they may appear.
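
As a rough illustration of "recognized by syntax", here is a minimal Python
sketch that pulls DOI-like strings out of arbitrary text, such as an
embedded metadata field or the body of an email. The pattern is a common
heuristic ("10.", a 4 to 9 digit prefix, a slash, then a suffix), not a full
validator, and the function name and the sample DOIs are made up for the
example.

```python
import re

# Heuristic only: "10." + 4-9 digit registrant prefix + "/" + suffix.
# Real DOI suffixes can contain other characters; this is an assumption
# that covers the common cases.
DOI_PATTERN = re.compile(r"\b10\.\d{4,9}/[-._;()/:A-Za-z0-9]+")

def find_dois(text):
    """Return DOI-like strings found in text (hypothetical helper)."""
    # Trailing '.', ',' or ';' is usually sentence punctuation, so drop it.
    return [m.group(0).rstrip(".,;") for m in DOI_PATTERN.finditer(text)]

# Sample DOIs below are invented for illustration.
sample = "See doi:10.1234/abc.def-5, and https://doi.org/10.5555/12345678."
print(find_dois(sample))
```

The same function works whether the DOI shows up bare, with a "doi:"
prefix, or inside a resolver URL, which is the property that makes it handy
as an embedded identifier.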
On Sat, Jan 24, 2015 at 7:06 PM, Joe Hourcle wrote:
>
> The problem with 'metadata' in a lot of file formats is that they're
> just arbitrary segments -- you'd have to have a program that knew
> which segments were considered 'headers' vs. not. It might be easier
> to have it be able to compute a separate checksum for each segment,
> so that should the modifications change their order, they'd still
> be considered valid.
>
This is what I seemed to be bumping up against so I was hoping there was an
easy workaround. But this is helpful information. Thanks,
kyle
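
For what it's worth, the per-segment idea Joe describes might look
something like the Python sketch below. It assumes some format-specific
parser (not shown) has already split the file into (name, bytes) segments;
the function name, the segment names, and the choice to skip a "meta"
segment are all hypothetical. MD5 is used only because that is what we have
now; hashlib.sha256 would drop in the same way.

```python
import hashlib

def content_fingerprint(segments, skip=()):
    """Order-independent checksum over the segments we care about.

    segments: iterable of (name, data) pairs from a format-specific
              parser (assumed, not shown here).
    skip:     names of segments to ignore, e.g. embedded metadata.
    """
    # One checksum per kept segment, sorted so segment order is irrelevant.
    digests = sorted(
        hashlib.md5(data).hexdigest()
        for name, data in segments
        if name not in skip
    )
    # Hash the sorted list of digests to get a single value for the file.
    return hashlib.md5("\n".join(digests).encode("ascii")).hexdigest()

# Same content, different segment order and different embedded metadata.
before = [("meta", b"id=1"), ("pixels", b"\x00\x01"), ("palette", b"\x02")]
after = [("palette", b"\x02"), ("pixels", b"\x00\x01"), ("meta", b"id=2")]

print(content_fingerprint(before, skip=("meta",)) ==
      content_fingerprint(after, skip=("meta",)))
```

With the metadata segment skipped, embedding or changing an identifier
leaves the fingerprint alone, while any change to the remaining segments
still shows up. The hard part is still the one Joe points out: knowing
which segments count as headers for each format.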