Bit integrity is crucial for libraries and archives, especially government
archives. Authenticity is a key concept for born-digital archives: we need
to be able to say definitively that a file has not changed since it was
received from the donor or organizational unit, for reasons of
accountability and transparency. The authenticity trail is needed as
evidence in court and is, in some cases, mandated by the government. And of
course fixity checking also helps detect bit corruption, another important
part of digital preservation.
Regarding the point that if someone has access to the file, they also have
access to the checksum: that's not the whole picture. Best practices for
digital preservation recommend keeping copies in multiple places, such as a
dark archive, and systematically running checksums on all the copies and
comparing them. Someone might be able to gain access to one system, but it
is much less likely that they'll gain access to all of them. So if there's
a fixity change in one place and not the others, it is flagged for
investigation and comparison.
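That comparison step can be sketched in a few lines. This is a minimal
illustration, not any particular repository's tooling; the use of SHA-256
and the idea of taking the majority digest as the reference value are my
assumptions here.

```python
import hashlib


def sha256_of(path):
    """Stream the file in chunks so large objects don't load into memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def check_replicas(paths):
    """Hash every replica of one object; return the majority digest and
    the replicas that disagree with it (candidates for investigation)."""
    digests = {p: sha256_of(p) for p in paths}
    counts = {}
    for d in digests.values():
        counts[d] = counts.get(d, 0) + 1
    majority = max(counts, key=counts.get)
    suspects = [p for p, d in digests.items() if d != majority]
    return majority, suspects
```

An attacker (or bit rot) would have to alter every replica identically to
escape this check, which is the point of keeping copies on independent
systems.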
Nathan
On Fri, Oct 3, 2014 at 8:37 AM, Kyle Banerjee <[log in to unmask]>
wrote:
> On Thu, Oct 2, 2014 at 3:47 PM, Simon Spero <[log in to unmask]> wrote:
>
> > Checksums can be kept separate (tripwire style).
> > For JHU archiving, the use of MD5 would give false positives for
> duplicate
> > detection.
> >
> > There is no reason to use a bad cryptographic hash. Use a fast hash, or
> use
> > a safe hash.
> >
>
> I have always been puzzled why so much energy is expended on bit integrity
> in the library and archival communities. Hashing does not accommodate
> modification of internal metadata or compression which do not compromise
> integrity. And if people who can access the files can also access the
> hashes, there is no contribution to security. Also, wholesale hashing of
> repositories scales poorly. My guess is that the biggest threats are staff
> error or rogue processes (i.e. bad programming). Any malicious
> destruction/modification is likely to be an inside job.
>
> In reality, using file size alone is probably sufficient for detecting
> changed files -- if dup detection is desired, then hashing the few that dup
> out can be performed. Though if dups are an actual issue, it reflects
> problems elsewhere. Thrashing disks and cooking the CPU for the purposes
> libraries use hashes for seems way overkill, especially given that basic
> interaction with repositories for depositors, maintainers, and users is
> still in a very primitive state.
>
> kyle
>
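Kyle's size-first approach (hash only the files whose sizes collide) can be
sketched as follows. This is a rough illustration under my own assumptions;
the file list is hypothetical and SHA-256 stands in for whatever "safe
hash" one prefers.

```python
import hashlib
import os
from collections import defaultdict


def find_duplicates(paths):
    """Group candidate duplicates by file size, then confirm with a hash.
    Files with a unique size are never hashed at all."""
    by_size = defaultdict(list)
    for p in paths:
        by_size[os.path.getsize(p)].append(p)

    by_digest = defaultdict(list)
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue  # unique size: cannot be a duplicate, skip hashing
        for p in same_size:
            with open(p, "rb") as f:
                by_digest[hashlib.sha256(f.read()).hexdigest()].append(p)

    return [group for group in by_digest.values() if len(group) > 1]
```

Since most files in a repository have unique sizes, this does the cheap
size check everywhere and pays the disk and CPU cost of hashing only for
the small set of size collisions.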