On Thu, Oct 2, 2014 at 3:47 PM, Simon Spero <[log in to unmask]> wrote:
> Checksums can be kept separate (tripwire style).
> For JHU archiving, the use of MD5 would give false positives for duplicate
> detection.
>
> There is no reason to use a bad cryptographic hash. Use a fast hash, or use
> a safe hash.
>
I have always been puzzled why so much energy is expended on bit integrity
in the library and archival communities. Hashing does not accommodate
modification of internal metadata or compression, neither of which
compromises integrity. And if people who can access the files can also
access the hashes, there is no contribution to security. Also, wholesale
hashing of repositories scales poorly. My guess is that the biggest threats
are staff error or rogue processes (i.e. bad programming). Any malicious
destruction/modification is likely to be an inside job.
In reality, using file size alone is probably sufficient for detecting
changed files -- if dup detection is desired, then only the few files whose
sizes collide need to be hashed. Though if dups are an actual issue, it
reflects problems elsewhere. Thrashing disks and cooking the CPU for the
purposes libraries use hashes for seems like overkill, especially given that
basic interaction with repositories for depositors, maintainers, and users
is still in a very primitive state.
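
A rough sketch of the size-first approach, in case it helps: group files by
size, and hash only the ones whose sizes collide. The names `file_digest` and
`find_duplicates` are made up for illustration, and SHA-256 is just one
possible choice of safe hash, not a recommendation from anyone's actual
tooling.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicates(root):
    """Group files under root by size, then hash only the size collisions.

    Returns a list of groups, each a sorted list of paths with identical
    size and identical digest.
    """
    by_size = defaultdict(list)
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            # Skip symlinks and anything that is not a regular file.
            if os.path.isfile(path) and not os.path.islink(path):
                by_size[os.path.getsize(path)].append(path)

    duplicates = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # unique size: cannot be a duplicate, so never hashed
        by_hash = defaultdict(list)
        for path in paths:
            by_hash[file_digest(path)].append(path)
        duplicates.extend(sorted(g) for g in by_hash.values() if len(g) > 1)
    return duplicates
```

Calling `find_duplicates("/some/repository/root")` (a placeholder path) reads
file contents only for files that share a size with at least one other file;
everything else costs a single stat call.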
kyle