On Thu, Oct 2, 2014 at 3:47 PM, Simon Spero <[log in to unmask]> wrote:

> Checksums can be kept separate (tripwire style). For JHU archiving, the
> use of MD5 would give false positives for duplicate detection.
>
> There is no reason to use a bad cryptographic hash. Use a fast hash, or
> use a safe hash.

I have always been puzzled by how much energy is expended on bit integrity in the library and archival communities. Hashing does not accommodate modification of internal metadata or compression, neither of which compromises integrity. And if people who can access the files can also access the hashes, hashing contributes nothing to security. Wholesale hashing of repositories also scales poorly.

My guess is that the biggest threats are staff error or rogue processes (i.e., bad programming). Any malicious destruction or modification is likely to be an inside job.

In reality, using file size alone is probably sufficient for detecting changed files -- if dup detection is desired, then hashing the few files that collide on size can be performed. Though if dups are an actual issue, that reflects problems elsewhere.

Thrashing disks and cooking the CPU for the purposes libraries use hashes for seems like overkill, especially given that basic interaction with repositories for depositors, maintainers, and users is still in a very primitive state.

kyle
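The size-first approach mentioned above can be sketched roughly as follows. This is only an illustration of the idea, not anything from the thread; the function name and the choice of SHA-256 for the second pass are my own assumptions:

```python
import hashlib
import os
from collections import defaultdict

def find_duplicates(paths):
    """Detect duplicate files cheaply: bucket by size first,
    then hash only the files whose sizes collide."""
    by_size = defaultdict(list)
    for p in paths:
        by_size[os.path.getsize(p)].append(p)

    by_digest = defaultdict(list)
    for size, group in by_size.items():
        if len(group) < 2:
            continue  # unique size -> cannot be a duplicate; no hashing needed
        for p in group:
            with open(p, "rb") as f:
                digest = hashlib.sha256(f.read()).hexdigest()
            by_digest[digest].append(p)

    # Keep only digests shared by two or more files
    return {d: ps for d, ps in by_digest.items() if len(ps) > 1}
```

With this, files of unique size are never read at all, so the disk and CPU cost is proportional to the (presumably small) set of size collisions rather than the whole repository.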