Look at slide 15 here: http://www.slideshare.net/DuraSpace/sds-cwebinar-1

I think we're worried about the cumulative effect over time of undetected
errors (at least, I am).

On Fri, Oct 03, 2014 at 05:37:14AM -0700, Kyle Banerjee wrote:
> On Thu, Oct 2, 2014 at 3:47 PM, Simon Spero <[log in to unmask]> wrote:
>
> > Checksums can be kept separate (tripwire style).
> > For JHU archiving, the use of MD5 would give false positives for duplicate
> > detection.
> >
> > There is no reason to use a bad cryptographic hash. Use a fast hash, or use
> > a safe hash.
>
> I have always been puzzled why so much energy is expended on bit integrity
> in the library and archival communities. Hashing does not accommodate
> modification of internal metadata or compression, which do not compromise
> integrity. And if people who can access the files can also access the
> hashes, there is no contribution to security. Also, wholesale hashing of
> repositories scales poorly. My guess is that the biggest threats are staff
> error or rogue processes (i.e. bad programming). Any malicious
> destruction/modification is likely to be an inside job.
>
> In reality, using file size alone is probably sufficient for detecting
> changed files -- if dup detection is desired, then hashing the few that dup
> out can be performed. Though if dups are an actual issue, it reflects
> problems elsewhere. Thrashing disks and cooking the CPU for the purposes
> libraries use hashes for seems like overkill, especially given that basic
> interaction with repositories for depositors, maintainers, and users is
> still in a very primitive state.
>
> kyle

--
Charles Blair, Director, Digital Library Development Center,
University of Chicago Library
1 773 702 8459 | [log in to unmask] | http://www.lib.uchicago.edu/~chas/
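
(For anyone who wants to see what Kyle's size-first approach looks like in practice, here is a minimal sketch, not taken from any repository software mentioned in the thread. It assumes a plain directory tree on local disk; files are grouped by size, and only the few whose sizes collide get hashed. hashlib.sha256 stands in for the "safe hash" and could be swapped for a faster non-cryptographic hash if collision resistance is not the concern. The function names are illustrative only.)

    import hashlib
    from collections import defaultdict
    from pathlib import Path

    def _sha256(path, chunk=1 << 20):
        """Hash a file in chunks so large objects never sit fully in memory."""
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for block in iter(lambda: f.read(chunk), b""):
                h.update(block)
        return h.hexdigest()

    def find_candidate_dups(root):
        """Group files under `root` by size; hash only the size collisions."""
        by_size = defaultdict(list)
        for p in Path(root).rglob("*"):
            if p.is_file():
                by_size[p.stat().st_size].append(p)

        by_digest = defaultdict(list)
        for size, paths in by_size.items():
            if len(paths) < 2:
                continue  # unique size: no hashing needed, per Kyle's point
            for p in paths:
                by_digest[_sha256(p)].append(p)

        # Only groups with two or more identical digests are real duplicates.
        return {d: ps for d, ps in by_digest.items() if len(ps) > 1}

The same size inventory doubles as a cheap change check: if a stored file's size no longer matches the recorded one, it has certainly changed, and only then is a hash comparison worth the disk and CPU time.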