Look at slide 15 here:
http://www.slideshare.net/DuraSpace/sds-cwebinar-1
I think we're worried about the cumulative effect over time of
undetected errors (at least, I am).
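
To make "cumulative effect" concrete: if each file independently has some
small chance p of silent corruption per year, the chance that at least one
of N files is damaged after Y years is 1 - (1 - p)^(N*Y). The numbers below
are invented for illustration (they are not from the slide), but the shape
of the curve is the point:

# Back-of-the-envelope sketch; the rate and file count are hypothetical.
def prob_any_corruption(n_files, years, p_per_file_year):
    """P(at least one silent error) = 1 - (1 - p)^(n_files * years)."""
    return 1 - (1 - p_per_file_year) ** (n_files * years)

# e.g. 1,000,000 files, a 1-in-10-million chance per file per year:
for years in (1, 5, 10, 20):
    print(years, round(prob_any_corruption(1_000_000, years, 1e-7), 3))
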
On Fri, Oct 03, 2014 at 05:37:14AM -0700, Kyle Banerjee wrote:
> On Thu, Oct 2, 2014 at 3:47 PM, Simon Spero <[log in to unmask]> wrote:
>
> > Checksums can be kept separate (tripwire style).
> > For JHU archiving, the use of MD5 would give false positives for duplicate
> > detection.
> >
> > There is no reason to use a bad cryptographic hash. Use a fast hash, or use
> > a safe hash.
> >
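
As a rough sketch of the "keep checksums separate, tripwire style" idea:
walk the repository, checksum each file with something that is both fast and
safe (BLAKE2 here, rather than MD5), and write the manifest to storage the
repository processes cannot touch. The paths and manifest format below are
made up for the example.

import hashlib, os, sys

def checksum(path, algo="blake2b", bufsize=1 << 20):
    """Stream the file through hashlib so large files aren't read into memory."""
    h = hashlib.new(algo)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(bufsize), b""):
            h.update(chunk)
    return h.hexdigest()

def write_manifest(root, manifest_path):
    """Record 'digest  relative-path' lines; keep the manifest on separate storage."""
    with open(manifest_path, "w") as out:
        for dirpath, _, files in os.walk(root):
            for name in files:
                full = os.path.join(dirpath, name)
                out.write("{}  {}\n".format(checksum(full), os.path.relpath(full, root)))

if __name__ == "__main__":
    write_manifest(sys.argv[1], sys.argv[2])  # e.g. ./repository /mnt/offsite/manifest.txt
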
>
> I have always been puzzled why so much energy is expended on bit integrity
> in the library and archival communities. Hashing does not accommodate
> modification of internal metadata or compression, neither of which
> compromises integrity. And if people who can access the files can also
> access the hashes, there is no contribution to security. Also, wholesale
> hashing of repositories scales poorly. My guess is that the biggest threats
> are staff error or rogue processes (i.e., bad programming). Any malicious
> destruction/modification is likely to be an inside job.
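
One standard answer to "people who can access the files can also access the
hashes" is a keyed hash (HMAC) whose key is stored somewhere the repository
processes cannot read, so rewriting a file and its stored digest still does
not produce a matching tag. Nobody in this thread proposed that; this is
just a minimal sketch of the technique, with a hypothetical key path:

import hashlib, hmac

def file_hmac(path, key, bufsize=1 << 20):
    """Keyed digest; cannot be recomputed by someone who lacks the key."""
    mac = hmac.new(key, digestmod=hashlib.sha256)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(bufsize), b""):
            mac.update(chunk)
    return mac.hexdigest()

# key comes from a secret store separate from the repository, e.g.
# key = open("/secure/keys/fixity.key", "rb").read()   # hypothetical path
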
>
> In reality, using file size alone is probably sufficient for detecting
> changed files -- if dup detection is desired, then hashing the few files
> whose sizes collide can be performed. Though if dups are an actual issue,
> that reflects problems elsewhere. Thrashing disks and cooking the CPU for
> the purposes libraries use hashes for seems like overkill, especially given
> that basic interaction with repositories for depositors, maintainers, and
> users is still in a very primitive state.
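
The size-first approach reads roughly like this: group files by size, then
hash only the groups where sizes collide. A toy sketch, not drawn from any
particular repository:

import hashlib, os
from collections import defaultdict

def sha256(path, bufsize=1 << 20):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(bufsize), b""):
            h.update(chunk)
    return h.hexdigest()

def probable_duplicates(root):
    """Group by size first; hash only files whose sizes collide."""
    by_size = defaultdict(list)
    for dirpath, _, files in os.walk(root):
        for name in files:
            full = os.path.join(dirpath, name)
            by_size[os.path.getsize(full)].append(full)
    by_digest = defaultdict(list)
    for paths in by_size.values():
        if len(paths) > 1:                 # only these need the expensive hash
            for p in paths:
                by_digest[sha256(p)].append(p)
    return [group for group in by_digest.values() if len(group) > 1]
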
>
> kyle
>
--
Charles Blair, Director, Digital Library Development Center, University of Chicago Library
1 773 702 8459 | [log in to unmask] | http://www.lib.uchicago.edu/~chas/