Simon - do you have any examples of MD5 collisions in JHU's collections? The
chance of that occurring is vanishingly small
(http://prezi.com/zfyebvaelksh/fixity-20/) so I'm curious what produced the
collision, and how often.

On Fri, Oct 3, 2014 at 12:14 PM, Kyle Banerjee wrote:

> On Fri, Oct 3, 2014 at 7:26 AM, Charles Blair wrote:
>
> > Look at slide 15 here:
> > http://www.slideshare.net/DuraSpace/sds-cwebinar-1
> >
> > I think we're worried about the cumulative effect over time of
> > undetected errors (at least, I am).
>
> This slide shows that data loss via drive fault is extremely rare. Note
> that a bit getting flipped is usually harmless. However, I do believe that
> data corruption via other avenues will be considerably more common.
>
> My point is that the use case for libraries is generally weak and the
> solution is very expensive -- don't forget the authenticity checks must
> also be done on the "good" files. As you start dealing with more and more
> data, this system is not sustainable for the simple reason that maintained
> disk space costs a fortune and network capacity is a bottleneck. It's no
> big deal to do this on a few TB since our repositories don't have to worry
> about the integrity of dynamic data, but you eventually get to a point
> where it sucks up too many systems resources and consumes too much
> expertise.
>
> Authoritative files really should be offline but if online access to
> authoritative files is seen as an imperative, it at least makes more sense
> to just do something like dump it all in Glacier and slowly refresh
> everything you own with authoritative copy. Or better yet, just leave the
> stuff there and just make new derivatives when there is any reason to
> believe the existing ones are not good.
>
> While I think integrity is an issue, I think other deficiencies in
> repositories are more pressing. Except for the simplest use cases, getting
> stuff in or out of them is a hopeless process even with automated
> assistance. Metadata and maintenance aren't very good either. That you
> still need coding skills to get popular platforms that have been in use for
> many years to ingest and serve up things as simple as documents and images
> speaks volumes.
>
> kyle
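
For a rough sense of how small "vanishingly small" is for accidental MD5 collisions, here is a minimal sketch using the standard birthday-bound approximation. It assumes MD5 outputs behave like uniformly random 128-bit values, so it covers accidental collisions only, not deliberately constructed ones. The function name and the collection sizes are hypothetical, chosen purely for illustration.

```python
import math


def accidental_collision_probability(num_files: int, hash_bits: int = 128) -> float:
    """Approximate chance that at least two of num_files distinct files
    share a digest, assuming digests are uniformly random (an assumption).

    Uses the birthday-bound approximation p ~= 1 - exp(-n(n-1) / (2 * 2^bits)).
    """
    if num_files < 2:
        return 0.0
    # Number of distinct pairs of files that could collide.
    pairs = num_files * (num_files - 1) / 2
    # expm1 keeps precision when the probability is extremely small.
    return -math.expm1(-pairs / 2.0 ** hash_bits)


if __name__ == "__main__":
    # Hypothetical collection sizes, not figures from any real repository.
    for n in (10**6, 10**9, 10**12):
        print(f"{n:>16,d} files: p ~= {accidental_collision_probability(n):.3e}")
```

Under these assumptions, even a billion files gives a probability on the order of 1e-21, so an observed collision more plausibly points to duplicate content or a process error than to chance.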