On Fri, Oct 3, 2014 at 7:26 AM, Charles Blair <[log in to unmask]> wrote:
> Look at slide 15 here:
> http://www.slideshare.net/DuraSpace/sds-cwebinar-1
>
> I think we're worried about the cumulative effect over time of
> undetected errors (at least, I am).

This slide shows that data loss via drive fault is extremely rare. Note that a bit getting flipped is usually harmless. However, I do believe that data corruption via other avenues will be considerably more common.

My point is that the use case for libraries is generally weak and the solution is very expensive -- don't forget the authenticity checks must also be done on the "good" files. As you start dealing with more and more data, this system is not sustainable, for the simple reason that maintained disk space costs a fortune and network capacity is a bottleneck. It's no big deal to do this on a few TB, since our repositories don't have to worry about the integrity of dynamic data, but you eventually get to a point where it sucks up too many system resources and consumes too much expertise.

Authoritative files really should be offline, but if online access to authoritative files is seen as an imperative, it at least makes more sense to do something like dump it all in Glacier and slowly refresh everything you own from the authoritative copy. Or better yet, just leave the stuff there and make new derivatives when there is any reason to believe the existing ones are not good.

While I think integrity is an issue, I think other deficiencies in repositories are more pressing. Except for the simplest use cases, getting stuff in or out of them is a hopeless process even with automated assistance. Metadata and maintenance aren't very good either. That you still need coding skills to get popular platforms that have been in use for many years to ingest and serve up things as simple as documents and images speaks volumes.

kyle
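For readers unfamiliar with what a fixity audit actually involves, a minimal sketch is below. It shows why the cost scales with collection size: every audit pass must re-read and re-hash every byte of every file, including the "good" ones, just to confirm nothing changed. The function names and the manifest format (a dict of path to expected SHA-256 digest) are my own illustration, not any particular repository platform's API.

```python
import hashlib

def file_sha256(path, chunk_size=1 << 20):
    """Compute the SHA-256 digest of a file, reading in 1 MiB chunks
    so large files don't have to fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def audit(manifest):
    """Re-hash every file in the manifest and return the paths whose
    current digest no longer matches the stored one. Note that the
    unchanged files cost exactly as much I/O as the corrupted ones."""
    return [path for path, expected in manifest.items()
            if file_sha256(path) != expected]
```

Every full pass is a complete read of the collection, which is why the disk and network bills grow linearly with the data under management.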