On Fri, Oct 3, 2014 at 7:26 AM, Charles Blair <[log in to unmask]> wrote:
> Look at slide 15 here:
> http://www.slideshare.net/DuraSpace/sds-cwebinar-1
>
> I think we're worried about the cumulative effect over time of
> undetected errors (at least, I am).
This slide shows that data loss via drive fault is extremely rare. Note
that a bit getting flipped is usually harmless. However, I do believe that
data corruption via other avenues will be considerably more common.
My point is that the use case for libraries is generally weak and the
solution is very expensive -- don't forget that the authenticity checks
must also be run on the "good" files. As you deal with more and more
data, the approach stops being sustainable for the simple reason that
maintained disk space costs a fortune and network capacity is a
bottleneck. It's no big deal to do this on a few TB, since our
repositories don't have to worry about the integrity of dynamic data, but
you eventually reach a point where it sucks up too many system resources
and consumes too much expertise.
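To make the cost concrete, a periodic fixity audit amounts to something
like the sketch below (Python; the manifest layout and file paths are made
up for illustration). Every file, including the "good" ones, gets read end
to end and re-hashed on every pass, which is where the disk and network
time goes:

import csv
import hashlib

def sha256_of(path, chunk_size=1 << 20):
    # Stream the file so large objects don't have to fit in memory.
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def audit(manifest_csv):
    # Hypothetical manifest rows: path,expected_sha256
    with open(manifest_csv, newline="") as f:
        for path, expected in csv.reader(f):
            actual = sha256_of(path)
            if actual != expected:
                print("MISMATCH %s: expected %s, got %s" % (path, expected, actual))

audit("manifest.csv")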
Authoritative files really should be offline, but if online access to them
is seen as an imperative, it at least makes more sense to do something
like dump it all in Glacier and slowly refresh everything you own from the
authoritative copies. Or better yet, leave the stuff there and only make
new derivatives when there is a reason to believe the existing ones are
not good.
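For what it's worth, the write side of the Glacier approach is not much
more than the sketch below (boto3; the vault name and file list are
hypothetical, and the slow refresh/retrieval scheduling is the part you
would actually have to design):

import boto3

def archive_files(paths, vault="preservation-masters"):
    client = boto3.client("glacier")
    archive_ids = {}
    for path in paths:
        with open(path, "rb") as f:
            resp = client.upload_archive(
                vaultName=vault,
                archiveDescription=path,
                body=f,
            )
        # Vault inventories are slow to retrieve, so record the archiveId
        # yourself if you ever want the object back.
        archive_ids[path] = resp["archiveId"]
    return archive_ids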
While integrity is an issue, I think other deficiencies in repositories
are more pressing. Except for the simplest use cases, getting stuff in or
out of them is a hopeless process even with automated assistance. Metadata
support and maintenance aren't very good either. That you still need
coding skills to get popular platforms that have been in use for many
years to ingest and serve up things as simple as documents and images
speaks volumes.
kyle