Simon - do you have any examples of MD5 collisions in JHU's collections?
The chance of that occurring is vanishingly small
(http://prezi.com/zfyebvaelksh/fixity-20/), so I'm curious what produced the
collision, and how often.
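
For a rough sense of scale, here is a minimal sketch of the birthday-bound
estimate for accidental (non-adversarial) collisions. It assumes MD5 output
behaves like a uniform random 128-bit value; the function name and the file
counts are illustrative, not figures from any real collection.

```python
import math


def accidental_collision_probability(num_files: int, hash_bits: int = 128) -> float:
    """Approximate chance that at least two of num_files distinct files
    share a digest, assuming digests are uniform random hash_bits-bit values
    (an idealization, not a property MD5 is guaranteed to have)."""
    # Number of distinct pairs of files that could collide.
    pairs = num_files * (num_files - 1) / 2
    # Birthday approximation: p ~= 1 - exp(-pairs / 2**hash_bits).
    # expm1 keeps precision when the exponent is extremely small.
    return -math.expm1(-pairs / 2.0 ** hash_bits)


if __name__ == "__main__":
    # Hypothetical collection sizes, chosen only for illustration.
    for n in (10**6, 10**9, 10**12):
        print(f"{n:>16,d} files: p ~= {accidental_collision_probability(n):.3e}")
```

The estimate says nothing about deliberately crafted files, for which
practical MD5 collision attacks are known.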
On Fri, Oct 3, 2014 at 12:14 PM, Kyle Banerjee <[log in to unmask]>
wrote:
> On Fri, Oct 3, 2014 at 7:26 AM, Charles Blair <[log in to unmask]> wrote:
>
> > Look at slide 15 here:
> > http://www.slideshare.net/DuraSpace/sds-cwebinar-1
> >
> > I think we're worried about the cumulative effect over time of
> > undetected errors (at least, I am).
>
>
> This slide shows that data loss via drive fault is extremely rare. Note
> that a bit getting flipped is usually harmless. However, I do believe that
> data corruption via other avenues will be considerably more common.
>
> My point is that the use case for libraries is generally weak and the
> solution is very expensive -- don't forget the authenticity checks must
> also be done on the "good" files. As you start dealing with more and more
> data, this system is not sustainable for the simple reason that maintained
> disk space costs a fortune and network capacity is a bottleneck. It's no
> big deal to do this on a few TB since our repositories don't have to worry
> about the integrity of dynamic data, but you eventually get to a point
> where it sucks up too many system resources and consumes too much
> expertise.
>
> Authoritative files really should be offline but if online access to
> authoritative files is seen as an imperative, it at least makes more sense
> to just do something like dump it all in Glacier and slowly refresh
> everything you own with the authoritative copy. Or better yet, just leave the
> stuff there and just make new derivatives when there is any reason to
> believe the existing ones are not good.
>
> While I think integrity is an issue, I think other deficiencies in
> repositories are more pressing. Except for the simplest use cases, getting
> stuff in or out of them is a hopeless process even with automated
> assistance. Metadata and maintenance aren't very good either. That you
> still need coding skills to get popular platforms that have been in use for
> many years to ingest and serve up things as simple as documents and images
> speaks volumes.
>
> kyle
>