On Fri, Jan 11, 2013 at 07:45:21PM +0000, Joshua Welker wrote:
> Thanks for bringing up the issue of the cost of making sure the data is consistent. We will be using DSpace for now, and I know DSpace has some checksum functionality built in out of the box. It shouldn't be too difficult to write a script that loops through DSpace's checksum data and compares it against the files in Glacier. Reading the Glacier FAQ on Amazon's site, it looks like they provide an archive inventory (updated daily) that can be downloaded as JSON, and some users have reported that this inventory includes checksum data. So hopefully it will just be a matter of comparing the local checksum to the Glacier checksum, and that would be easy enough to script.
An important question to ask here, though, is whether that included
checksum data is the same data Amazon uses to perform the "systematic
data integrity checks" they mention in the Glacier FAQ, or whether it's
just catalog data --- "here's the checksum from when we put it in."
This is always the question we run into when we consider services like
this: can we tease enough information out of them to convince ourselves
that their checking is sufficient?
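For what it's worth, the comparison script Joshua describes could be
sketched roughly as below. Two caveats, both assumptions on my part
rather than anything from the FAQ: the inventory field names
(ArchiveList, ArchiveId, ArchiveDescription, SHA256TreeHash) follow the
inventory format as I understand it, so verify them against a real
inventory download; and since the Glacier checksum is a SHA-256 *tree*
hash over 1 MiB chunks rather than a plain digest, you can't compare it
directly against DSpace's stored (typically MD5) checksums --- you'd
recompute the tree hash from your local copy of each file.

```python
import hashlib
import json

CHUNK = 1024 * 1024  # Glacier tree hashes are built from 1 MiB chunks


def tree_hash(path):
    """Compute a SHA-256 tree hash of a local file.

    Hash each 1 MiB chunk, then repeatedly combine adjacent pairs of
    digests (hashing their concatenation) until one digest remains.
    For files of 1 MiB or less this reduces to a plain SHA-256.
    """
    hashes = []
    with open(path, "rb") as f:
        while True:
            chunk = f.read(CHUNK)
            if not chunk:
                break
            hashes.append(hashlib.sha256(chunk).digest())
    if not hashes:  # empty file
        hashes = [hashlib.sha256(b"").digest()]
    while len(hashes) > 1:
        paired = []
        for i in range(0, len(hashes), 2):
            if i + 1 < len(hashes):
                paired.append(
                    hashlib.sha256(hashes[i] + hashes[i + 1]).digest()
                )
            else:  # odd one out is carried up unchanged
                paired.append(hashes[i])
        hashes = paired
    return hashes[0].hex()


def audit(inventory_json, local_files):
    """Compare Glacier inventory hashes against local files.

    inventory_json: the daily inventory document, as JSON text.
    local_files: dict mapping archive description -> local path (a
    hypothetical convention; it depends on what you put in the
    description field at upload time).

    Returns the ArchiveIds whose local copy does not match.
    """
    mismatches = []
    for archive in json.loads(inventory_json)["ArchiveList"]:
        path = local_files.get(archive["ArchiveDescription"])
        if path and tree_hash(path) != archive["SHA256TreeHash"]:
            mismatches.append(archive["ArchiveId"])
    return mismatches
```

Note that this only tells you the bits you uploaded match the checksum
in the inventory; per Thomas's point above, it says nothing about what
Amazon's own integrity checks actually verify on their side.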
--
Thomas L. Kula | [log in to unmask]
Systems Engineer | Library Information Technology Office
The Libraries, Columbia University in the City of New York