Be careful about assuming too much on this.
When I started working with S3, the system required an MD5 sum to upload, and would respond to requests with this "ETag" in the header as well. I therefore assumed that this was integral to the system, and was a good way to compare local files against the remote copies.
Then, maybe a year or two ago, Amazon introduced chunked (multipart) uploads, so that you could send files in pieces and reassemble them once they got to S3. This was good, because it eliminated problems with huge files failing to upload due to network hiccups. I went ahead and implemented it in my scripts. Then, all of a sudden, I started getting invalid checksums. It turns out that for multipart uploads, they now create ETag identifiers that are not the MD5 sum of the underlying file.
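For what it's worth, the multipart ETag format that people have widely observed (I would not treat it as a documented guarantee) is the MD5 of the concatenated binary MD5 digests of each part, followed by a hyphen and the number of parts. A minimal Python sketch under that assumption, and assuming you know the part size used for the upload; `multipart_etag` and its parameters are hypothetical names:

```python
import hashlib


def multipart_etag(data: bytes, part_size: int) -> str:
    """Reproduce the commonly observed S3 multipart ETag for `data`.

    Assumption: ETag = md5(concatenated per-part MD5 digests) + "-" + part count.
    """
    if part_size <= 0:
        raise ValueError("part_size must be positive")
    # Binary (not hex) MD5 digest of each part, in upload order.
    part_digests = [
        hashlib.md5(data[offset:offset + part_size]).digest()
        for offset in range(0, len(data), part_size)
    ]
    combined = hashlib.md5(b"".join(part_digests)).hexdigest()
    return f"{combined}-{len(part_digests)}"
```

Because the result depends on the part size, the same file uploaded with two different part sizes gets two different ETags, and neither equals the plain MD5 of the local file.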
I now store the checksum as a separate piece of header metadata. And my sync script does periodically compare against this. But since this is just metadata, checking it doesn't really prove anything about the underlying file that Amazon has. To do this I would need to write a script that would actually retrieve the file and rerun the checksum. I have not done this yet, although it is on my to-do list at some point. This would ideally happen on an Amazon server so that I wouldn't have to send the file back and forth.
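The retrieve-and-rerun check could look roughly like the sketch below. It only assumes a readable binary stream (for example, the body of a download) and the MD5 previously stored in metadata; the function names are hypothetical and the download itself is left out:

```python
import hashlib
import io
from typing import BinaryIO


def md5_of_stream(stream: BinaryIO, chunk_size: int = 1024 * 1024) -> str:
    """Return the hex MD5 of everything readable from `stream`, read in chunks."""
    digest = hashlib.md5()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def matches_stored_checksum(stream: BinaryIO, stored_md5: str) -> bool:
    """Compare a freshly computed MD5 against the value kept in metadata."""
    return md5_of_stream(stream) == stored_md5.strip().lower()


if __name__ == "__main__":
    # Stand-in for a downloaded object body; a real check would stream it from S3.
    body = io.BytesIO(b"example object contents")
    stored = hashlib.md5(b"example object contents").hexdigest()
    print(matches_stored_checksum(body, stored))
```

Reading in chunks keeps memory use flat, so the same code works for very large files.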
In any case, my main point is: don't assume that you can just check against a checksum from the API to verify a file for digital preservation purposes.
Systems Librarian/Archivist, Historic New England
141 Cambridge Street, Boston, MA 02114
[log in to unmask]
>>> Joshua Welker <[log in to unmask]> 1/11/2013 2:45 PM >>>
Thanks for bringing up the issue of the cost of making sure the data is consistent. We will be using DSpace for now, and I know DSpace has some checksum functionality built in out-of-the-box. It shouldn't be too difficult to write a script that loops through DSpace's checksum data and compares it against the files in Glacier. Reading the Glacier FAQ on Amazon's site, it looks like they provide an archive inventory (updated daily) that can be downloaded as JSON. I read some users saying that this inventory includes checksum data. So hopefully it will just be a matter of comparing the local checksum to the Glacier checksum, and that would be easy enough to script.
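One caveat for that comparison: the checksum in the Glacier inventory is a SHA-256 tree hash (the `SHA256TreeHash` field of each entry in `ArchiveList`), computed over 1 MiB chunks, so it will not match an MD5 directly and has to be computed locally too. A rough Python sketch, assuming the file fits in memory and that a mapping from archive ID to local tree hash already exists (`find_mismatches` and `local_hashes` are hypothetical):

```python
import hashlib
import json

MIB = 1024 * 1024


def sha256_tree_hash(data: bytes, chunk_size: int = MIB) -> str:
    """Compute a SHA-256 tree hash: hash each chunk, then combine hashes pairwise."""
    level = [
        hashlib.sha256(data[i:i + chunk_size]).digest()
        for i in range(0, len(data), chunk_size)
    ] or [hashlib.sha256(b"").digest()]
    while len(level) > 1:
        next_level = []
        for i in range(0, len(level), 2):
            pair = level[i:i + 2]
            if len(pair) == 2:
                next_level.append(hashlib.sha256(pair[0] + pair[1]).digest())
            else:
                # An unpaired hash is carried up to the next level unchanged.
                next_level.append(pair[0])
        level = next_level
    return level[0].hex()


def find_mismatches(inventory_json: str, local_hashes: dict) -> list:
    """Return archive IDs whose inventory tree hash is missing locally or differs.

    `local_hashes` is a hypothetical mapping of ArchiveId -> local tree hash.
    """
    inventory = json.loads(inventory_json)
    return [
        archive["ArchiveId"]
        for archive in inventory.get("ArchiveList", [])
        if local_hashes.get(archive["ArchiveId"]) != archive["SHA256TreeHash"]
    ]
```

As with the S3 metadata, a match here only shows that the inventory agrees with the local copy; it does not re-read the stored bytes.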
From: Code for Libraries [mailto:[log in to unmask]] On Behalf Of Ryan Eby
Sent: Friday, January 11, 2013 11:37 AM
To: [log in to unmask]
Subject: Re: [CODE4LIB] Digital collection backups
As Aaron alludes to, your decision should be based on your real needs, and the options might not be mutually exclusive.
LOCKSS/MetaArchive might be worth the money if it is the community archival aspect you are going for. Depending on your institution, being a participant might make political/mission sense regardless of your storage needs, and it could be just a specific collection that makes sense.
Glacier is a great choice if you are looking to spread a backup across regions. S3 is similar if you also want to benefit from CloudFront (the CDN setup) to take load off your institution's server (you can now use CloudFront with your own origin server as well). Depending on your bandwidth, this might be worth the money regardless of LOCKSS participation (which can be more dark). Amazon also tends to drop prices over time rather than raise them, but as with any outsourcing you have to plan for the possibility that the service might not exist in the future. Also look more closely at Glacier prices in terms of checking your data for consistency. There have been a few papers on the costs of making sure Amazon really has the proper data, which depend on how often your requirements say you must check.
Another option, if you are just looking for more geographic placement, is finding an institution or service provider that will colocate. There may be another small institution that would love to shove a cheap box with hard drives on your network in exchange for the same. It is not as involved/formal as LOCKSS, but it gives you something you control to satisfy your requirements. It could also be as low-tech as shipping SSDs to another institution, which then runs some BagIt checksums on the drive, etc.
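The checksum run at the receiving end can be quite small. The sketch below assumes a bag with a `manifest-md5.txt` of "checksum, whitespace, relative path" lines and only checks payload files; real tools such as the `bagit` Python library do considerably more (tag manifests, other algorithms, encoded paths). `verify_md5_manifest` is a hypothetical name:

```python
import hashlib
from pathlib import Path


def verify_md5_manifest(bag_dir: str) -> list:
    """Return paths from manifest-md5.txt that are missing or fail their checksum."""
    bag = Path(bag_dir)
    failures = []
    manifest = (bag / "manifest-md5.txt").read_text(encoding="utf-8")
    for line in manifest.splitlines():
        if not line.strip():
            continue
        # Each line is "<md5 hex> <path relative to the bag root>".
        expected, rel_path = line.split(None, 1)
        target = bag / rel_path
        if not target.is_file():
            failures.append(rel_path)
            continue
        digest = hashlib.md5()
        with target.open("rb") as handle:
            for chunk in iter(lambda: handle.read(1024 * 1024), b""):
                digest.update(chunk)
        if digest.hexdigest() != expected.lower():
            failures.append(rel_path)
    return failures
```

An empty list means every file listed in the manifest is present and matches; files on the drive that the manifest does not mention are not detected by this sketch.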
All of the above should be scriptable in your workflow. Just need to decide what you really want out of it.
On Fri, Jan 11, 2013 at 11:52 AM, Aaron Trehub <[log in to unmask]> wrote:
> Hello Josh,
> Auburn University is a member of two Private LOCKSS Networks: the
> MetaArchive Cooperative and the Alabama Digital Preservation Network
> (ADPNet). Here's a link to a recent conference paper that describes
> both networks, including their current pricing structures:
> LOCKSS has worked well for us so far, in part because supporting
> community-based solutions is important to us. As you point out,
> however, Glacier is an attractive alternative, especially for
> institutions that may be more interested in low-cost, low-throughput
> storage and less concerned about entrusting their content to a
> commercial outfit or having to pay extra to get it back out. As with
> most things, you pay your money--more or less, depending--and make your choice. And take your risks.
> Good luck with whatever solution(s) you decide on. They need not be
> mutually exclusive.
> Aaron Trehub
> Assistant Dean for Technology and Technical Services
> Auburn University
> 231 Mell Street, RBD Library
> Auburn, AL 36849-5606
> Phone: (334) 844-1716
> Skype: ajtrehub
> E-mail: [log in to unmask]
> URL: http://lib.auburn.edu/