On Thu, Oct 26, 2017 at 7:03 AM, Jonathan Rochkind <[log in to unmask]>
wrote:
> I think it's actually worth interrogating and getting specific about what
> we mean by "preservation features".
>
> I think they may not actually be all that complicated or hard to add on to
> nearly any solution. I think an actual 'repository solution' may actually
> not be as complicated as people assume when you actually specify it.
>
> The main preservation feature people actually use, "fixity", is just taking
> a checksum of a file (perhaps using SHA1), storing it somewhere, and then
> later checking to make sure the file still has the same checksum, and
> alerting if it does not. This is a relatively simple feature to add to any
> software.
>
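
For concreteness, the fixity check described above amounts to something
like the following. This is only a minimal sketch in Python: the function
names are made up for illustration, the manifest is just a dict, and it
defaults to SHA-256 rather than the SHA1 mentioned above (the algorithm is a
parameter).

```python
import hashlib
from pathlib import Path


def file_digest(path, algorithm="sha256", chunk_size=1 << 20):
    """Return the hex digest of the file at `path`, reading it in chunks."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def record_fixity(paths, algorithm="sha256"):
    """Build a {path: digest} manifest to be stored somewhere else."""
    return {str(p): file_digest(p, algorithm) for p in paths}


def audit_fixity(manifest, algorithm="sha256"):
    """Return the paths that are missing or no longer match the manifest."""
    failures = []
    for path, expected in manifest.items():
        if not Path(path).is_file() or file_digest(path, algorithm) != expected:
            failures.append(path)
    return failures
```

Anything returned by audit_fixity() is what you would alert on.
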
It's also worth considering the function of the checksum. I believe the
argument that checksums mitigate bit rot on a modern filesystem is weak.
The normal way checksums are implemented presumes the following are immune
to bit rot:
- the OS
- every dependency for the program code
- the interpreter
- the checksum itself
Put another way, the assumption is that bit rot only affects very specific
types of assets. Fortunately, modern filesystems detect and repair errors,
which is why they can be trusted for important things.
People and rogue processes may intentionally or unintentionally mess things
up. Checksums are potentially useful here, but there are multiple
mechanisms that can be used for that purpose. Note that checksums are
useless against intentional modification if those who modify assets also
have permissions to modify checksums.
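
To illustrate that last point, here is a toy sketch in Python. The in-memory
"store" and the function names are hypothetical stand-ins for wherever the
assets and checksums actually live; the only thing it is meant to show is
that the check passes again once whoever changed the data can also rewrite
the checksum.

```python
import hashlib


def digest(data: bytes) -> str:
    """Hex SHA-256 of a byte string."""
    return hashlib.sha256(data).hexdigest()


def passes_fixity(store: dict, name: str) -> bool:
    """True if the stored bytes still match the stored checksum."""
    entry = store[name]
    return digest(entry["data"]) == entry["checksum"]


def tamper(store: dict, name: str, new_data: bytes, can_write_checksum: bool) -> None:
    """Simulate a modification by someone who may or may not control the checksum."""
    store[name]["data"] = new_data
    if can_write_checksum:
        # Same permissions on both: the evidence is rewritten along with the asset.
        store[name]["checksum"] = digest(new_data)
```

With can_write_checksum=False the change is caught; with True it is not.
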
We have a simple approach. We've been moving everything to Amazon Glacier
for cold storage using a specially configured S3 bucket that allows us to
use DOIs as keys to retrieve things. IAM policies are set up to prevent
undesirable activity, support versioning, etc.
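
Roughly, a setup of that kind could look like the sketch below. To be clear,
this is not our actual configuration: the key layout, prefix, rule ID, and
function names are all assumptions made up for illustration. It only builds
the documents (a DOI-to-key mapping, a lifecycle rule that transitions
objects to the GLACIER storage class, and an IAM identity policy that denies
deletes) and leaves applying them to whatever tooling you use.

```python
def doi_to_key(doi: str, prefix: str = "objects/") -> str:
    """Map a DOI to an object key (hypothetical layout: prefix + lowercased DOI)."""
    return prefix + doi.strip().lower()


def glacier_lifecycle(prefix: str = "objects/", days: int = 0) -> dict:
    """Lifecycle configuration moving objects under `prefix` to Glacier after `days`."""
    return {
        "Rules": [
            {
                "ID": "archive-to-glacier",  # illustrative rule name
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Transitions": [{"Days": days, "StorageClass": "GLACIER"}],
            }
        ]
    }


def deny_delete_policy(bucket: str) -> dict:
    """IAM identity policy that denies deleting objects or object versions."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyDeletes",
                "Effect": "Deny",
                "Action": ["s3:DeleteObject", "s3:DeleteObjectVersion"],
                "Resource": f"arn:aws:s3:::{bucket}/*",
            }
        ],
    }
```

For example, doi_to_key("10.1234/ABC.5") gives "objects/10.1234/abc.5", so
retrieval by DOI is just a key lookup.
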
We're also slowly moving towards using S3 for hot storage assets with
modest IO requirements. This process has been a bit slow because of
organizational policy issues and because people are still wrapping their
minds around how S3 and Glacier work -- even a lot of tech people here seem
to think of them in terms of the disks of yore, mounted in a rack somewhere
else with ordinary files stored and transmitted in cleartext.
Bottom line is that depending on what Josh needs to do, there may well be
options that are far easier, cheaper, and more reliable than what he has or
could possibly achieve in an all-in-one solution labeled as a
"repository." No need to use a chain saw to cut butter.
kyle