On Thu, Oct 26, 2017 at 7:03 AM, Jonathan Rochkind wrote:

> I think it's actually worth interrogating and getting specific about what
> we mean by "preservation features".
>
> I think they may not actually be all that complicated or hard to add on to
> nearly any solution. I think an actual 'repository solution' may actually
> not be as complicated as people assume when you actually specify it.
>
> The main preservation feature people actually use, "fixity", is just taking
> a checksum of a file (perhaps using SHA1), storing it somewhere, and then
> later checking to make sure the file still has the same checksum, and
> alerting if it does not. This is a relatively simple feature to add to any
> software.

It's also worth considering the function of the checksum. I believe the argument that checksums mitigate bit rot on a modern filesystem is weak. The normal way checksums are implemented presumes the following are immune to bit rot:

- the OS
- every dependency for the program code
- the interpreter
- the checksum itself

Another way of putting it is that the assumption is that bit rot only affects very specific types of assets. Fortunately, modern filesystems detect and repair errors, which is why they can be trusted for important things.

People and rogue processes may intentionally or unintentionally mess things up. Checksums are potentially useful here, but there are multiple mechanisms that can be used for that purpose. Note that checksums are useless against intentional modification if those who modify assets also have permission to modify the checksums.

We have a simple approach. We've been moving everything to Amazon Glacier for cold storage using a specially configured S3 bucket that allows us to use DOIs as keys to retrieve things. IAM policies are set up to prevent undesirable activity, support versioning, etc. We're also slowly moving towards using S3 for hot storage of assets with modest IO requirements.
This process has been a bit slow because there are issues with organizational policy and with people wrapping their minds around how S3 and Glacier work -- even a lot of tech people here seem to think of them in terms of the disks of yore, mounted in a rack somewhere else, with ordinary files stored and transmitted in cleartext.

Bottom line is that depending on what Josh needs to do, there may well be options that are far easier, cheaper, and more reliable than what he has or could possibly achieve in an all-in-one solution labeled as a "repository." No need to use a chain saw to cut butter.

kyle
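
For reference, the fixity check described in the quoted message (take a checksum, store it, re-check later, alert on mismatch) can be sketched roughly as follows. This is only an illustration and not anything either poster said they run: the function names and the JSON manifest format are hypothetical, and SHA-256 is an assumed default in place of the SHA1 mentioned in the quote.

```python
import hashlib
import json
from pathlib import Path


def file_checksum(path, algorithm="sha256", chunk_size=1024 * 1024):
    """Return the hex digest of a file, read in chunks to bound memory use."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def record_checksums(paths, manifest_path, algorithm="sha256"):
    """Store a checksum for each file in a JSON manifest (hypothetical format)."""
    manifest = {
        "algorithm": algorithm,
        "files": {str(p): file_checksum(p, algorithm) for p in paths},
    }
    Path(manifest_path).write_text(json.dumps(manifest, indent=2))
    return manifest


def verify_checksums(manifest_path):
    """Return the paths that are missing or whose checksum no longer matches."""
    manifest = json.loads(Path(manifest_path).read_text())
    failures = []
    for name, expected in manifest["files"].items():
        path = Path(name)
        if not path.is_file() or file_checksum(path, manifest["algorithm"]) != expected:
            failures.append(name)
    return failures
```

A caller would run `record_checksums` once at ingest and `verify_checksums` on a schedule, alerting on any non-empty result. As noted above, this only guards against intentional modification if the manifest is kept somewhere that the people or processes able to modify the assets cannot also write to.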