Hello Code4Lib,
I received a question about deduping from one of our archivists and I'm
wondering if anyone has any experience/recommendations for this sort of
thing.
In short: We received a hard drive containing massive numbers of duplicate
files, and our archivists are starting the process of deduping and
arranging it. They want finer control over which copies get kept (they're
currently using FSlint and BitCurator) so they can ensure 'complete sets'
of files are retained, but it would be great not to have to manually
confirm *every* dedup preference in FSlint.
For example:
1. There is at least one folder containing numbered audio tracks. When we
ran FSlint with its defaults, a few of these tracks were deduped in favor
of copies elsewhere in the filesystem, but we would have preferred to keep
the numbered set together.
2. If there is a directory in which most of the working files were
originally created together, we'd want to keep the copies in that
directory rather than strays elsewhere.
3. We'd also generally prefer to keep the copies that will *not* leave,
post-dedup, folders containing only a single file scattered throughout the
directory tree (see the sketch after this list).
Hopefully some of that makes sense. Has anyone found any helpful workflows
for streamlining the deduping/arranging process?
All I could come up with is logging all of FSlint's decisions, so that any
undesirable dedups could more easily be tracked/reversed later, but I
really just don't know enough about any of this.
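In the same illustrative spirit, maybe even a minimal CSV manifest written
before anything is deleted would cover that. This continues the
hypothetical script above, reusing its dupes and dupes_per_dir variables:

import csv

# One row per copy, flagged keep/remove, so any regrettable dedup can
# be traced back to its surviving twin and restored from it later.
with open("dedup_log.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["sha256", "path", "action"])
    for digest, paths in dupes.items():
        keep = max(paths, key=lambda p: dupes_per_dir[os.path.dirname(p)])
        for p in paths:
            writer.writerow([digest, p, "keep" if p == keep else "remove"])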
Thank you very much for your time and thoughts.
All the best,
Emily
--
Emily Lavins
Associate Systems Librarian
Boston College Libraries