Hello, Emily --

As a first pass, you may want to create and record checksums for all the files on the hard drive, then examine which checksums are identical.  Those files will be bit-for-bit exact copies of each other, and can be safely deduped.

This technique won't catch the files where the content is substantially the same, except for insignificant changes (an embedded date stamp, for example), but it may get you some ways down the path.
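A minimal sketch of that first pass, in Python, might look like the following. The choice of SHA-256 and the function names `file_checksum` and `find_duplicates` are assumptions made for illustration, not part of any particular tool; the script only reports groups of identical files and deletes nothing.

```python
import hashlib
import os
from collections import defaultdict


def file_checksum(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root):
    """Map each checksum to the sorted paths under root that share it."""
    by_checksum = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks and anything that is not a regular file.
            if os.path.isfile(path) and not os.path.islink(path):
                by_checksum[file_checksum(path)].append(path)
    # Keep only the checksums that occur more than once.
    return {c: sorted(p) for c, p in by_checksum.items() if len(p) > 1}
```

Calling `find_duplicates("/path/to/drive")` (a placeholder path) returns a dictionary whose values are the lists of bit-for-bit identical files, which can be reviewed before anything is removed.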

-- Scott

-- 
Scott Prater
Digital Library Architect
UW Digital Collections Center
University of Wisconsin - Madison

-----Original Message-----
From: Code for Libraries <[log in to unmask]> On Behalf Of Emily Lavins
Sent: Monday, October 23, 2023 10:12 AM
To: [log in to unmask]
Subject: [CODE4LIB] Deduping with finesse

Hello Code4Lib,

I received a question about deduping from one of our archivists and I'm wondering if anyone has any experience/recommendations for this sort of thing.

In short: We received a hard drive that has massive amounts of duplicates, and they are starting the process of deduping and arranging it. They want somewhat finer control over which duplicates get retained (currently using FSlint and BitCurator), so they can ensure 'complete sets' of files are retained. But it'd be great to not have to manually select *every* dedup preference in FSlint.

For example:
1. There is at least one folder that contains numbered audio tracks. When we ran fslint raw, a few of these got deduped in favor of other copies in the filesystem. But it would have been preferred to keep these together.
2. If there is a directory in which most of the working files were originally created together, we'd prefer to keep the copies in that directory.
3. We'd also generally prefer to keep the copies that will *not* result in, post-dedup, folders containing only a single file scattered throughout the directory.
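Preferences like these could be expressed as a ranking rule applied to each group of identical files. The Python sketch below is one hypothetical way to do that, not a feature of FSlint or BitCurator: it assumes the duplicate groups have already been found (as a mapping from checksum to paths) and simply favours the copy that sits in the directory holding the most files, so that fuller folders stay complete. The function name `choose_keepers` and the record layout are invented for illustration. It only returns a list of proposed decisions, which can be saved as a log and reviewed before anything is deleted.

```python
import os
from collections import Counter


def choose_keepers(duplicate_groups, all_paths):
    """Propose one path to keep per duplicate group.

    duplicate_groups: dict mapping a checksum to a list of identical files.
    all_paths: every file path on the drive, used to measure directory size.
    """
    # Count how many files each directory holds in total.
    dir_sizes = Counter(os.path.dirname(p) for p in all_paths)
    decisions = []
    for checksum, paths in sorted(duplicate_groups.items()):
        # Fullest directory first; ties are broken alphabetically by path.
        ranked = sorted(
            paths, key=lambda p: (-dir_sizes[os.path.dirname(p)], p)
        )
        decisions.append(
            {"checksum": checksum, "keep": ranked[0], "remove": ranked[1:]}
        )
    return decisions
```

Other rules (for example, always preferring a named folder of numbered audio tracks) could be added by extending the sort key.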

Hopefully some of that makes sense. Has anyone found any helpful workflows for streamlining the deduping/arranging process?

All I could come up with is logging all of FSlint's decisions, so that any undesirable dedups could be more easily tracked/reversed later, but I really just don't know enough about any of this.

Thank you very much for your time and thoughts.

All the best,
Emily


--
Emily Lavins
Associate Systems Librarian
Boston College Libraries