Hi, 

Our process isn't very refined, but right now it's:

- run brunnhilde on the files to get reports; in this case we want the duplicates.csv report (a rough example of the invocation is sketched after this list)

- if requested, the archivist does a manual review of that csv to decide which copy of the file should be retained. The copy to be retained needs to be the first one in the list (i.e. if there are 4 copies of a file listed in duplicates.csv, the first one will be kept). I can see, though, how that would be cumbersome if you had hundreds to correct/move. It could likely be scripted if they all followed a similar pattern (e.g. were all in the same path); a rough sketch of that idea follows the list as well.

- we run a little bash script that reads through that csv line by line. It looks at the checksum: if the checksum does not match the previous checksum in the list, it continues; if it does match, it moves that file to a Duplicates directory and logs that move, e.g. "MOVED <originalFilePath> to <directory>". A minimal sketch of that kind of script is included below too.
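
For reference, the brunnhilde step is a single command. This is only a placeholder invocation (the paths are made up and the exact arguments depend on the brunnhilde version you have installed); the CSV reports, including duplicates.csv, end up under the report directory, normally in a csv_reports subfolder:

    brunnhilde.py /path/to/accession /path/to/report_dir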
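
If the copies you want to keep do follow a pattern, the reordering could be scripted. The awk below is only an illustrative sketch: it assumes the path is in the first column of duplicates.csv and the checksum in the last, that paths contain no embedded commas, and that /data/master/ is a made-up example of the location you'd prefer to keep. Within each checksum group it floats the preferred rows to the top so they become the kept copies.

    # reorder duplicates.csv so preferred copies come first in each checksum group
    PREFER="/data/master/"
    awk -F',' -v prefer="$PREFER" '
        NR == 1 { print; next }        # keep the header row
        {
            sum = $NF                  # checksum assumed to be the last column
            if (!(sum in seen)) { order[++n] = sum; seen[sum] = 1 }
            if (index($1, prefer) == 1) pref[sum] = pref[sum] $0 ORS
            else rest[sum] = rest[sum] $0 ORS
        }
        END { for (i = 1; i <= n; i++) printf "%s%s", pref[order[i]], rest[order[i]] }
    ' duplicates.csv > duplicates_reordered.csv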
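
And for anyone curious what that last step looks like in practice, here's a minimal sketch along the lines of our script (not the actual one). It assumes the path is the first column and the checksum the last of five columns in duplicates.csv, that paths contain no embedded commas, and that rows are already grouped by checksum; adjust the field list to match your version of the report.

    #!/usr/bin/env bash
    # Walk duplicates.csv; whenever a row's checksum matches the previous row's,
    # treat the row as a later copy, move that file to a Duplicates directory,
    # and log the move.
    CSV="duplicates.csv"
    DUPES_DIR="Duplicates"
    LOG="moves.log"

    mkdir -p "$DUPES_DIR"
    prev_checksum=""

    # Skip the header row, then read the remaining rows field by field.
    tail -n +2 "$CSV" | while IFS=',' read -r filepath _size _modified _errors checksum; do
        if [ "$checksum" = "$prev_checksum" ]; then
            mv "$filepath" "$DUPES_DIR/" && echo "MOVED $filepath to $DUPES_DIR" >> "$LOG"
        fi
        prev_checksum="$checksum"
    done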

And, like others have said, none of this captures close duplicates.