I had a project to de-duplicate many images and other files.
I wrote a little ditty in PHP but the idea can be used in any language.
I have a set of tables in MySQL.
give the utility a set of root directories to test and compare
trawl the filesystems for filename, location and size and store them in the first table
issue sql:
insert into duplicatesizetable select
filesize, count(filesize) as qty from nametable group by filesize
having qty > 1
you now have the sizes of possible duplicates
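
The scan and the size query above can be sketched as follows. This is only a rough illustration, not my PHP: it assumes Python with the built-in sqlite3 module standing in for MySQL, and the column names path and crc and the function names are made up for the example (only nametable, duplicatesizetable, filesize and qty come from the description).

```python
import os
import sqlite3


def scan_roots(conn, roots):
    """Record path and size of every regular file under the given roots."""
    # Assumed schema: the column names "path" and "crc" are illustrative.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS nametable "
        "(path TEXT PRIMARY KEY, filesize INTEGER, crc TEXT)"
    )
    for root in roots:
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                full = os.path.join(dirpath, name)
                if os.path.islink(full) or not os.path.isfile(full):
                    continue  # skip symlinks and special files
                try:
                    size = os.path.getsize(full)
                except OSError:
                    continue  # file vanished or is unreadable
                conn.execute(
                    "INSERT OR REPLACE INTO nametable (path, filesize) "
                    "VALUES (?, ?)",
                    (full, size),
                )
    conn.commit()


def find_candidate_sizes(conn):
    """Fill duplicatesizetable with the sizes shared by more than one file."""
    conn.execute("DROP TABLE IF EXISTS duplicatesizetable")
    conn.execute(
        "CREATE TABLE duplicatesizetable AS "
        "SELECT filesize, COUNT(filesize) AS qty FROM nametable "
        "GROUP BY filesize HAVING COUNT(filesize) > 1"
    )
    conn.commit()
    return conn.execute(
        "SELECT filesize, qty FROM duplicatesizetable ORDER BY filesize"
    ).fetchall()
```

Nothing is read from inside the files at this stage, only directory metadata, which is why this pass is cheap even on a large tree.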
only now do you crc/md5sum the files of that size
update the nametable with crc/md5 values as calculated
there can be false positives if two different files have the same crc value
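
A sketch of the hashing step, again assuming Python and sqlite3 rather than my PHP and MySQL, and assuming nametable(path, filesize, crc) and duplicatesizetable(filesize, qty) already exist. The function names are hypothetical. The byte-for-byte check at the end is an extra that is not part of the original utility; it is one way to rule out the false positives mentioned above.

```python
import filecmp
import hashlib
import sqlite3


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 of a file, reading it in chunks to bound memory."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def hash_candidates(conn):
    """Hash only files whose size appears in duplicatesizetable.

    Rows that already have a crc are skipped, so the job can be
    interrupted and resumed. Returns the number of files hashed.
    """
    rows = conn.execute(
        "SELECT path FROM nametable "
        "WHERE crc IS NULL "
        "AND filesize IN (SELECT filesize FROM duplicatesizetable)"
    ).fetchall()
    hashed = 0
    for (path,) in rows:
        try:
            crc = md5_of_file(path)
        except OSError:
            continue  # unreadable or vanished file: leave crc as NULL
        conn.execute(
            "UPDATE nametable SET crc = ? WHERE path = ?", (crc, path)
        )
        hashed += 1
    conn.commit()
    return hashed


def same_bytes(path_a, path_b):
    """Optional extra check: compare two files byte for byte."""
    return filecmp.cmp(path_a, path_b, shallow=False)
```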
then a final bit of sql
select filesize, crc, count(crc) as qty from nametable group by filesize, crc
having qty > 1
I store the results in a table so I can leave the job and come back
join the results to the nametable,
which gives the real duplicate sizes and crcs; these can now be used
to guide a human to clean up the mess
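
The final grouping and the join back to the nametable might look roughly like this. As before it is a sketch in Python with sqlite3, not the actual PHP and MySQL; the results table name duplicategroups, the path column and the function name are assumptions made for the example.

```python
import sqlite3


def list_duplicate_groups(conn):
    """Return {(filesize, crc): [paths]} for groups of two or more files."""
    # Keep the grouped result in its own table so the job can be resumed.
    conn.execute("DROP TABLE IF EXISTS duplicategroups")
    conn.execute(
        "CREATE TABLE duplicategroups AS "
        "SELECT filesize, crc, COUNT(crc) AS qty FROM nametable "
        "WHERE crc IS NOT NULL "
        "GROUP BY filesize, crc HAVING COUNT(crc) > 1"
    )
    # Join back to nametable to get the individual files in each group.
    rows = conn.execute(
        "SELECT g.filesize, g.crc, n.path "
        "FROM duplicategroups AS g "
        "JOIN nametable AS n "
        "ON n.filesize = g.filesize AND n.crc = g.crc "
        "ORDER BY g.filesize DESC, g.crc, n.path"
    ).fetchall()
    conn.commit()
    groups = {}
    for filesize, crc, path in rows:
        groups.setdefault((filesize, crc), []).append(path)
    return groups
```

Each key in the returned dictionary is one group to show to the human; files that share a size but not a crc never make it into a group.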
I give the user a table showing the n files with buttons to delete,
view, or ignore (you may want to keep two or more copies)
for safety one can also leave part of the filesystem write protected
the form only allows one button per group
http://www.archivist.info/Screenshot_Delete_duplicates.png
that took only a few minutes to get the duplicates from a 9 GB picture directory
Dave Caroline