A checksum will work well if you want to find exact copies of the file
content. File metadata (like filename, access date, permissions) would
not affect the calculation of a SHA-1 or MD5 hash, but I'm pretty sure
EXIF or other metadata (which I think is usually stored in the file
header) would change the hash. Anyone please correct me if this is
wrong. An MD5/SHA-1 file hash would also not catch image derivatives,
like crops, added text, tweaked contrast, or somebody photoshopping
their cat into the shot.

There are also various perceptual hash algorithms that you can
calculate to compare image content. I think the three most popular are
Average Hash, Perceptual Hash, and Difference Hash. A couple of good
blog posts I saw on Hacker News about this:
http://www.hackerfactor.com/blog/index.php?/archives/432-Looks-Like-It.html
http://www.hackerfactor.com/blog/index.php?/archives/529-Kind-of-Like-That.html
I've used pHash before and got good results, but it was super duper
slow, or at least the way I wrote it. dHash is newer and looks better.
There are implementations of these in various languages; here's a
Python library that has all three (I haven't used it, though):
https://pypi.python.org/pypi/ImageHash/0.1
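
I haven't tried that library myself, but going by its documentation
(for current releases; the 0.1 API may differ), using it would look
roughly like this. The threshold is a made-up number you'd want to tune:

    import imagehash
    from PIL import Image

    def looks_similar(path_a, path_b, threshold=5):
        # dHash each image; subtracting two hashes gives the Hamming
        # distance, and a small distance means similar-looking images.
        hash_a = imagehash.dhash(Image.open(path_a))
        hash_b = imagehash.dhash(Image.open(path_b))
        return hash_a - hash_b <= threshold

Swap in imagehash.average_hash or imagehash.phash to try the other two
algorithms.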

If you really wanted to geek out, you could look into some machine
learning techniques to build a classifier that groups the images for
you, though that might be more of a PhD project for someone....

A less sexy but really solid strategy would be to use AWS Mechanical
Turk, which seems like a good way to get some basic image annotation
done.
Good luck!

b,chris.
> Dave Caroline <mailto:[log in to unmask]>
> March 20, 2013 9:18 AM
> I had a project to de-duplicate many images and other files too.
> I wrote a little ditty in PHP, but the idea can be used in any language.
>
>
> I have a set of tables in MySQL.
> Give the utility a set of root directories to test and compare,
> then trawl the filesystems for filename, location, and size, and
> store those in the first table.
> Issue this SQL:
>   insert into duplicatesizetable
>   select filesize, count(filesize) as qty from nametable
>   group by filesize having qty > 1
> You now have the sizes of possible duplicates.
> Only now do you crc/md5sum the files of those sizes, and update the
> nametable with the crc/md5 values as calculated.
> (There can be false positives if two different files' crc values are
> the same.)
> Then a final bit of SQL:
>   select filesize, count(crc) as qty from nametable
>   group by filesize, crc having qty > 1
> I store the results in a table so I can leave the job and come back.
>
> Join the results to the nametable,
> which gets the real duplicate sizes and crcs; these can now be used
> to guide a human to clean up the mess.
>
> I give the user a table showing the n files, with buttons to delete,
> view, or ignore (you may want to keep two or more copies).
> For safety, one can leave part of the filesystem write-protected;
> also, the form only allows one button per group.
>
> http://www.archivist.info/Screenshot_Delete_duplicates.png
> That took only a few minutes to get the duplicates from a 9 GB
> picture directory.
>
> Dave Caroline
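
Dave's size-first trick above is the key speedup: you only pay for
hashing when two files already share a size. Here's a rough sketch of
the same idea in Python, with in-memory dicts standing in for his MySQL
tables (this is my sketch, not his PHP; paths are placeholders):

    import hashlib
    import os
    from collections import defaultdict

    def duplicates_by_size_then_hash(roots):
        # Pass 1: group every file by size; cheap, no contents read.
        by_size = defaultdict(list)
        for root in roots:
            for dirpath, _dirs, names in os.walk(root):
                for name in names:
                    path = os.path.join(dirpath, name)
                    by_size[os.path.getsize(path)].append(path)

        # Pass 2: hash only files whose size matches another file's,
        # mirroring the "having qty > 1" SQL step above.
        groups = defaultdict(list)
        for size, paths in by_size.items():
            if len(paths) < 2:
                continue
            for path in paths:
                md5 = hashlib.md5()
                with open(path, 'rb') as f:
                    for chunk in iter(lambda: f.read(1 << 20), b''):
                        md5.update(chunk)
                groups[(size, md5.hexdigest())].append(path)
        # Only (size, hash) groups with 2+ members are real duplicates.
        return {k: v for k, v in groups.items() if len(v) > 1}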
> Carmen Mitchell <mailto:[log in to unmask]>
> March 19, 2013 9:51 PM
> Hello Code4Libbers,
>
> I'm working with a faculty member and trying to help them to formalize
> their data collection practices. Part of this process is also going 
> through
> old data and trying to assess what they currently have. This particular
> faculty member has been doing research for 10 years without any kind of
> structure or regular method. So far we have over 2 TB of data in various
> states. (With more to come.)
>
> I've got a programmer working with me to:
> a) identify file types
> b) count how many files of each type
>
> We are now working on de-duping and assessing file size, focusing on the
> JPEGs first. With over 300,000 of them...it might take a while. (Of
> course they aren't following any kind of file naming structure,
> either...It's a mess.)
>
> Any tips or tricks or tools that you might know of to help speed up this
> process? Is there a good image recognition tool that you could suggest 
> that
> would help us with automation?
>
> Thanks,
>
> Carmen Mitchell
> Institutional Repository Librarian
> Cal State San Marcos
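
For steps (a) and (b) in Carmen's message, here's a quick Python
sketch. It counts by file extension, which is only a heuristic given
the naming mess; a content sniffer like the `file` command or the
python-magic library would be more reliable. The root path is a
placeholder:

    import os
    from collections import Counter

    def count_file_types(root):
        counts = Counter()
        for dirpath, _dirs, names in os.walk(root):
            for name in names:
                ext = os.path.splitext(name)[1].lower() or '(none)'
                counts[ext] += 1
        return counts

    for ext, n in count_file_types('/data/collection').most_common():
        print(ext, n)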