On Wed, Mar 20, 2013 at 2:22 AM, chris fitzpatrick wrote:

> Anyone please correct me if this is wrong. An md5/sha1 file hash would also
> not catch any image derivatives, like crops, or cases where they added text
> or tweaked the contrast or photoshopped their cat into the shot...

> If you really wanted to geek out, you could look into some machine
> learning techniques to build a classifier that groups the images for you,
> which might be more a PhD project for someone....

Agreed. BTW, exiftool might be very useful for detecting photos manipulated
in this way, because the original create time shouldn't be touched, and
there are some other data points you'd be able to use for comparison. YMMV
depending on the software used to manipulate the images.
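
As a rough sketch of that metadata comparison (Python, purely illustrative):
assuming the tags have already been dumped into a list of dicts, for example
with `exiftool -json`, you could group files that share a capture time and
camera model. The function name and sample records are made up, and burst
shots taken within the same second will land in the same group too, so treat
the output as candidates to eyeball rather than confirmed dups.

```python
from collections import defaultdict


def group_possible_derivatives(records):
    """Group metadata records that share capture time and camera model."""
    groups = defaultdict(list)
    for rec in records:
        taken = rec.get("DateTimeOriginal")
        if not taken:
            # The capture time did not survive editing; metadata can't match it.
            continue
        groups[(taken, rec.get("Model"))].append(rec["SourceFile"])
    # Keep only keys shared by more than one file.
    return {key: sorted(files) for key, files in groups.items() if len(files) > 1}


# Hypothetical sample records, shaped like exiftool's JSON output.
records = [
    {"SourceFile": "a.jpg", "DateTimeOriginal": "2012:07:04 10:15:30", "Model": "Canon EOS 7D"},
    {"SourceFile": "a_crop.jpg", "DateTimeOriginal": "2012:07:04 10:15:30", "Model": "Canon EOS 7D"},
    {"SourceFile": "b.jpg", "DateTimeOriginal": "2012:07:05 09:00:00", "Model": "Canon EOS 7D"},
    {"SourceFile": "c.png"},
]

for key, files in group_possible_derivatives(records).items():
    print(key, files)
```

Files like c.png that lost their capture time are skipped here, which is
the YMMV case: those would still need a visual comparison.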

Picasa is very good at finding similar images. I would have suggested it
earlier except I have no idea how it would perform on 300K photos. It works
quite well in the 20K-30K range, though it really seems designed to work
with sets of up to several thousand, which makes sense given who they aim
it at. But I hate that it mangles metadata, since that makes it difficult
to use for tagging unless you don't care about the original metadata. It is
also graphically oriented. I'm pretty sure that it would be far more
efficient to use metadata than to have Picasa try to figure things out and
then list what it thought were dups.

> A less sexy but really good strategy would also be to use AWS Mechanical
> Turk, which I think seems like a really good way to get some basic image
> annotation.
> Good luck!

My guess is that you'd get better results faster and cheaper just going
with a combination of image metadata and talking to the researcher a bit.
The problem with Mechanical Turk is that the workers won't actually know
what they're looking at, and you're likely to just get inconsistent
keywords that are all over the place (i.e. garbage). Using metadata, you
can work out which equipment and times go with which groups, places,
events, etc. You need a little back and forth to get started, but the
result should be more consistent, so people can do things like actually
drill through the images.
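
To make the equipment/time idea concrete, here is a minimal sketch (Python).
The rules table is hypothetical: it stands in for whatever you learn from
the researcher about which camera was used where and when. It also assumes
the capture time is in the usual EXIF `YYYY:MM:DD HH:MM:SS` form, which may
not hold for every file.

```python
from datetime import datetime

EXIF_FMT = "%Y:%m:%d %H:%M:%S"

# Hypothetical rules from the researcher: (camera model, start, end, label).
RULES = [
    ("Canon EOS 7D", datetime(2012, 7, 1), datetime(2012, 7, 14), "Field trip A"),
    ("NIKON D90", datetime(2012, 7, 1), datetime(2012, 8, 31), "Lab documentation"),
]


def labels_for(rec, rules=RULES):
    """Return every label whose camera and date range match this record."""
    taken = datetime.strptime(rec["DateTimeOriginal"], EXIF_FMT)
    return [
        label
        for model, start, end, label in rules
        if rec.get("Model") == model and start <= taken <= end
    ]


photo = {
    "SourceFile": "a.jpg",
    "DateTimeOriginal": "2012:07:04 10:15:30",
    "Model": "Canon EOS 7D",
}
print(labels_for(photo))
```

Because every photo from the same camera and date range gets the same label,
the keywords stay consistent in a way free-form tagging would not.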