-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 3/21/2014 2:34 PM, Andrew Gordon wrote:
> Ken,
> 
> A group in Chicago has been working for a few years now on a
> deduplication toolkit that might do what you are looking for. They
> also have a couple of versions that work with an Excel file or .csv
> file.
> 
> https://github.com/datamade/dedupe 
> https://github.com/datamade/dedupe-web 
> https://github.com/datamade/csvdedupe
> 
> I have not worked with them extensively, but I have heard others
> find these very useful for entity recognition and resolution.



+1

I attended a very interesting talk on just that:

http://pyvideo.org/video/973/big-data-de-duping

./fxk

- -- 
QOTD:
	"A child of 5 could understand this!  Fetch me a child of 5."
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (MingW32)
Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/

iQEcBAEBAgAGBQJTLIj9AAoJEOptrq/fXk6Mjl4H/jMa3b+ekRYNnnvLBdMXUr/C
p+0tAu3SI5GkfbWe1JGLU6cPcM0Ret22RxKg+QslADZ00aGj2RM8sh+4fV0neFXB
/sA7wHh/8thtFW1njKpaLQZg5f+px6zB8ch9wdp4yf7L0pPb1612fxGRHMjH5u51
vFUAF3r6wM3JIYjAEPKhzq5511soASisV0IWMEyAoRYNyjKbOyan/gN97G/oYxXp
MvwxFAwiOPgwL83Set0kMqztCA2aW76uFwwgvWkhGIcywBR7w7Adl1/MTM9oLBtd
lyeimBXWKvqvArai9txMcC4mOLkZq03FAWypVhe+VOBm4xmmDhowr3YeaaJWl3k=
=Kv3q
-----END PGP SIGNATURE-----