Ken,

A group in Chicago has been working for a few years now on a deduplication toolkit that might do what you are looking for. They also have a couple of versions that work with an Excel or .csv file.

https://github.com/datamade/dedupe
https://github.com/datamade/dedupe-web
https://github.com/datamade/csvdedupe

I have not worked with them extensively, but I have heard that others find them very useful for entity recognition and resolution.

-----Original Message-----
From: Code for Libraries [mailto:[log in to unmask]] On Behalf Of Ken Irwin
Sent: Friday, March 21, 2014 2:25 PM
To: [log in to unmask]
Subject: [CODE4LIB] tool for finding close matches in vocabular list

Hi folks,

I'm looking for a tool that can look through a list of all the subject terms in a poorly-controlled index and pick out possible candidates for term consolidation. Our student newspaper index has about 16,000 subject terms, and they include a lot of meaningless typographical and nomenclatural differences, e.g.:

Irwin, Ken
Irwin, Kenneth
Irwin, Mr. Kenneth
Irwin, Kenneth R.

Basketball - Women
Basketball - Women's
Basketball-Women
Basketball-Women's

I would love to have some sort of pattern-matching tool that's smart about this sort of thing, one that could go through the list of terms (as a text list, database, XML file, or whatever structure it wants to ingest) and spit out some clusters of possible matches. Does anyone know of a tool that's good for that sort of thing?

The index is just a bunch of MySQL tables. There is no real controlled-vocab system, though I've recently built some systems to suggest known SHs to reduce this sort of redundancy.

Any ideas?

Thanks!
Ken
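
For a quick first pass before trying a full toolkit, the kind of clustering asked about above can be roughed out with Python's standard library alone. This is only a minimal sketch and is not how dedupe works: the function names normalize, similarity and cluster_terms, the 0.8 threshold, and the greedy grouping strategy are all assumptions chosen for illustration. The sample terms are the ones from the original message.

```python
import re
from difflib import SequenceMatcher


def normalize(term):
    """Lowercase, turn runs of punctuation into one space, and trim."""
    return re.sub(r"[^a-z0-9]+", " ", term.lower()).strip()


def similarity(a, b):
    """Return difflib's similarity ratio (0.0 to 1.0) for two strings."""
    return SequenceMatcher(None, a, b).ratio()


def cluster_terms(terms, threshold=0.8):
    """Greedily group terms whose normalized forms look alike.

    A term joins the first existing cluster that contains any member
    scoring at or above the threshold; otherwise it starts a new cluster.
    The 0.8 threshold is an arbitrary starting point (an assumption),
    not a recommended value.
    """
    clusters = []  # each cluster is a list of (normalized, original) pairs
    for term in terms:
        key = normalize(term)
        for cluster in clusters:
            if any(similarity(key, other) >= threshold for other, _ in cluster):
                cluster.append((key, term))
                break
        else:
            # No existing cluster was close enough.
            clusters.append([(key, term)])
    return [[original for _, original in cluster] for cluster in clusters]


# Sample terms taken from the original message.
terms = [
    "Irwin, Ken",
    "Irwin, Kenneth",
    "Irwin, Mr. Kenneth",
    "Irwin, Kenneth R.",
    "Basketball - Women",
    "Basketball - Women's",
    "Basketball-Women",
    "Basketball-Women's",
]

for group in cluster_terms(terms):
    print(group)
```

On these sample terms the Irwin variants end up in one group and the Basketball variants in another. Comparing every term against every cluster member gets slow at around 16,000 terms, so in practice it would probably help to compare only terms that share a first letter or first word, and to treat the resulting clusters as suggestions for human review rather than automatic merges.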