If you have no opposition to Python, I suggest looking at Fuzzywuzzy:
https://github.com/seatgeek/fuzzywuzzy

M.

--
Mark A. Matienzo
Director of Technology, Digital Public Library of America

On Fri, Mar 21, 2014 at 2:34 PM, Andrew Gordon wrote:

> Ken,
>
> A group in Chicago has been working for a few years now on a deduplication
> toolkit that might do what you are looking for; they also have a couple of
> versions that work with an Excel file or .csv file.
>
> https://github.com/datamade/dedupe
> https://github.com/datamade/dedupe-web
> https://github.com/datamade/csvdedupe
>
> I have not worked with them extensively, but I have heard others find
> these very useful for entity recognition and resolution.
>
> -----Original Message-----
> From: Code for Libraries On Behalf Of Ken Irwin
> Sent: Friday, March 21, 2014 2:25 PM
> Subject: [CODE4LIB] tool for finding close matches in vocabular list
>
> Hi folks,
>
> I'm looking for a tool that can look at a list of all of the subject terms
> in a poorly-controlled index as possible candidates for term consolidation.
> Our student newspaper index has about 16,000 subject terms and they include
> a lot of meaningless typographical and nomenclatural differences, e.g.:
>
> Irwin, Ken
> Irwin, Kenneth
> Irwin, Mr. Kenneth
> Irwin, Kenneth R.
>
> Basketball - Women
> Basketball - Women's
> Basketball-Women
> Basketball-Women's
>
> I would love to have some sort of pattern-matching tool that's smart about
> this sort of thing that could go through the list of terms (as a text list,
> database, xml file, or whatever structure it wants to ingest) and spit out
> some clusters of possible matches.
>
> Does anyone know of a tool that's good for that sort of thing?
>
> The index is just a bunch of MySQL tables - there is no real
> controlled-vocab system, though I've recently built some systems to suggest
> known SH's to reduce this sort of redundancy.
>
> Any ideas?
>
> Thanks!
> Ken
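
To make the "clusters of possible matches" idea from the thread concrete, here is a minimal sketch. It uses Python's standard-library difflib rather than Fuzzywuzzy or dedupe, so it runs without installing anything. The function names (normalize, similarity, cluster_terms) and the 0.7 threshold are assumptions made for illustration, not part of either tool, and the threshold would need tuning against the real term list.

```python
import re
from difflib import SequenceMatcher


def normalize(term):
    """Lowercase and replace punctuation with spaces, then collapse
    whitespace, so 'Basketball-Women' and 'Basketball - Women' match."""
    term = re.sub(r"[^\w\s]", " ", term.lower())
    return " ".join(term.split())


def similarity(a, b):
    """Similarity ratio between 0.0 and 1.0, computed on normalized terms."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def cluster_terms(terms, threshold=0.7):
    """Greedy single-pass clustering (an illustrative choice): each term
    joins the first cluster whose first member it resembles closely enough,
    otherwise it starts a new cluster."""
    clusters = []
    for term in terms:
        for cluster in clusters:
            if similarity(term, cluster[0]) >= threshold:
                cluster.append(term)
                break
        else:
            clusters.append([term])
    return clusters


# Sample terms taken from the original question.
terms = [
    "Irwin, Ken",
    "Irwin, Kenneth",
    "Irwin, Mr. Kenneth",
    "Irwin, Kenneth R.",
    "Basketball - Women",
    "Basketball - Women's",
    "Basketball-Women",
    "Basketball-Women's",
]

# Only clusters with more than one member are consolidation candidates.
for cluster in cluster_terms(terms):
    if len(cluster) > 1:
        print(cluster)
```

Each term is compared only against the first member of every existing cluster, which keeps the sketch short but makes the result depend on input order. For the full list of about 16,000 terms, a purpose-built tool such as the ones linked above is likely to scale and group better.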