+1 for OpenRefine. Exactly what it is made to do.

Chad

On Fri, Mar 21, 2014 at 2:29 PM, Roy Tennant wrote:

> Have you considered dumping it into OpenRefine? [1] I haven't used it a
> lot, but it is likely a good tool to find similar data and allow you to
> globally replace with a canonical entry.
> Roy
>
> [1] http://openrefine.org/
>
> On Fri, Mar 21, 2014 at 11:24 AM, Ken Irwin wrote:
>
> > Hi folks,
> >
> > I'm looking for a tool that can look at a list of all of the subject
> > terms in a poorly-controlled index as possible candidates for term
> > consolidation. Our student newspaper index has about 16,000 subject
> > terms and they include a lot of meaningless typographical and
> > nomenclatural differences, e.g.:
> >
> > Irwin, Ken
> > Irwin, Kenneth
> > Irwin, Mr. Kenneth
> > Irwin, Kenneth R.
> >
> > Basketball - Women
> > Basketball - Women's
> > Basketball-Women
> > Basketball-Women's
> >
> > I would love to have some sort of pattern-matching tool that's smart
> > about this sort of thing that could go through the list of terms (as a
> > text list, database, xml file, or whatever structure it wants to
> > ingest) and spit out some clusters of possible matches.
> >
> > Does anyone know of a tool that's good for that sort of thing?
> >
> > The index is just a bunch of MySQL tables - there is no real
> > controlled-vocab system, though I've recently built some systems to
> > suggest known SH's to reduce this sort of redundancy.
> >
> > Any ideas?
> >
> > Thanks!
> > Ken
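
For anyone who wants to see the basic idea behind this kind of clustering before loading the terms into a tool, here is a minimal Python sketch. It is loosely modelled on the "key collision" approach (normalize each term to a key, then group terms that share a key) and is not OpenRefine's actual implementation. The function names `fingerprint` and `cluster_terms` are made up for illustration, and dropping a trailing possessive "'s" is my own assumption to make the Basketball examples collide.

```python
import re
import unicodedata
from collections import defaultdict


def fingerprint(term):
    """Reduce a subject term to a normalized key (hypothetical helper)."""
    # Strip accents so that e.g. "é" and "e" compare equal.
    text = unicodedata.normalize("NFKD", term)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = text.lower().strip()
    # Assumption: treat a possessive "'s" as noise ("Women's" -> "women").
    text = re.sub(r"['’]s\b", "", text)
    # Replace remaining punctuation with spaces ("Basketball-Women" -> two tokens).
    text = re.sub(r"[^\w\s]", " ", text)
    # Deduplicate and sort tokens so word order does not matter.
    return " ".join(sorted(set(text.split())))


def cluster_terms(terms):
    """Group terms sharing a fingerprint; return only groups of two or more."""
    groups = defaultdict(list)
    for term in terms:
        groups[fingerprint(term)].append(term)
    return [group for group in groups.values() if len(group) > 1]


if __name__ == "__main__":
    sample = [
        "Basketball - Women",
        "Basketball - Women's",
        "Basketball-Women",
        "Basketball-Women's",
        "Irwin, Ken",
        "Irwin, Kenneth",
    ]
    for cluster in cluster_terms(sample):
        print(cluster)
```

With this sample, the four Basketball variants end up in one cluster. The name variants ("Irwin, Ken" vs. "Irwin, Kenneth") do not, because their tokens really differ; catching those takes a fuzzier comparison such as an edit-distance or nearest-neighbour method, and the suggested clusters still need a human to review them before anything is merged.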