Following up on the OpenRefine suggestion, here's a blog post I wrote last year that describes using it to consolidate terms: http://acrl.ala.org/techconnect/?p=3276
Digital Services Librarian
Loyola University Chicago
From: Code for Libraries [mailto:[log in to unmask]] On Behalf Of Ken Irwin
Sent: Friday, March 21, 2014 1:25 PM
To: [log in to unmask]
Subject: [CODE4LIB] tool for finding close matches in vocabular list
I'm looking for a tool that can scan a list of all the subject terms in a poorly controlled index and flag possible candidates for term consolidation. Our student newspaper index has about 16,000 subject terms, and they include a lot of meaningless typographical and nomenclatural differences, e.g.:
Irwin, Mr. Kenneth
Irwin, Kenneth R.
Basketball - Women
Basketball - Women's
I would love to have some sort of pattern-matching tool that's smart about this sort of thing that could go through the list of terms (as a text list, database, xml file, or whatever structure it wants to ingest) and spit out some clusters of possible matches.
Does anyone know of a tool that's good for that sort of thing?
The index is just a bunch of MySQL tables - there is no real controlled-vocabulary system, though I've recently built some systems to suggest known subject headings (SHs) to reduce this sort of redundancy.
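
For a sense of what such a clustering pass might look like outside a dedicated tool, here is a minimal Python sketch using only the standard library. It is not OpenRefine's algorithm and not anything from the index itself: the function names `normalize` and `cluster_terms` and the 0.85 similarity threshold are assumptions made for illustration. Each term is lowercased, stripped of punctuation, and has its words sorted; terms whose normalized forms are similar enough (per `difflib.SequenceMatcher`) are then grouped.

```python
import re
from difflib import SequenceMatcher


def normalize(term):
    """Lowercase, drop apostrophes and punctuation, and sort the words."""
    tokens = re.findall(r"[a-z0-9]+", term.lower().replace("'", ""))
    return " ".join(sorted(tokens))


def cluster_terms(terms, threshold=0.85):
    """Greedily group terms whose normalized forms are similar.

    The threshold is an illustrative guess; tune it against real data.
    """
    clusters = []  # each entry: (normalized key of first member, [original terms])
    for term in terms:
        key = normalize(term)
        for rep_key, members in clusters:
            # Compare against the first member of each existing cluster.
            if SequenceMatcher(None, key, rep_key).ratio() >= threshold:
                members.append(term)
                break
        else:
            # No existing cluster was close enough, so start a new one.
            clusters.append((key, [term]))
    # Only clusters with more than one member are consolidation candidates.
    return [members for _, members in clusters if len(members) > 1]


if __name__ == "__main__":
    sample = [
        "Irwin, Mr. Kenneth",
        "Irwin, Kenneth R.",
        "Basketball - Women",
        "Basketball - Women's",
        "Football",
    ]
    for group in cluster_terms(sample):
        print(group)
```

This compares every term against every existing cluster, so on a list of roughly 16,000 terms it would likely be slow; it is meant to show the idea rather than to replace a purpose-built tool. The terms could be pulled from MySQL into a plain list first, and the resulting groups reviewed by hand before any merging.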