LISTSERV 16.5 - CODE4LIB Archives

As Roy suggests, OpenRefine is designed for this type of work and could easily handle the volume you are talking about here. It can cluster similar terms using a variety of algorithms and apply a set of standard transformations to them.

The screencasts and info at http://freeyourmetadata.org/cleanup/ might be a good starting point if you want to see what Refine can do.
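
To give a feel for what "clustering" means here, below is a rough Python sketch of a key-collision approach, loosely modelled on the "fingerprint" idea. It is not Refine's actual code: the function names (fingerprint, cluster_terms) are made up for illustration, and replacing punctuation with spaces is my own simplification so that "Basketball-Women" and "Basketball - Women" end up with the same key.

```python
import re
import unicodedata
from collections import defaultdict


def fingerprint(term):
    """Reduce a term to a normalised key (hypothetical helper, not Refine's code)."""
    # Decompose accented characters and drop the combining marks.
    text = unicodedata.normalize("NFKD", term)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Lowercase, then turn punctuation into spaces (an assumption of this sketch).
    text = re.sub(r"[^\w\s]", " ", text.lower())
    # Sort the unique tokens so word order and repeated words do not matter.
    return " ".join(sorted(set(text.split())))


def cluster_terms(terms):
    """Group terms that share a fingerprint; return only groups with 2+ members."""
    groups = defaultdict(list)
    for term in terms:
        groups[fingerprint(term)].append(term)
    return [members for members in groups.values() if len(members) > 1]


if __name__ == "__main__":
    sample = [
        "Irwin, Ken",
        "Irwin, Kenneth",
        "Basketball - Women",
        "Basketball - Women's",
        "Basketball-Women",
        "Basketball-Women's",
    ]
    for cluster in cluster_terms(sample):
        print(cluster)
```

On Ken's examples this kind of key would only catch the spacing and punctuation variants: "Basketball - Women" pairs with "Basketball-Women", and the two "Women's" forms pair with each other. Variants such as "Irwin, Ken" versus "Irwin, Kenneth" produce different keys, so they would need a fuzzier, distance-based method, which is one reason to try several of the clustering options rather than rely on a single one.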

Owen

Owen Stephens
Owen Stephens Consulting
Web: http://www.ostephens.com
Email: [log in to unmask]
Telephone: 0121 288 6936

On 21 Mar 2014, at 18:24, Ken Irwin <[log in to unmask]> wrote:

> Hi folks,
> 
> I'm looking for a tool that can look at a list of all of the subject terms in a poorly-controlled index and flag possible candidates for term consolidation. Our student newspaper index has about 16,000 subject terms, and they include a lot of meaningless typographical and nomenclatural differences, e.g.:
> 
> Irwin, Ken
> Irwin, Kenneth
> Irwin, Mr. Kenneth
> Irwin, Kenneth R.
> 
> Basketball - Women
> Basketball - Women's
> Basketball-Women
> Basketball-Women's
> 
> I would love to have some sort of pattern-matching tool that's smart about this sort of thing that could go through the list of terms (as a text list, database, xml file, or whatever structure it wants to ingest) and spit out some clusters of possible matches.
> 
> Does anyone know of a tool that's good for that sort of thing?
> 
> The index is just a bunch of MySQL tables - there is no real controlled-vocab system, though I've recently built some systems to suggest known SHs to reduce this sort of redundancy.
> 
> Any ideas?
> 
> Thanks!
> Ken