+1 for OpenRefine. Exactly what it is made to do.
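
For anyone curious what "key collision" clustering of this kind looks like, here is a rough Python sketch run against Ken's sample terms. It is only an approximation of the general fingerprint idea, not OpenRefine's actual code. The function names `fingerprint` and `cluster` are made up for illustration, and treating hyphens as word separators is my own assumption so that "Basketball-Women" and "Basketball - Women" land on the same key.

```python
import re
import unicodedata
from collections import defaultdict


def fingerprint(term):
    """Build a normalized key; terms with the same key are candidate duplicates."""
    # Fold accented characters down to plain ASCII where possible.
    term = unicodedata.normalize("NFKD", term)
    term = term.encode("ascii", "ignore").decode("ascii")
    term = term.strip().lower()
    # Assumption: hyphens separate words, so replace them with spaces.
    term = term.replace("-", " ")
    # Drop any remaining punctuation (commas, periods, apostrophes).
    term = re.sub(r"[^\w\s]", "", term)
    # Sort and de-duplicate tokens so word order does not matter.
    tokens = sorted(set(term.split()))
    return " ".join(tokens)


def cluster(terms):
    """Group terms by fingerprint and keep only groups with more than one member."""
    groups = defaultdict(list)
    for term in terms:
        groups[fingerprint(term)].append(term)
    return [group for group in groups.values() if len(group) > 1]


if __name__ == "__main__":
    sample = [
        "Irwin, Ken",
        "Irwin, Kenneth",
        "Irwin, Mr. Kenneth",
        "Irwin, Kenneth R.",
        "Basketball - Women",
        "Basketball - Women's",
        "Basketball-Women",
        "Basketball-Women's",
    ]
    for group in cluster(sample):
        print(group)
```

With this key, the spaced and unspaced "Basketball" variants pair up, but "Women" and "Women's" stay in separate clusters, and none of the "Irwin" forms collide because each has a different set of tokens. Variants like those need a looser, edit-distance style of matching, so expect to review the suggested clusters by hand either way.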
Chad
On Fri, Mar 21, 2014 at 2:29 PM, Roy Tennant <[log in to unmask]> wrote:
> Have you considered dumping it into OpenRefine? [1] I haven't used it a
> lot, but it is likely a good tool to find similar data and allow you to
> globally replace with a canonical entry.
> Roy
>
> [1] http://openrefine.org/
>
>
> On Fri, Mar 21, 2014 at 11:24 AM, Ken Irwin <[log in to unmask]> wrote:
>
> > Hi folks,
> >
> > I'm looking for a tool that can look at a list of all of the subject terms in
> > a poorly-controlled index as possible candidates for term consolidation.
> > Our student newspaper index has about 16,000 subject terms and they include
> > a lot of meaningless typographical and nomenclatural differences, e.g.:
> >
> > Irwin, Ken
> > Irwin, Kenneth
> > Irwin, Mr. Kenneth
> > Irwin, Kenneth R.
> >
> > Basketball - Women
> > Basketball - Women's
> > Basketball-Women
> > Basketball-Women's
> >
> > I would love to have some sort of pattern-matching tool that's smart about
> > this sort of thing that could go through the list of terms (as a text list,
> > database, xml file, or whatever structure it wants to ingest) and spit out
> > some clusters of possible matches.
> >
> > Does anyone know of a tool that's good for that sort of thing?
> >
> > The index is just a bunch of MySQL tables - there is no real
> > controlled-vocab system, though I've recently built some systems to suggest
> > known SH's to reduce this sort of redundancy.
> >
> > Any ideas?
> >
> > Thanks!
> > Ken
> >
>