If you have no opposition to Python, I suggest looking at Fuzzywuzzy:
https://github.com/seatgeek/fuzzywuzzy
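
As a rough illustration of the kind of matching involved, here is a minimal sketch that uses difflib from the Python standard library rather than Fuzzywuzzy itself. The helper names (normalize, similarity, cluster_terms) and the 0.7 threshold are assumptions for illustration, not part of either library, and the threshold would need tuning against the real term list:

```python
import re
from difflib import SequenceMatcher


def normalize(term):
    """Lowercase, replace runs of punctuation with a space, and trim."""
    cleaned = re.sub(r"[^a-z0-9]+", " ", term.lower())
    return cleaned.strip()


def similarity(a, b):
    """Return a 0..1 similarity score between two normalized terms."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def cluster_terms(terms, threshold=0.7):
    """Greedy single-pass clustering (hypothetical helper).

    Each term joins the first cluster whose first member scores at or
    above the threshold; otherwise it starts a new cluster.
    """
    clusters = []
    for term in terms:
        for cluster in clusters:
            if similarity(term, cluster[0]) >= threshold:
                cluster.append(term)
                break
        else:
            clusters.append([term])
    return clusters


# Sample terms taken from the original question.
terms = [
    "Irwin, Ken",
    "Irwin, Kenneth",
    "Irwin, Mr. Kenneth",
    "Irwin, Kenneth R.",
    "Basketball - Women",
    "Basketball - Women's",
    "Basketball-Women",
    "Basketball-Women's",
]

clusters = cluster_terms(terms)
for cluster in clusters:
    print(cluster)
```

The clustering is greedy and compares each term only against the first member of a cluster, so results depend on input order. It is meant for producing candidate groups to review by hand, not for merging terms automatically.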
M.
--
Mark A. Matienzo <[log in to unmask]>
Director of Technology, Digital Public Library of America
On Fri, Mar 21, 2014 at 2:34 PM, Andrew Gordon <[log in to unmask]> wrote:
> Ken,
>
> A group in Chicago has been working for a few years now on a deduplication
> toolkit that might do what you are looking for. They also have a couple of
> versions that work with an Excel file or .csv file.
>
> https://github.com/datamade/dedupe
> https://github.com/datamade/dedupe-web
> https://github.com/datamade/csvdedupe
>
> I have not worked with them extensively, but I have heard others find
> these very useful for entity recognition and resolution.
>
> -----Original Message-----
> From: Code for Libraries [mailto:[log in to unmask]] On Behalf Of
> Ken Irwin
> Sent: Friday, March 21, 2014 2:25 PM
> To: [log in to unmask]
> Subject: [CODE4LIB] tool for finding close matches in vocabular list
>
> Hi folks,
>
> I'm looking for a tool that can look at a list of all the subject terms in
> a poorly-controlled index and identify possible candidates for term
> consolidation. Our student newspaper index has about 16,000 subject terms,
> and they include a lot of meaningless typographical and nomenclatural
> differences, e.g.:
>
> Irwin, Ken
> Irwin, Kenneth
> Irwin, Mr. Kenneth
> Irwin, Kenneth R.
>
> Basketball - Women
> Basketball - Women's
> Basketball-Women
> Basketball-Women's
>
> I would love to have some sort of pattern-matching tool that's smart about
> this sort of thing that could go through the list of terms (as a text list,
> database, xml file, or whatever structure it wants to ingest) and spit out
> some clusters of possible matches.
>
> Does anyone know of a tool that's good for that sort of thing?
>
> The index is just a bunch of MySQL tables - there is no real
> controlled-vocab system, though I've recently built some systems to suggest
> known subject headings (SHs) to reduce this sort of redundancy.
>
> Any ideas?
>
> Thanks!
> Ken
>