It turns out it's straightforward to reimplement the default fingerprinting
algorithm that OpenRefine uses. We did that here to help catch those sorts
of trivial spelling differences in user searches in order to provide
best-bet suggestions for some of our most popular stuff. Here's my
reimplementation; have fun:
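A minimal sketch following the steps OpenRefine documents for its default fingerprint keyer (trim, lowercase, fold accented characters to ASCII, strip punctuation, then dedupe, sort, and rejoin the whitespace tokens); function and variable names here are my own:

```python
import re
import unicodedata

def fingerprint(s: str) -> str:
    """OpenRefine-style fingerprint key for a string."""
    s = s.strip().lower()
    # fold accented characters to their ASCII base forms
    s = unicodedata.normalize("NFKD", s)
    s = s.encode("ascii", "ignore").decode("ascii")
    # drop punctuation, keeping word characters and whitespace
    s = re.sub(r"[^\w\s]", "", s)
    # tokenize on whitespace, dedupe, sort, rejoin
    return " ".join(sorted(set(s.split())))
```

Strings that differ only in case, punctuation, accents, token order, or repeated tokens collapse to the same key, e.g. `fingerprint("South Bend;")` and `fingerprint("south BEND")` are equal.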
Once you have a cluster of strings with a common fingerprint, you'd need to
pick a canonical form for everything in that cluster, since the fingerprint
itself isn't a thing you'd want to expose to humans.
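One simple, fully automatic policy is to take the most frequent original spelling in each cluster as the canonical form. A sketch under that assumption (the most-frequent rule and the names are mine; OpenRefine itself leaves the choice of canonical value to the user):

```python
import re
import unicodedata
from collections import Counter, defaultdict

def fingerprint(s: str) -> str:
    """OpenRefine-style fingerprint key: lowercase, ASCII-fold,
    strip punctuation, then dedupe and sort the tokens."""
    s = unicodedata.normalize("NFKD", s.strip().lower())
    s = s.encode("ascii", "ignore").decode("ascii")
    s = re.sub(r"[^\w\s]", "", s)
    return " ".join(sorted(set(s.split())))

def canonical_forms(values):
    """Group strings by fingerprint and map each cluster's key
    to its most common original spelling."""
    clusters = defaultdict(list)
    for v in values:
        clusters[fingerprint(v)].append(v)
    return {key: Counter(group).most_common(1)[0][0]
            for key, group in clusters.items()}
```

For example, `canonical_forms(["South Bend", "South Bend", "south bend;", "Notre Dame"])` would pick "South Bend" as the canonical form of that cluster, since it occurs most often.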
On Wed, Oct 25, 2017 at 11:57 AM, Eric Lease Morgan <[log in to unmask]> wrote:
> Has anybody here played with any clustering techniques for normalizing
> bibliographic data?
> My bibliographic data is fraught with inconsistencies. For example, a
> publisher’s name may be recorded one way, another way, or a third way. The
> same goes for things like publisher place: South Bend; South Bend, IN;
> South Bend, Ind. And then there is the ISBD punctuation that is sometimes
> applied and sometimes not. All of these inconsistencies make indexing &
> faceted browsing more difficult than it needs to be.
> OpenRefine is a really good program for finding these inconsistencies and
> then normalizing them. OpenRefine calls this process “clustering”, and it
> points to a nice page describing the various clustering processes.  Some
> of the techniques included “fingerprinting” and calculating “nearest
> neighbors”. Unfortunately, OpenRefine is not really programmable, and I’d
> like to automate much of this process.
> Does anybody here have any experience automating the process of normalizing
> bibliographic (MARC) data?
>  about clustering - http://bit.ly/2izQarE
> Eric Morgan
Senior Software Engineer, MIT Libraries: https://libraries.mit.edu/
President, Library & Information Technology Association: http://www.lita.org