I actually love the approach Mark writes about here. It was partly what inspired me to do this work in MarcEdit -- albeit in a light-weight way -- so as not to incur any additional dependencies. --tr

On Wed, Oct 25, 2017 at 12:23 PM, Phillips, Mark <[log in to unmask]> wrote:

> Of possible interest is some work we've done to take the clustering
> capabilities of OpenRefine and bake them into our metadata editing
> interface for The Portal to Texas History and the UNT Digital Library.
>
> We've focused a bit on interfaces which might be of interest. I've
> written a bit about it in this post: http://vphill.com/journal/post/6173/
>
> We are generating the clusters on facets from Solr.
>
> Mark
> ________________________________________
> From: Code for Libraries <[log in to unmask]> on behalf of Péter
> Király <[log in to unmask]>
> Sent: Wednesday, October 25, 2017 11:18:52 AM
> To: [log in to unmask]
> Subject: [EXT] Re: [CODE4LIB] clustering techniques for normalizing
> bibliographic data
>
> Hi Eric,
>
> I am planning to work on detecting such anomalies. The approaches I
> have thought about so far are:
> - n-gram analysis
> - basket analysis
> - Solr's similarity detection
> - finite state automata
>
> The tools I will use are Apache Solr and Apache Spark. I haven't
> started the implementation yet.
>
> Best,
> Péter
>
>
> 2017-10-25 17:57 GMT+02:00 Eric Lease Morgan <[log in to unmask]>:
> > Has anybody here played with any clustering techniques for normalizing
> > bibliographic data?
> >
> > My bibliographic data is fraught with inconsistencies. For example, a
> > publisher's name may be recorded one way, another way, or a third way.
> > The same goes for things like publisher place: South Bend; South Bend,
> > IN; South Bend, Ind. And then there is the ISBD punctuation that is
> > sometimes applied and sometimes not. All of these inconsistencies make
> > indexing & faceted browsing more difficult than it needs to be.
> >
> > OpenRefine is a really good program for finding these inconsistencies
> > and then normalizing them. OpenRefine calls this process "clustering",
> > and it points to a nice page describing the various clustering
> > processes. [1] Some of the techniques included "fingerprinting" and
> > calculating "nearest neighbors". Unfortunately, OpenRefine is not
> > really programmable, and I'd like to automate much of this process.
> >
> > Does anybody here have any experience automating the process of
> > normalizing bibliographic (MARC) data?
> >
> > [1] about clustering - http://bit.ly/2izQarE
> >
> > —
> > Eric Morgan
>
> --
> Péter Király
> software developer
> GWDG, Göttingen - Europeana - eXtensible Catalog - The Code4Lib Journal
> http://linkedin.com/in/peterkiraly
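For anyone wanting to script this outside of OpenRefine: the "fingerprint" key-collision method described on the clustering page Eric links to is straightforward to reimplement. Below is a minimal Python sketch (function and variable names are my own, not from any of the tools mentioned in the thread). It normalizes each value to a key -- trimmed, lowercased, punctuation stripped, tokens deduplicated and sorted -- and groups values whose keys collide. Note that it catches case and punctuation variants ("South Bend" vs. "south bend,") but not abbreviation variants ("IN" vs. "Ind."), which is where nearest-neighbor or n-gram methods come in.

```python
import re
from collections import defaultdict

def fingerprint(value: str) -> str:
    """Reduce a string to a normalized key: trim, lowercase,
    strip punctuation, then deduplicate and sort the tokens."""
    value = value.strip().lower()
    value = re.sub(r"[^\w\s]", "", value)   # drop punctuation
    tokens = sorted(set(value.split()))     # dedupe + sort tokens
    return " ".join(tokens)

def cluster(values):
    """Group values whose fingerprints collide; return only
    groups with more than one member (i.e., actual clusters)."""
    groups = defaultdict(list)
    for v in values:
        groups[fingerprint(v)].append(v)
    return [g for g in groups.values() if len(g) > 1]

places = ["South Bend", "South Bend, IN", "south bend,", "South Bend, Ind."]
print(cluster(places))   # "South Bend" and "south bend," share a key
```

In a real batch-normalization pass you would then pick a canonical form for each cluster (OpenRefine defaults to the most frequent member) and rewrite the records accordingly.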