Hi Eric,

I am planning to work on detecting such anomalies. So far I have
thought about the following approaches:
- n-gram analysis (a rough sketch follows below)
- basket analysis
- similarity detection in Solr
- finite state automata

The tools I will use are Apache Solr and Apache Spark. I haven't
started the implementation yet.
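
To make the n-gram idea concrete, here is a rough, untested sketch in
plain Python. The trigram size, the 0.5 threshold, and the greedy
single-pass clustering are all illustrative assumptions, not settled
design choices; a real run would go over the full set of field values
(for example, as a Spark job):

def ngrams(text, n=3):
    # Normalize whitespace and case, then take character n-grams.
    t = " ".join(text.lower().split())
    return {t[i:i + n] for i in range(len(t) - n + 1)}

def jaccard(a, b):
    # Jaccard similarity of two n-gram sets.
    u = a | b
    return len(a & b) / len(u) if u else 0.0

def cluster(values, threshold=0.5):
    # Greedy single pass: each value joins the first cluster whose
    # representative (the first member's n-grams) is similar enough.
    clusters = []
    for v in values:
        g = ngrams(v)
        for rep, members in clusters:
            if jaccard(rep, g) >= threshold:
                members.append(v)
                break
        else:
            clusters.append((g, [v]))
    return [members for _, members in clusters]

places = ["South Bend", "South Bend, IN", "South Bend, Ind."]
print(cluster(places))
# -> [['South Bend', 'South Bend, IN', 'South Bend, Ind.']]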

Best,
Péter


2017-10-25 17:57 GMT+02:00 Eric Lease Morgan <[log in to unmask]>:
> Has anybody here played with any clustering techniques for normalizing bibliographic data?
>
> My bibliographic data is fraught with inconsistencies. For example, a publisher’s name may be recorded one way, another way, or a third way. The same goes for things like publisher place: South Bend; South Bend, IN; South Bend, Ind. And then there is the ISBD punctuation that is sometimes applied and sometimes not. All of these inconsistencies make indexing & faceted browsing more difficult than it needs to be.
>
> OpenRefine is a really good program for finding these inconsistencies and then normalizing them. OpenRefine calls this process “clustering”, and it points to a nice page describing the various clustering processes. [1] Some of the techniques included “fingerprinting” and calculating “nearest neighbors”. Unfortunately, OpenRefine is not really programmable, and I’d like to automate much of this process.
>
> Does anybody here have any experience automating the process of normalizing bibliographic (MARC) data?
>
> [1] about clustering - http://bit.ly/2izQarE
>
> —
> Eric Morgan
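
For reference, OpenRefine's "fingerprint" keying mentioned above can be
approximated in a few lines of Python. This is a rough sketch; the exact
normalization steps (lowercase, strip punctuation, sort and dedupe the
tokens) are assumptions based on OpenRefine's clustering documentation.
Values whose keys collide become candidates for the same cluster:

import string
from collections import defaultdict

def fingerprint(value):
    # Lowercase, strip ASCII punctuation, sort and dedupe tokens.
    v = value.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(sorted(set(v.split())))

def cluster_by_fingerprint(values):
    # Bucket values by fingerprint; buckets with more than one
    # member are candidate clusters for normalization.
    buckets = defaultdict(list)
    for v in values:
        buckets[fingerprint(v)].append(v)
    return [vs for vs in buckets.values() if len(vs) > 1]

publishers = ["Oxford University Press", "oxford university press.",
              "University Press, Oxford"]
print(cluster_by_fingerprint(publishers))
# -> [['Oxford University Press', 'oxford university press.',
#      'University Press, Oxford']]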



-- 
Péter Király
software developer
GWDG, Göttingen - Europeana - eXtensible Catalog - The Code4Lib Journal
http://linkedin.com/in/peterkiraly