I am planning to work on detecting such anomalies. So far I have
thought about the following approaches:
- n-gram analysis
- basket analysis
- similarity detection with Solr
- finite state automaton
The tools I will use are Apache Solr and Apache Spark. I haven't started
the implementation yet.
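To make the n-gram idea a bit more concrete, here is a minimal Python sketch of key-collision clustering, loosely following the "fingerprint" and "n-gram fingerprint" methods described on the OpenRefine clustering page. It is only an illustration under my own assumptions: the function names and the sample values are hypothetical, and this is not the Solr/Spark implementation.

```python
import re
import unicodedata
from collections import defaultdict


def _fold(value):
    """Lowercase the value and strip accents (hypothetical helper)."""
    text = unicodedata.normalize("NFKD", value)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    return text.lower()


def fingerprint(value):
    """Token-based key: punctuation removed, tokens deduplicated and sorted."""
    text = re.sub(r"[^\w\s]", "", _fold(value))
    return " ".join(sorted(set(text.split())))


def ngram_fingerprint(value, n=2):
    """Character n-gram key: whitespace and punctuation removed first."""
    text = re.sub(r"[\W_]", "", _fold(value))
    grams = {text[i:i + n] for i in range(len(text) - n + 1)}
    return "".join(sorted(grams))


def cluster(values, key=fingerprint):
    """Group values that share a key; keep only groups with 2+ members."""
    groups = defaultdict(list)
    for value in values:
        groups[key(value)].append(value)
    return [group for group in groups.values() if len(group) > 1]


# Assumed sample data, modeled on the publisher place example.
places = [
    "South Bend",
    "South Bend :",
    "south bend.",
    "Bend, South",
    "South Bend, IN",
    "South Bend, Ind.",
]

if __name__ == "__main__":
    for group in cluster(places):
        print(group)
```

In this sketch the first four "South Bend" variants collapse into one cluster, while "South Bend, IN" and "South Bend, Ind." keep distinct keys. Merging those would need something beyond key collision, for example a nearest-neighbour distance or a lookup table of state abbreviations.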
2017-10-25 17:57 GMT+02:00 Eric Lease Morgan <[log in to unmask]>:
> Has anybody here played with any clustering techniques for normalizing bibliographic data?
> My bibliographic data is fraught with inconsistencies. For example, a publisher’s name may be recorded one way, another way, or a third way. The same goes for things like publisher place: South Bend; South Bend, IN; South Bend, Ind. And then there is the ISBD punctuation that is sometimes applied and sometimes not. All of these inconsistencies make indexing & faceted browsing more difficult than it needs to be.
> OpenRefine is a really good program for finding these inconsistencies and then normalizing them. OpenRefine calls this process “clustering”, and it points to a nice page describing the various clustering processes. Some of the techniques include “fingerprinting” and calculating “nearest neighbors”. Unfortunately, OpenRefine is not really programmable, and I’d like to automate much of this process.
> Does anybody here have any experience automating the process of normalizing bibliographic (MARC) data?
>  about clustering - http://bit.ly/2izQarE
> Eric Morgan
GWDG, Göttingen - Europeana - eXtensible Catalog - The Code4Lib Journal