I actually love the approach Mark writes about here. It was partly what inspired me to do this work in MarcEdit -- albeit in a light-weight way -- so as not to incur any additional dependencies. --tr

On Wed, Oct 25, 2017 at 12:23 PM, Phillips, Mark <[log in to unmask]> wrote:

> Of possible interest is some work we've done to take the clustering
> capabilities of OpenRefine and bake them into our metadata editing
> interface for The Portal to Texas History and the UNT Digital Library.
>
> We've focused a bit on interfaces which might be of interest. I've
> written a bit about it in this post: http://vphill.com/journal/post/6173/
>
> We are generating the clusters on facets from Solr.
>
> Mark
> ________________________________________
> From: Code for Libraries <[log in to unmask]> on behalf of Péter
> Király <[log in to unmask]>
> Sent: Wednesday, October 25, 2017 11:18:52 AM
> To: [log in to unmask]
> Subject: [EXT] Re: [CODE4LIB] clustering techniques for normalizing
> bibliographic data
>
> Hi Eric,
>
> I am planning to work on detecting such anomalies. The approaches I
> have thought about so far are:
> - n-gram analysis
> - basket analysis
> - Solr's similarity detection
> - finite state automata
>
> The tools I will use are Apache Solr and Apache Spark. I haven't
> started the implementation yet.
>
> Best,
> Péter
>
>
> 2017-10-25 17:57 GMT+02:00 Eric Lease Morgan <[log in to unmask]>:
> > Has anybody here played with any clustering techniques for normalizing
> > bibliographic data?
> >
> > My bibliographic data is fraught with inconsistencies. For example, a
> > publisher's name may be recorded one way, another way, or a third way.
> > The same goes for things like publisher place: South Bend; South Bend,
> > IN; South Bend, Ind. And then there is the ISBD punctuation that is
> > sometimes applied and sometimes not. All of these inconsistencies make
> > indexing & faceted browsing more difficult than it needs to be.
> >
> > OpenRefine is a really good program for finding these inconsistencies
> > and then normalizing them. OpenRefine calls this process "clustering",
> > and it points to a nice page describing the various clustering
> > processes. [1] Some of the techniques included "fingerprinting" and
> > calculating "nearest neighbors". Unfortunately, OpenRefine is not
> > really programmable, and I'd like to automate much of this process.
> >
> > Does anybody here have any experience automating the process of
> > normalizing bibliographic (MARC) data?
> >
> > [1] about clustering - http://bit.ly/2izQarE
> >
> > —
> > Eric Morgan
>
> --
> Péter Király
> software developer
> GWDG, Göttingen - Europeana - eXtensible Catalog - The Code4Lib Journal
> http://linkedin.com/in/peterkiraly
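For anyone wanting to script this outside of OpenRefine: the "fingerprint" key-collision method described on the clustering page Eric links to is straightforward to reimplement. Below is a minimal Python sketch (function and variable names are my own, not from any of the tools mentioned in the thread). It normalizes each value to a key -- trimmed, lowercased, punctuation stripped, tokens deduplicated and sorted -- and groups values whose keys collide. Note that it catches case and punctuation variants ("South Bend" vs. "south bend,") but not abbreviation variants ("IN" vs. "Ind."), which is where nearest-neighbor or n-gram methods come in.

```python
import re
from collections import defaultdict

def fingerprint(value: str) -> str:
    """Reduce a string to a normalized key: trim, lowercase,
    strip punctuation, then deduplicate and sort the tokens."""
    value = value.strip().lower()
    value = re.sub(r"[^\w\s]", "", value)   # drop punctuation
    tokens = sorted(set(value.split()))     # dedupe + sort tokens
    return " ".join(tokens)

def cluster(values):
    """Group values whose fingerprints collide; return only
    groups with more than one member (i.e., actual clusters)."""
    groups = defaultdict(list)
    for v in values:
        groups[fingerprint(v)].append(v)
    return [g for g in groups.values() if len(g) > 1]

places = ["South Bend", "South Bend, IN", "south bend,", "South Bend, Ind."]
print(cluster(places))   # "South Bend" and "south bend," share a key
```

In a real batch-normalization pass you would then pick a canonical form for each cluster (OpenRefine defaults to the most frequent member) and rewrite the records accordingly.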