LISTSERV 16.5 - CODE4LIB Archives

On Jan 29, 2018, at 5:30 PM, Tom Hutchinson wrote:

> My workflow for deduping has been:
> 
>   -Export a big batch of MARC records
> 
>   -Load the file into a program
> 
>   -Have the program process the records one at a time
> 
>   -For each record, load it into a data structure. Organize the data
> structure so that duplicate records are all stored together. This may
> require additional logic.
> 
>   -Once all the records are loaded into the data structure, go through
> the data structure and process each cluster of duplicates. Pick one
> record from each cluster and write it to a new data structure used for
> output.
> 
>   -Go through the output data structure and write those records out to a file
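
The workflow quoted above might be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the records are plain dicts standing in for parsed MARC records, the matching rule (normalized title plus year) is hypothetical, and reading and writing the actual MARC file is left out.

```python
from collections import defaultdict

def dedupe(records, key_func, pick=lambda cluster: cluster[0]):
    """Group records by key_func, then keep one record per cluster."""
    clusters = defaultdict(list)          # key -> list of duplicate records
    for record in records:                # process the records one at a time
        clusters[key_func(record)].append(record)
    # Pick one record from each cluster for the output structure.
    return [pick(cluster) for cluster in clusters.values()]

# Hypothetical records: plain dicts standing in for parsed MARC records.
records = [
    {"id": "1", "title": "Walden", "year": "1854"},
    {"id": "2", "title": "walden ", "year": "1854"},
    {"id": "3", "title": "Emma", "year": "1815"},
]

def match_key(record):
    # Assumed matching rule: same normalized title and same year.
    return (record["title"].strip().lower(), record["year"])

unique = dedupe(records, match_key)
print([r["id"] for r in unique])
```

The "additional logic" mentioned in the quoted message would live in match_key (what counts as a duplicate) and in pick (which copy to keep).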


Tom, thank you for sharing. I’ve taken a different de-duplication tack.

More specifically, to de-duplicate items from a HathiTrust collection, I:

  1) create a collection, trying to build it from a small number of
     libraries, on the assumption that each library will hold a
     limited number of similar items

  2) assume the items in my HathiTrust collection have more
     things in common than differences; assume a given duplicated
     item is just as good as another duplicated item (which we all know
     is not really true)

  3) pour my HathiTrust collection (which is a tab-delimited/CSV file)
     into OpenRefine [1]

  4) using OpenRefine's clustering algorithms, “normalize” as many
     titles as possible; that is, use things like Levenshtein
     distance to “correct” differences [2]

  5) delete duplicate items [3]

  6) possibly use OpenRefine's global find/replace tools to manually do
     more normalization

  7) repeat from Step #4 until tired

  8) sort the collection by title, and manually flag duplicate items

  9) delete flagged items

 10) done
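
Outside of OpenRefine, the mechanical parts of the list (reading the tab-delimited file, deleting rows whose title exactly matches an earlier row, and sorting by title for manual review) could be sketched with Python's standard library. The sample data and the column names "id" and "title" are assumptions for illustration, not the layout of a real HathiTrust export.

```python
import csv
import io

# Hypothetical tab-delimited export; the column names are assumptions.
SAMPLE = (
    "id\ttitle\n"
    "a1\tWalden\n"
    "b2\tEmma\n"
    "c3\tWalden\n"
)

def drop_exact_duplicates(rows, column):
    """Keep the first row seen for each distinct value in `column`."""
    seen = set()
    kept = []
    for row in rows:
        value = row[column]
        if value not in seen:
            seen.add(value)
            kept.append(row)
    return kept

rows = list(csv.DictReader(io.StringIO(SAMPLE), delimiter="\t"))
kept = drop_exact_duplicates(rows, "title")
kept.sort(key=lambda row: row["title"])   # sorted by title for manual review
print([row["id"] for row in kept])
```

This only catches exact matches, which is why the normalization step matters: the more titles that have been normalized to the same string, the more duplicates an exact comparison will find.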

If and when I learn how to implement Step #4 programmatically, I think I will be able to automate the whole process.
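
For what it is worth, one possible starting point for Step #4 is sketched below. It is not OpenRefine's code. The fingerprint function is loosely modeled on the key-collision idea described in the clustering documentation [2], and the merge step uses a plain Levenshtein distance with an arbitrary, assumed threshold of 2. The function names and sample titles are hypothetical.

```python
import string
import unicodedata
from collections import defaultdict

def fingerprint(title):
    """Clustering key that ignores case, punctuation, accents and word order."""
    text = unicodedata.normalize("NFKD", title.strip().lower())
    text = text.encode("ascii", "ignore").decode("ascii")   # drop accents
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(sorted(set(text.split())))

def levenshtein(a, b):
    """Edit distance between two strings (row-by-row dynamic programming)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution
            ))
        previous = current
    return previous[-1]

def cluster_titles(titles, max_distance=2):
    """Group titles by fingerprint, then merge groups with nearby keys."""
    by_key = defaultdict(list)
    for title in titles:
        by_key[fingerprint(title)].append(title)
    clusters = []   # list of (representative key, titles in the cluster)
    for key, members in by_key.items():
        for rep, group in clusters:
            if levenshtein(key, rep) <= max_distance:
                group.extend(members)
                break
        else:
            clusters.append((key, list(members)))
    return [group for _, group in clusters]

titles = ["Walden; or, Life in the woods", "Walden, or Life in the Woods.",
          "Walden, or Life in the Wods", "Emma"]
for group in cluster_titles(titles):
    print(group)
```

The merge step is greedy and compares every new key against every existing cluster, so its results depend on input order and it will be slow on large collections. A human would probably still want to review each cluster before picking the title to keep.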

[1] OpenRefine - http://openrefine.org
[2] clustering - https://github.com/OpenRefine/OpenRefine/wiki/Clustering
[3] “Removing duplicate rows when Exact values are found in a column” - https://github.com/OpenRefine/OpenRefine/wiki/Recipes

—
Eric Morgan