On Jan 29, 2018, at 5:30 PM, Tom Hutchinson wrote:
> My workflow for deduping has been:
>
> -Export a big batch of MARC records
>
> -Load the file into a program
>
> -Have the program process the records one at a time
>
> -For each record, load them into a data structure. Organize the data
> structure so that duplicate records are all stored together. This may
> require additional logic.
>
> -Once all the records are loaded into the data structure, go through
> the data structure and process each cluster of duplicates. Pick one
> record from each cluster and write it to a new data structure used for
> output.
>
> -Go through the output data structure and write those records out to a file
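The workflow quoted above might be sketched roughly as follows in Python. This is only an illustration under stated assumptions: the records are plain dictionaries rather than real MARC records (parsing MARC would need a separate library, not shown), and the names dedupe and title_key are hypothetical. The "additional logic" for deciding what counts as a duplicate lives entirely in the key function.

```python
from collections import defaultdict


def dedupe(records, key_func):
    """Group records by key_func, then keep one record from each cluster."""
    clusters = defaultdict(list)
    for record in records:
        # Records that share a key are stored together in one cluster.
        clusters[key_func(record)].append(record)
    # Pick the first record seen in each cluster (an arbitrary choice).
    return [cluster[0] for cluster in clusters.values()]


def title_key(record):
    """Hypothetical matching rule: lowercase the title and collapse whitespace."""
    return " ".join(record["title"].lower().split())


# Stand-in data; a real run would load these from the exported batch.
records = [
    {"id": "1", "title": "Walden"},
    {"id": "2", "title": "walden "},
    {"id": "3", "title": "Civil Disobedience"},
]

unique = dedupe(records, title_key)
for record in unique:
    print(record["id"], record["title"])
```

Keeping the first record of each cluster is the simplest possible choice; a fuller version might pick the record with the most complete fields instead.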
Tom, thank you for sharing. I’ve taken a different de-duplication tack.
More specifically, to de-duplicate items from a HathiTrust collection, I:
1) create a collection while trying to build it from a smaller number of
libraries with the assumption that each library will hold a
limited number of similar items
2) assume the items in my HathiTrust collection have more
things in common than differences; assume a given duplicated
item is just as good as another duplicated item (which we all know
is not really true)
3) pour my HathiTrust collection (which is a tab-delimited/CSV file)
into OpenRefine [1]
4) using OpenRefine's clustering algorithms, “normalize” as many
titles as possible, meaning, use things like the Levenshtein
Algorithm to “correct” differences [2]
5) delete duplicate items [3]
6) possibly use OpenRefine's global find/replace tools to manually do
more normalization
7) go to Step #4 until tired
8) sort the collection by title, and manually flag duplicate items
9) delete flagged items
10) done
When and if I learn how to programmatically implement Step #4, then I think I will be able to automate the whole process.
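One possible starting point for Step #4, sketched in Python with only the standard library. This is not OpenRefine's code; it is a rough analogue that first reduces each title to a "fingerprint" (lowercased, punctuation removed, tokens sorted and de-duplicated) and then groups titles whose fingerprints are within a small Levenshtein distance of each other. The function names and the max_distance threshold of 2 are assumptions made for illustration.

```python
import string


def fingerprint(title):
    """Lowercase, strip ASCII punctuation, then sort and de-duplicate the tokens."""
    cleaned = title.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(sorted(set(cleaned.split())))


def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def cluster_titles(titles, max_distance=2):
    """Greedily group titles whose fingerprints are within max_distance edits."""
    clusters = []  # list of (representative fingerprint, member titles)
    for title in titles:
        key = fingerprint(title)
        for representative, members in clusters:
            if levenshtein(key, representative) <= max_distance:
                members.append(title)
                break
        else:
            # No existing cluster was close enough, so start a new one.
            clusters.append((key, [title]))
    return [members for _, members in clusters]


titles = [
    "Walden; or, Life in the Woods",
    "Walden, or life in the woods",
    "Waldon, or Life in the Woods",
    "Civil Disobedience",
]
for group in cluster_titles(titles):
    print(group)
```

Each cluster could then be collapsed to a single "normalized" title before deleting duplicates. The greedy comparison against every existing cluster is quadratic in the worst case, so a large collection would probably need some form of blocking first, and the threshold would need tuning by eye, much as in OpenRefine.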
[1] OpenRefine - http://openrefine.org
[2] clustering - https://github.com/OpenRefine/OpenRefine/wiki/Clustering
[3] “Removing duplicate rows when Exact values are found in a column” - https://github.com/OpenRefine/OpenRefine/wiki/Recipes
—
Eric Morgan