On Jan 29, 2018, at 5:30 PM, Tom Hutchinson <[log in to unmask]> wrote:

> My workflow for deduping has been:
>
> -Export a big batch of MARC records
>
> -Load the file into a program
>
> -Have the program process the records one at a time
>
> -For each record, load them into a data structure. Organize the data
> structure so that duplicate records are all stored together. This may
> require additional logic.
>
> -Once all the records are loaded into the data structure, go through
> the data structure and process each cluster of duplicates. Pick one
> record from each cluster and write it to a new data structure used for
> output.
>
> -Go through the output data structure and write those records out to a file

Tom, thank you for sharing. I’ve taken a different de-duplication tack. More specifically, to de-duplicate items from a HathiTrust collection, I:

  1) create the collection, trying to build it from a small number of libraries, on the assumption that each library will hold a limited number of similar items

  2) assume the items in my HathiTrust collection have more things in common than differences; assume a given duplicated item is just as good as another duplicated item (which we all know is not really true)

  3) pour my HathiTrust collection (which is a tab-delimited/CSV file) into OpenRefine [1]

  4) using OpenRefine's clustering algorithms, “normalize” as many titles as possible, meaning, use things like the Levenshtein algorithm to “correct” differences [2]

  5) delete duplicate items [3]

  6) possibly use OpenRefine's global find/replace tools to manually do more normalization

  7) go to Step #4 until tired

  8) sort the collection by title, and manually flag duplicate items

  9) delete flagged items

  10) done

When and if I learn how to programmatically implement Step #4, then I think I will be able to automate the whole process.
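For what it is worth, here is a minimal sketch of how Step #4 might be approached programmatically, written in Python with only the standard library. It is an illustration under assumptions, not OpenRefine's actual implementation: the fingerprint function is only loosely modeled on the key-collision "fingerprint" idea described in [2], the greedy merging of nearby keys is my own simplification, and the names fingerprint, levenshtein, cluster_titles, and max_distance are all hypothetical.

```python
import re
import unicodedata
from collections import defaultdict


def fingerprint(title):
    """Build a normalized key so that near-identical titles collide."""
    # Fold accented characters to their closest ASCII equivalents.
    text = unicodedata.normalize("NFKD", title)
    text = text.encode("ascii", "ignore").decode("ascii")
    # Lowercase, then replace punctuation with spaces.
    text = re.sub(r"[^\w\s]", " ", text.lower())
    # Sort the unique tokens so that word order does not matter.
    return " ".join(sorted(set(text.split())))


def levenshtein(a, b):
    """Dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + (char_a != char_b),  # substitution
            ))
        previous = current
    return previous[-1]


def cluster_titles(titles, max_distance=2):
    """Group titles by fingerprint, then merge groups whose keys are close.

    max_distance is an assumed threshold; it would need tuning per collection.
    """
    groups = defaultdict(list)
    for title in titles:
        groups[fingerprint(title)].append(title)

    clusters = []  # pairs of (representative key, member titles)
    for key in sorted(groups):
        for representative, members in clusters:
            if levenshtein(key, representative) <= max_distance:
                members.extend(groups[key])
                break
        else:
            # No existing cluster is close enough, so start a new one.
            clusters.append((key, list(groups[key])))
    return [members for _, members in clusters]
```

Calling cluster_titles() on the title column of the tab-delimited file returns a list of lists, one per cluster. Keeping a single member from each list would roughly correspond to Step #5, with the same caveat as above: it treats every duplicate as just as good as any other. Comparing every key against every cluster representative is quadratic in the worst case, so a large collection would probably need some form of blocking first.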
[1] OpenRefine - http://openrefine.org
[2] clustering - https://github.com/OpenRefine/OpenRefine/wiki/Clustering
[3] “Removing duplicate rows when Exact values are found in a column” - https://github.com/OpenRefine/OpenRefine/wiki/Recipes

--
Eric Morgan