Hi, Eric –

I haven't worked with HathiTrust, but I have done some de-duping projects. How would you do one by hand?

My workflow for deduping has been:

- Export a big batch of MARC records
- Load the file into a program
- Have the program process the records one at a time
- For each record, load it into a data structure. Organize the data structure so that duplicate records are all stored together. This may require additional logic.
- Once all the records are loaded into the data structure, go through the data structure and process each cluster of duplicates. Pick one record from each cluster and write it to a new data structure used for output.
- Go through the output data structure and write those records out to a file

I use Java and Marc4J. For one project, the records had a common field I could use as an identifier. I put them into a HashMap using that identifier as the key. For another project, I put them into an actual database. I think it was Derby. SQLite is also a good DB for the relatively small number of records libraries commonly work with (hundreds of thousands to millions). I still used the DB as a relatively simple map, but being able to filter and perform additional processing steps with SQL was helpful.

The OCLC Classify API can also be thrown into the mix:

http://classify.oclc.org/classify2/Classify?oclc=6741810&summary=true
http://classify.oclc.org/classify2/Classify?oclc=6741810&summary=false

Apologies if this info is too rudimentary for where you are starting from. If it's not rudimentary enough, I'd be happy to write a simple Java program that could be used as a starting point.

Regards,
Tom

On Thu, Jan 25, 2018 at 9:54 AM, Eric Lease Morgan wrote:

> Working with the HathiTrust Research Center data can be fun, and I sincerely believe it is an under-utilized system, but creating collections sans duplicates is difficult. Has anybody here figured out a “kewl” way to remove duplicates?
>
> Creating HathiTrust collections is easy: do a search, select items of interest, and repeat until tired. One can then download a CSV file describing the collection, but upon closer inspection MANY of the titles are repeated. I know why this has happened, alas, but how might I automatically/programmatically resolve this issue? I’ve begun experimenting with OpenRefine. Does anybody else have other suggestions?
>
> —
> Eric Morgan
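
As a possible starting point, the HashMap-based workflow described in the reply can be sketched roughly as below. This is only a minimal sketch using the Java standard library: the Rec class is a hypothetical stand-in for a parsed MARC record (a real project would read and write records with Marc4J), the key field is assumed to be some shared identifier such as an OCLC number, and keeping the first record seen in each cluster is an arbitrary assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class DedupSketch {

    /** Hypothetical stand-in for a parsed MARC record. */
    static final class Rec {
        final String key;    // shared identifier used to detect duplicates (assumed to exist)
        final String title;

        Rec(String key, String title) {
            this.key = key;
            this.title = title;
        }
    }

    /** Load every record into a map so that duplicates are stored together. */
    static Map<String, List<Rec>> cluster(List<Rec> records) {
        // LinkedHashMap keeps clusters in the order their keys were first seen.
        Map<String, List<Rec>> clusters = new LinkedHashMap<>();
        for (Rec r : records) {
            clusters.computeIfAbsent(r.key, k -> new ArrayList<>()).add(r);
        }
        return clusters;
    }

    /** Pick one record from each cluster; here, simply the first one seen. */
    static List<Rec> pickOnePerCluster(Map<String, List<Rec>> clusters) {
        List<Rec> output = new ArrayList<>();
        for (List<Rec> group : clusters.values()) {
            output.add(group.get(0));
        }
        return output;
    }

    public static void main(String[] args) {
        // Made-up sample data, for illustration only.
        List<Rec> input = Arrays.asList(
                new Rec("id-1", "First title"),
                new Rec("id-2", "Second title"),
                new Rec("id-1", "First title (second copy)"));

        // In a real run, this loop would write each record to the output file.
        for (Rec r : pickOnePerCluster(cluster(input))) {
            System.out.println(r.key + "\t" + r.title);
        }
    }
}
```

When the records lack a single reliable identifier, the "additional logic" would go into how the key is built, for example by normalizing a title and date into one string before calling cluster.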