LISTSERV 16.5 - CODE4LIB Archives

Working with the HathiTrust Research Center data can be fun, and I sincerely believe it is an under-utilized system, but creating collections sans duplicates is difficult. Has anybody here figured out a “kewl” way to remove duplicates?

Creating HathiTrust collections is easy: do a search, select items of interest, and repeat until tired. One can then download a CSV file describing the collection, but upon closer inspection MANY of the titles are repeated. I know why this has happened, alas, but how might I automatically/programmatically resolve this issue? I’ve begun experimenting with OpenRefine. Does anybody else have other suggestions?
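
To make the question concrete, here is a rough Python sketch of one programmatic approach: reduce each title to a crude matching key (lowercase, accents and punctuation stripped) and keep only the first row seen for each key. The column names "htid" and "title" and the sample rows are assumptions made up for illustration; the headers in an actual collection download may well differ. Matching on title alone is also a blunt instrument, since it will merge distinct editions and the separate volumes of a multi-volume work.

```python
import csv
import io
import re
import unicodedata


def normalize_title(title):
    """Reduce a title to a crude matching key: lowercase, no accents or punctuation."""
    text = unicodedata.normalize("NFKD", title)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z0-9 ]+", " ", text.lower())
    return " ".join(text.split())


def dedupe_rows(rows, key_column="title"):
    """Keep the first row seen for each normalized title, preserving order."""
    seen = set()
    unique = []
    for row in rows:
        key = normalize_title(row.get(key_column) or "")
        if not key:
            # No usable title: keep the row rather than guess.
            unique.append(row)
            continue
        if key in seen:
            continue
        seen.add(key)
        unique.append(row)
    return unique


# Tiny made-up sample standing in for a downloaded collection file.
# The column names here are hypothetical.
SAMPLE = """htid,title
aaa.001,"Walden; or, Life in the woods."
bbb.002,"Walden, or Life in the Woods"
ccc.003,"Leaves of grass"
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
unique_rows = dedupe_rows(rows)
print(len(rows), "rows in,", len(unique_rows), "rows out")
```

For a real file, one would replace the sample with something like csv.DictReader(open("collection.csv", newline="", encoding="utf-8")), adjusting the delimiter and key_column to match the download. Adding author or publication date to the key would make the matching less aggressive.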

—
Eric Morgan