Hi, Eric –

I haven't worked with HathiTrust, but I have done some de-duping projects.

How would you do one by hand?

My workflow for deduping has been:
- Export a big batch of MARC records
- Load the file into a program
- Have the program process the records one at a time
- Load each record into a data structure, organized so that duplicate
records are stored together (this may require additional matching logic)
- Once all the records are loaded, go through the data structure and
process each cluster of duplicates: pick one record from each cluster
and write it to a new data structure used for output
- Go through the output data structure and write those records out to a file
I use Java and Marc4J.

For one project, the records had a common field I could use as an
identifier, so I put them into a HashMap using that identifier as the key.

For another project, I put the records into an actual database; I
think it was Derby. SQLite is also a good database for the relatively
small number of records libraries commonly work with (hundreds of
thousands to a few million). I still used the database as a relatively
simple map, but being able to filter and perform additional processing
steps with SQL was helpful.
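As an illustration of the kind of SQL filtering I mean, assuming a hypothetical table records(id, oclc_number, title) loaded from the MARC file (the table and column names are made up):

```sql
-- One surviving row per OCLC number: keep the lowest record id in
-- each cluster, and report how many copies were collapsed.
SELECT MIN(id) AS keep_id, oclc_number, COUNT(*) AS copies
FROM records
GROUP BY oclc_number;
```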

The OCLC Classify API can also be thrown into the mix:
http://classify.oclc.org/classify2/Classify?oclc=6741810&summary=true
http://classify.oclc.org/classify2/Classify?oclc=6741810&summary=false

Apologies if this info is too rudimentary for where you are starting
from. If it's not rudimentary enough, I'd be happy to write a simple
Java program that could be used as a starting point.

Regards,

Tom

On Thu, Jan 25, 2018 at 9:54 AM, Eric Lease Morgan <[log in to unmask]> wrote:
> Working with the HathiTrust Research Center data can be fun, and I sincerely believe it is an under-utilized system, but creating collections sans duplicates is difficult. Has anybody here figured out a “kewl” way to remove duplicates.
>
> Creating HathiTrust collections is easy: do search, select items of interest, and repeat until tired. One can then download a CSV file describing the collection, but upon closer inspection MANY of the titles are repeated. I know why this has happened, alas, but how might I automatically/programmatically resolve this issue? I’ve begun experimenting with OpenRefine. Does anybody else have other suggestions?
>
> —
> Eric Morgan