Your mileage may vary, but MarcEdit has a dedup tool that will let you
take two files and find the duplicates between them. It also has a merge
tool that will let you take two files and merge specific fields from one
into the other (useful if, say, you want the 856 fields from two packages
in the same record). Some assumptions are made when matching: dedup can
use any field/subfield pair, though control numbers obviously work best,
and merging is done using a heuristic analysis of 20-25 different field
points, weighted by significance to produce a match score. Some folks
find it useful if you don't want to code something up yourself.
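MarcEdit's actual heuristic isn't spelled out here, but the general idea of weighting several field comparisons into a single match score can be sketched like this. The field names and weights below are purely hypothetical, and real records would need proper MARC parsing:

```python
# Illustrative sketch only -- not MarcEdit's actual algorithm.
# Each comparison point gets a weight; the normalized sum is a match score.

def match_score(rec_a, rec_b, weights):
    """rec_a/rec_b are dicts mapping field/subfield labels to values;
    weights maps those labels to their significance."""
    total = sum(weights.values())
    score = 0.0
    for field, weight in weights.items():
        a, b = rec_a.get(field), rec_b.get(field)
        if a and b and a == b:
            score += weight
    return score / total if total else 0.0

# Hypothetical weights: control numbers count far more than dates.
weights = {"020a": 5, "035a": 10, "245a": 3, "260c": 1}
a = {"020a": "9780306406157", "245a": "widget theory", "260c": "2013"}
b = {"020a": "9780306406157", "245a": "widget theory", "260c": "2012"}
print(match_score(a, b, weights))  # same ISBN and title, different date
```

You'd then treat any pair scoring above some tuned threshold as a merge candidate.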
Otherwise, if I were coding this, I'd stay away from requiring exact
matches. When doing matches, I've found I like to include a wide range of
elements and then use a fuzzy match on titles, because the title isn't as
fixed as you might like -- especially if the cataloging practices vary
between sources.
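A minimal sketch of that kind of fuzzy title comparison, using only the standard library; the normalization rules and the 0.9 threshold are assumptions you would tune against your own data:

```python
# Fuzzy title matching sketch -- threshold and normalization are assumptions.
import difflib

def normalize(title):
    # Lower-case, strip punctuation, and collapse whitespace before comparing.
    cleaned = "".join(c if c.isalnum() or c.isspace() else " "
                      for c in title.lower())
    return " ".join(cleaned.split())

def titles_match(t1, t2, threshold=0.9):
    # SequenceMatcher.ratio() returns a similarity score between 0 and 1.
    ratio = difflib.SequenceMatcher(None, normalize(t1), normalize(t2)).ratio()
    return ratio >= threshold

print(titles_match("Widget theory", "widget theory."))  # True
```

Trailing punctuation, case differences, and small transcription variations then stop breaking your matches, while genuinely different titles still score low.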
On Thu, Aug 15, 2013 at 2:29 PM, Andy Kohler <[log in to unmask]> wrote:
> Are you expecting to work with two files of records, outside of your ILS?
> If so, for a project like that I'd probably write Perl script(s) using
> MARC::Record (there are similar code libraries for Ruby, Python, and Java).
> For each record in each file, use the ISBN (and/or OCLC number and/or LCCN)
> as a key. Compare all sets, and keep one record per key.
> This assumes that the vendors are supplying records with standard
> identifiers, and not just their own record numbers.
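The key-based approach quoted above could be sketched like this, using plain dicts in place of real MARC parsing; the keys "isbn", "oclc", and "lccn" stand in for values you would pull from 020$a, 035$a, and 010$a with MARC::Record or pymarc:

```python
# Sketch of one-record-per-key dedup across multiple vendor files.
# Records are plain dicts here; real code would parse MARC fields.

def dedupe(*record_lists):
    seen = {}
    kept = []
    for records in record_lists:
        for rec in records:
            # Prefer ISBN, fall back to OCLC number, then LCCN.
            key = rec.get("isbn") or rec.get("oclc") or rec.get("lccn")
            if key is None:
                kept.append(rec)   # no identifier at all: keep, can't match
            elif key not in seen:
                seen[key] = rec    # first record wins for each key
                kept.append(rec)
    return kept

ebrary = [{"isbn": "9780306406157", "title": "Widgets"}]
ebsco = [{"isbn": "9780306406157", "title": "Widgets"},
         {"isbn": "9781234567897", "title": "Gadgets"}]
print(len(dedupe(ebrary, ebsco)))  # 2
```

The order of the file arguments decides which vendor's copy survives a collision, so list your preferred source first.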
> If you're comparing each file with what's already in your ILS, then it'll
> depend on the tools the ILS offers for matching incoming records to the
> database. Or, export the database and compare it with the files, as above.
> Andy Kohler / UCLA Library Info Tech
> [log in to unmask] / 310 206-8312
> On Thu, Aug 15, 2013 at 10:11 AM, Michael Beccaria <[log in to unmask]> wrote:
> > Has anyone had any luck finding a good way to de-duplicate MARC records
> > from ebook vendors? We're looking to integrate Ebrary and Ebsco Academic
> > Ebook collections, and they estimate an overlap into the 10's of