Terry Reese wrote a program called RobertCompare a few years back

http://oregonstate.edu/~reeset/marcedit/html/robertcompare.html

that could compare MARC records and tell you about differences. Perhaps
that would be useful.

kyle

On Mon, Oct 20, 2008 at 11:55 AM, Bess Sadler <[log in to unmask]> wrote:
> Hi, Mike.
>
> I don't know of any off-the-shelf software that does de-duplication of the
> kind you're describing, but it would be pretty useful. That would be awesome
> if someone wanted to build something like that into marc4j. Has anyone
> published any good algorithms for de-duping? As I understand it, if you have
> two records that are 100% identical except for holdings information, that's
> pretty easy. It gets harder when one record is more complete than the other,
> and very hard when one record has even slightly different information than
> the other, to tell whether they are the same record and decide whose
> information to privilege. Are there any good de-duping guidelines out there?
> When a library contracts out the de-duping of their catalog, what kind of
> specific guidelines are they expected to provide? Anyone know?
>
> I remember the open library folks were very interested in this question. Any
> open library folks on this list? Did that effort to de-dupe all those
> contributed marc records ever go anywhere?
>
> Bess
>
> On Oct 20, 2008, at 1:12 PM, Michael Beccaria wrote:
>
>> Very cool! I noticed that a feature, MarcDirStreamReader, is capable of
>> iterating over all marc record files in a given directory. Does anyone
>> know of any de-duplicating efforts done with marc4j? For example,
>> libraries that have similar holdings would have their records merged
>> into one record with a location tag somewhere. I know places do it
>> (consortia etc.) but I haven't been able to find a good open program
>> that handles stuff like that.
>>
>> Mike Beccaria
>> Systems Librarian
>> Head of Digital Initiatives
>> Paul Smith's College
>> 518.327.6376
>> [log in to unmask]
>>
>> ---
>> This message may contain confidential information and is intended only
>> for the individual named. If you are not the named addressee you should
>> not disseminate, distribute or copy this e-mail. Please notify the
>> sender immediately by e-mail if you have received this e-mail by mistake
>> and delete this e-mail from your system.
>>
>> -----Original Message-----
>> From: Code for Libraries [mailto:[log in to unmask]] On Behalf Of
>> Bess Sadler
>> Sent: Monday, October 20, 2008 11:12 AM
>> To: [log in to unmask]
>> Subject: [CODE4LIB] marc4j 2.4 released
>>
>> Dear Code4Libbers,
>>
>> I'm very pleased to announce that for the first time in almost two
>> years there has been a new release of marc4j. Release 2.4 is a minor
>> release in the sense that it shouldn't break any existing code, but
>> it's a major release in the sense that it represents an influx of new
>> people into the development of this project, and a significant
>> improvement in marc4j's ability to handle malformed or mis-encoded
>> marc records.
>>
>> Release notes are here:
>> http://marc4j.tigris.org/files/documents/220/44060/changes.txt
>>
>> And the project website, including download links, is here:
>> http://marc4j.tigris.org/
>>
>> We've been using this new marc4j code in solrmarc since solrmarc
>> started, so if you're using Blacklight or VuFind, you're probably
>> using it already, just in an unreleased form.
>>
>> Bravo to Bob Haschart, Wayne Graham, and Bas Peters for making these
>> improvements to marc4j and getting this release out the door.
>>
>> Bess
>>
>> Elizabeth (Bess) Sadler
>> Research and Development Librarian
>> Digital Scholarship Services
>> Box 400129
>> Alderman Library
>> University of Virginia
>> Charlottesville, VA 22904
>>
>> [log in to unmask]
>> (434) 243-2305
>

--
----------------------------------------------------------
Kyle Banerjee
Digital Services Program Manager
Orbis Cascade Alliance
[log in to unmask] / 541.359.9599
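
For what it's worth, the "easy" case Bess mentions in the quoted message (records 100% identical except for holdings) can be sketched roughly as below. This is only an illustration, not something marc4j or RobertCompare provides. It assumes each record has already been parsed into a list of (tag, value) pairs, and that holdings/location data lives in field 852; both are assumptions for the example, and the function names are made up. It does nothing for the harder cases where one record is more complete or slightly different.

```python
# Rough sketch only: merges records that are identical apart from holdings.
# Assumption: a record is a list of (tag, value) tuples, already parsed.
# Assumption: holdings/location information is carried in tag 852.
HOLDINGS_TAG = "852"


def dedupe_key(record):
    """Build a comparison key from every non-holdings field.

    Fields are sorted so that two records with the same fields in a
    different order still produce the same key.
    """
    return tuple(sorted(field for field in record if field[0] != HOLDINGS_TAG))


def merge_exact_duplicates(records):
    """Merge records whose non-holdings fields match exactly.

    The first record seen for each key supplies the bibliographic fields;
    holdings fields from later duplicates are appended to it.
    """
    merged = {}  # plain dicts keep insertion order in Python 3.7+
    for record in records:
        key = dedupe_key(record)
        if key not in merged:
            # Start the merged record with the bibliographic fields only.
            merged[key] = [f for f in record if f[0] != HOLDINGS_TAG]
        for field in record:
            # Carry over each distinct holdings field exactly once.
            if field[0] == HOLDINGS_TAG and field not in merged[key]:
                merged[key].append(field)
    return list(merged.values())
```

Passing in two copies of the same title held by different libraries should give back one record carrying both 852 fields. Sorting the fields for the key is a design choice: it treats field order as insignificant, which may or may not be what a given catalog wants.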