Terry Reese wrote a program called RobertCompare a few years back
(http://oregonstate.edu/~reeset/marcedit/html/robertcompare.html) that
can compare MARC records and report the differences between them. Perhaps
that would be useful.
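
For the easy case described in the quoted message below (records that are
identical apart from holdings), a minimal sketch might look like the
following. It is not how RobertCompare or marc4j work; the record
representation (a plain dict of tag to list of field values), the function
names, and the choice of 852 as the only holdings field are all assumptions
made for illustration.

```python
# Assumption: field 852 (location) is the only holdings field we ignore.
HOLDINGS_TAG = "852"


def dedup_key(record):
    """Build a hashable key from every field except holdings."""
    return tuple(sorted(
        (tag, value)
        for tag, values in record.items()
        if tag != HOLDINGS_TAG
        for value in values
    ))


def merge_identical(records):
    """Merge records that match exactly once holdings are ignored.

    Each merged record keeps the shared bibliographic fields and collects
    the distinct 852 values from every record in its group.
    """
    groups = {}
    for record in records:
        groups.setdefault(dedup_key(record), []).append(record)

    merged = []
    for group in groups.values():
        # Copy the non-holdings fields from the first record in the group.
        combined = {
            tag: list(values)
            for tag, values in group[0].items()
            if tag != HOLDINGS_TAG
        }
        # Gather distinct locations, preserving first-seen order.
        locations = []
        for record in group:
            for location in record.get(HOLDINGS_TAG, []):
                if location not in locations:
                    locations.append(location)
        if locations:
            combined[HOLDINGS_TAG] = locations
        merged.append(combined)
    return merged
```

This only handles exact matches. A record whose title differs by a single
punctuation mark ends up in its own group, which is exactly the hard case
raised below.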
kyle
On Mon, Oct 20, 2008 at 11:55 AM, Bess Sadler <[log in to unmask]> wrote:
> Hi, Mike.
>
> I don't know of any off-the-shelf software that does de-duplication of the
> kind you're describing, but it would be pretty useful. It would be awesome
> if someone wanted to build something like that into marc4j. Has anyone
> published any good algorithms for de-duping? As I understand it, if you have
> two records that are 100% identical except for holdings information, that's
> pretty easy. It gets harder when one record is more complete than the other,
> and very hard when one record has even slightly different information than
> the other: then you have to work out whether they are the same record and
> decide whose information to privilege. Are there any good de-duping
> guidelines out there? When a library contracts out the de-duping of its
> catalog, what kind of specific guidelines is it expected to provide? Anyone
> know?
>
> I remember the Open Library folks were very interested in this question. Any
> Open Library folks on this list? Did that effort to de-dupe all those
> contributed MARC records ever go anywhere?
>
> Bess
>
> On Oct 20, 2008, at 1:12 PM, Michael Beccaria wrote:
>
>> Very cool! I noticed that one feature, MarcDirStreamReader, can iterate
>> over all MARC record files in a given directory. Does anyone
>> know of any de-duplicating efforts done with marc4j? For example,
>> libraries that have similar holdings would have their records merged
>> into one record with a location tag somewhere. I know places do it
>> (consortia etc.) but I haven't been able to find a good open program
>> that handles stuff like that.
>>
>> Mike Beccaria
>> Systems Librarian
>> Head of Digital Initiatives
>> Paul Smith's College
>> 518.327.6376
>> [log in to unmask]
>>
>> -----Original Message-----
>> From: Code for Libraries [mailto:[log in to unmask]] On Behalf Of
>> Bess Sadler
>> Sent: Monday, October 20, 2008 11:12 AM
>> To: [log in to unmask]
>> Subject: [CODE4LIB] marc4j 2.4 released
>>
>> Dear Code4Libbers,
>>
>> I'm very pleased to announce that for the first time in almost two
>> years there has been a new release of marc4j. Release 2.4 is a minor
>> release in the sense that it shouldn't break any existing code, but
>> it's a major release in the sense that it represents an influx of new
>> people into the development of this project, and a significant
>> improvement in marc4j's ability to handle malformed or mis-encoded
>> MARC records.
>>
>> Release notes are here:
>> http://marc4j.tigris.org/files/documents/220/44060/changes.txt
>>
>> And the project website, including download links, is here:
>> http://marc4j.tigris.org/
>>
>> We've been using this new marc4j code in solrmarc since solrmarc
>> started, so if you're using Blacklight or VuFind, you're probably
>> using it already, just in an unreleased form.
>>
>> Bravo to Bob Haschart, Wayne Graham, and Bas Peters for making these
>> improvements to marc4j and getting this release out the door.
>>
>> Bess
>>
>> Elizabeth (Bess) Sadler
>> Research and Development Librarian
>> Digital Scholarship Services
>> Box 400129
>> Alderman Library
>> University of Virginia
>> Charlottesville, VA 22904
>>
>> [log in to unmask]
>> (434) 243-2305
>
--
----------------------------------------------------------
Kyle Banerjee
Digital Services Program Manager
Orbis Cascade Alliance
[log in to unmask] / 541.359.9599