To me, "de-duplication" means throwing out some records as duplicates.
Are we talking about that, or are we talking about what I call "work set
grouping" and others (erroneously in my opinion) call "FRBRization"?
If the latter, I don't think there is any mature open source software
that addresses that yet. Or for that matter, any proprietary
for-purchase software that you could use as a component in your own
tools. Various proprietary products include a work set grouping feature
in their "black box" (AquaBrowser, Primo, and I believe the VTLS ILS). But I
don't know of anything available to do it for you in your own tool.
I've just started giving some thought to how to accomplish this,
and it's a tricky problem on several grounds, including
computationally (doing it in a way that performs efficiently). One
choice is whether you group records at the indexing stage or on demand
at the retrieval stage. Both have performance implications. Usually, if
you have the choice, you put the slowdown at indexing, since in abstract
theory it only happens "once". But in practice, indexing that has
already been optimized, and that does not have this feature, can take
hours or even days with some of our corpora, and we do in fact re-index
from time to time (including 'incremental' addition of new and changed
records to the index). So we really don't want to slow down indexing
or retrieval.
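
To make the index-time option concrete, here is a rough sketch, not a
tested or recommended approach. It assumes records are plain dicts with
"id", "author" and "title" fields (not real MARC records), and the
function names and the author-plus-title key are hypothetical choices of
mine. Real work set grouping would need far more than this (uniform
titles, editions, translations and so on).

```python
import re
import unicodedata
from collections import defaultdict


def normalize(text):
    """Lowercase, strip accents and punctuation, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", text)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    cleaned = re.sub(r"[^a-z0-9 ]+", " ", no_accents.lower())
    return " ".join(cleaned.split())


def work_key(record):
    """Hypothetical work set key: normalized author plus normalized title."""
    author = normalize(record.get("author", ""))
    title = normalize(record.get("title", ""))
    return author + "|" + title


def index_records(records):
    """Index-time step: compute each key once and file the record id under it."""
    index = defaultdict(list)
    for record in records:
        index[work_key(record)].append(record["id"])
    return index


def work_set(index, key):
    """Retrieval-time step: a single lookup, no record-to-record comparison."""
    return index.get(key, [])
```

The cost of computing the key is paid once per record at indexing, and
retrieval is a single lookup. The on-demand alternative would call
work_key (or something smarter) on each result set at query time
instead, which keeps indexing fast but slows down retrieval.
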
Jonathan
Bess Sadler wrote:
> Hi, Mike.
>
> I don't know of any off-the-shelf software that does de-duplication of
> the kind you're describing, but it would be pretty useful. That would
> be awesome if someone wanted to build something like that into marc4j.
> Has anyone published any good algorithms for de-duping? As I
> understand it, if you have two records that are 100% identical except
> for holdings information, that's pretty easy. It gets harder when one
> record is more complete than the other, and very hard when one record
> has even slightly different information than the other, to tell
> whether they are the same record and decide whose information to
> privilege. Are there any good de-duping guidelines out there? When a
> library contracts out the de-duping of their catalog, what kind of
> specific guidelines are they expected to provide? Anyone know?
>
> I remember the open library folks were very interested in this
> question. Any open library folks on this list? Did that effort to
> de-dupe all those contributed marc records ever go anywhere?
>
> Bess
>
> On Oct 20, 2008, at 1:12 PM, Michael Beccaria wrote:
>
>> Very cool! I noticed that a feature, MarcDirStreamReader, is capable of
>> iterating over all marc record files in a given directory. Does anyone
>> know of any de-duplicating efforts done with marc4j? For example,
>> libraries that have similar holdings would have their records merged
>> into one record with a location tag somewhere. I know places do it
>> (consortia etc.) but I haven't been able to find a good open program
>> that handles stuff like that.
>>
>> Mike Beccaria
>> Systems Librarian
>> Head of Digital Initiatives
>> Paul Smith's College
>> 518.327.6376
>> [log in to unmask]
>>
>> ---
>> This message may contain confidential information and is intended only
>> for the individual named. If you are not the named addressee you should
>> not disseminate, distribute or copy this e-mail. Please notify the
>> sender immediately by e-mail if you have received this e-mail by mistake
>> and delete this e-mail from your system.
>>
>> -----Original Message-----
>> From: Code for Libraries [mailto:[log in to unmask]] On Behalf Of
>> Bess Sadler
>> Sent: Monday, October 20, 2008 11:12 AM
>> To: [log in to unmask]
>> Subject: [CODE4LIB] marc4j 2.4 released
>>
>> Dear Code4Libbers,
>>
>> I'm very pleased to announce that for the first time in almost two
>> years there has been a new release of marc4j. Release 2.4 is a minor
>> release in the sense that it shouldn't break any existing code, but
>> it's a major release in the sense that it represents an influx of new
>> people into the development of this project, and a significant
>> improvement in marc4j's ability to handle malformed or mis-encoded
>> marc records.
>>
>> Release notes are here:
>> http://marc4j.tigris.org/files/documents/220/44060/changes.txt
>>
>> And the project website, including download links, is here:
>> http://marc4j.tigris.org/
>>
>> We've been using this new marc4j code in solrmarc since solrmarc
>> started, so if you're using Blacklight or VuFind, you're probably
>> using it already, just in an unreleased form.
>>
>> Bravo to Bob Haschart, Wayne Graham, and Bas Peters for making these
>> improvements to marc4j and getting this release out the door.
>>
>> Bess
>>
>> Elizabeth (Bess) Sadler
>> Research and Development Librarian
>> Digital Scholarship Services
>> Box 400129
>> Alderman Library
>> University of Virginia
>> Charlottesville, VA 22904
>>
>> [log in to unmask]
>> (434) 243-2305
>
--
Jonathan Rochkind
Digital Services Software Engineer
The Sheridan Libraries
Johns Hopkins University
410.516.8886
rochkind (at) jhu.edu