I've wondered if standard number matching (ISBN, LCCN, OCLC,
ISSN ...) would be a big piece. Isn't there such a service from OCLC,
and another flavor of something-or-other from LibraryThing?
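
For the easy slice of that, something like the following might work --
a rough sketch in Java against the marc4j reader API (the
normalization rules and the choice of 020 $a are just my guesses at
sensible defaults, not a recommendation):

import java.io.FileInputStream;
import java.util.*;
import org.marc4j.MarcStreamReader;
import org.marc4j.marc.DataField;
import org.marc4j.marc.Record;
import org.marc4j.marc.Subfield;

// Bucket records by normalized ISBN (020 $a). Records sharing a key
// are *candidate* duplicates, not confirmed ones.
public class IsbnBuckets {

    // Keep the first whitespace-delimited token (drops qualifiers
    // like "(pbk.)"), then strip everything but digits and X.
    static String normalizeIsbn(String raw) {
        String token = raw.trim().split("\\s+")[0];
        return token.replaceAll("[^0-9Xx]", "").toUpperCase();
    }

    public static void main(String[] args) throws Exception {
        Map<String, List<String>> buckets = new HashMap<String, List<String>>();
        MarcStreamReader reader =
                new MarcStreamReader(new FileInputStream(args[0]));
        while (reader.hasNext()) {
            Record rec = reader.next();
            for (Object vf : rec.getVariableFields("020")) {
                Subfield a = ((DataField) vf).getSubfield('a');
                if (a == null) continue;
                String key = normalizeIsbn(a.getData());
                if (key.length() < 10) continue; // junk value, skip
                if (!buckets.containsKey(key))
                    buckets.put(key, new ArrayList<String>());
                buckets.get(key).add(rec.getControlNumber()); // 001
            }
        }
        for (Map.Entry<String, List<String>> e : buckets.entrySet())
            if (e.getValue().size() > 1)
                System.out.println(e.getKey() + " -> " + e.getValue());
    }
}

The same bucketing would work for LCCN (010 $a) or OCLC numbers
(035 $a), each with its own normalization quirks.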

- Naomi

On Oct 20, 2008, at 12:21 PM, Jonathan Rochkind wrote:

> To me, "de-duplication" means throwing out some records as  
> duplicates. Are we talking about that, or are we talking about what  
> I call "work set grouping" and others (erroneously in my opinion)  
> call "FRBRization"?
>
> If the latter, I don't think there is any mature open source  
> software that addresses that yet. Or for that matter, any  
> proprietary for-purchase software that you could use as a component  
> in your own tools. Various proprietary software includes a work set  
> grouping feature in its "black box" (AquaBrowser, Primo, I believe  
> the VTLS ILS).  But I don't know of anything available to do it for  
> you in your own tool.
>
> I've been just starting to give some thought to how to accomplish  
> this, and it's a bit of a tricky problem on several grounds,  
> including computationally (doing it in a way that performs  
> efficiently). One choice is whether you group records at the  
> indexing stage, or on-demand at the retrieval stage. Both have  
> performance implications--we really don't want to slow down  
> retrieval OR indexing.  Usually if you have the choice, you put the  
> slowdown at indexing, since it only happens "once" in abstract  
> theory. But in fact, indexing that's already been optimized and does  
> not have this feature can take hours or even days with some of our  
> corpuses, and since we do re-index from time to time (including  
> 'incremental' addition of new and changed records to the index), we  
> really don't want to slow down indexing either.
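>
> To make that concrete: one way to keep the indexing-stage cost small
> is to reduce the grouping to computing a single normalized "work
> key" per record at index time, so that retrieval-time grouping is
> just a collapse on one field. A toy sketch (my own strawman
> normalization, not anyone's published algorithm):
>
> import java.text.Normalizer;
> import java.util.Locale;
>
> // Strawman: derive a "work key" from title and author at index
> // time; grouping at retrieval is then a collapse on this one field.
> public class WorkKey {
>
>     // Fold diacritics, lowercase, drop punctuation, squeeze spaces.
>     static String norm(String s) {
>         return Normalizer.normalize(s, Normalizer.Form.NFD)
>                 .replaceAll("\\p{M}+", "")
>                 .toLowerCase(Locale.ENGLISH)
>                 .replaceAll("[^a-z0-9 ]", " ")
>                 .replaceAll("\\s+", " ")
>                 .trim();
>     }
>
>     public static String workKey(String title, String author) {
>         return norm(title) + "/" + norm(author);
>     }
>
>     public static void main(String[] args) {
>         // Both forms collapse to the same key.
>         System.out.println(workKey("Moby-Dick; or, The Whale", "Melville, Herman"));
>         System.out.println(workKey("Moby Dick, or the whale", "Melville, Herman"));
>     }
> }
>
> The hard part, of course, is picking a normalization that puts real
> work sets together without false merges; the key computation itself
> is cheap.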
>
> Jonathan
>
> Bess Sadler wrote:
>> Hi, Mike.
>>
>> I don't know of any off-the-shelf software that does de-duplication  
>> of the kind you're describing, but it would be pretty useful. That  
>> would be awesome if someone wanted to build something like that  
>> into marc4j. Has anyone published any good algorithms for  
>> de-duping? As I understand it, if you have two records that are 100%  
>> identical except for holdings information, that's pretty easy. It  
>> gets harder when one record is more complete than the other, and  
>> when the records differ even slightly it's very hard to tell whether  
>> they describe the same item and to decide whose information to  
>> privilege. Are there any good de-duping  
>> guidelines out there? When a library contracts out the de-duping of  
>> their catalog, what kind of specific guidelines are they expected  
>> to provide? Anyone know?
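>>
>> The "easy" case might look something like this in marc4j -- with
>> the big caveat that which tags count as holdings is entirely
>> site-specific (852/866 and local 9xx below are just placeholders):
>>
>> import java.util.*;
>> import org.marc4j.marc.Record;
>> import org.marc4j.marc.VariableField;
>>
>> // Two records count as duplicates here if their fields match once
>> // control numbers and holdings-ish tags are ignored.
>> public class DupCheck {
>>
>>     // 001/003/005 differ between sources; 852/866 carry holdings.
>>     static final Set<String> IGNORE = new HashSet<String>(
>>             Arrays.asList("001", "003", "005", "852", "866"));
>>
>>     static List<String> signature(Record rec) {
>>         List<String> sig = new ArrayList<String>();
>>         for (Object o : rec.getVariableFields()) {
>>             VariableField vf = (VariableField) o;
>>             String tag = vf.getTag();
>>             // Skip ignored tags and local 9xx fields.
>>             if (IGNORE.contains(tag) || tag.startsWith("9")) continue;
>>             sig.add(vf.toString()); // tag + indicators + subfields
>>         }
>>         Collections.sort(sig);
>>         return sig;
>>     }
>>
>>     public static boolean sameExceptHoldings(Record a, Record b) {
>>         return signature(a).equals(signature(b));
>>     }
>> }
>>
>> Anything fuzzier than exact-match-minus-holdings needs real matching
>> rules, which is exactly where published guidelines would help.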
>>
>> I remember the Open Library folks were very interested in this  
>> question. Any Open Library folks on this list? Did that effort to  
>> de-dupe all those contributed MARC records ever go anywhere?
>>
>> Bess
>>
>> On Oct 20, 2008, at 1:12 PM, Michael Beccaria wrote:
>>
>>> Very cool! I noticed that a feature, MarcDirStreamReader, is
>>> capable of iterating over all MARC record files in a given
>>> directory. Does anyone know of any de-duplicating efforts done
>>> with marc4j? For example,
>>> libraries that have similar holdings would have their records merged
>>> into one record with a location tag somewhere. I know places do it
>>> (consortia etc.) but I haven't been able to find a good open program
>>> that handles stuff like that.
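>>>
>>> For what it's worth, the directory-iteration piece alone is only a
>>> few lines -- a quick sketch, and I'm going from memory on the
>>> constructor, so check the marc4j javadoc:
>>>
>>> import java.io.File;
>>> import org.marc4j.MarcDirStreamReader;
>>> import org.marc4j.MarcReader;
>>> import org.marc4j.marc.Record;
>>>
>>> // Walk every MARC record file in a directory and count records.
>>> public class CountRecords {
>>>     public static void main(String[] args) {
>>>         MarcReader reader = new MarcDirStreamReader(new File(args[0]));
>>>         int count = 0;
>>>         while (reader.hasNext()) {
>>>             Record rec = reader.next();
>>>             // merge / de-dupe logic would go here
>>>             count++;
>>>         }
>>>         System.out.println(count + " records");
>>>     }
>>> }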
>>>
>>> Mike Beccaria
>>> Systems Librarian
>>> Head of Digital Initiatives
>>> Paul Smith's College
>>> 518.327.6376
>>> [log in to unmask]
>>>
>>>
> -- 
> Jonathan Rochkind
> Digital Services Software Engineer
> The Sheridan Libraries
> Johns Hopkins University
> 410.516.8886 rochkind (at) jhu.edu

Naomi Dushay
[log in to unmask]