I've wondered if standard number matching (ISBN, LCCN, OCLC,
ISSN ...) would be a big piece. Isn't there such a service from OCLC
(xISBN), and a similar one from LibraryThing (thingISBN)?
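
Something like this rough, untested sketch is what I'm imagining:
bucket records by normalized ISBN before trying anything fuzzier. (The
class and method names are made up, and it assumes marc4j's
Record/DataField API; a real matcher would also consult the LCCN in
010$a and OCLC numbers in 035$a.)

  import java.util.*;
  import org.marc4j.marc.DataField;
  import org.marc4j.marc.Record;

  public class StandardNumberMatcher {

      // Strip hyphens and trailing qualifiers like "(pbk.)".
      static String normalizeIsbn(String raw) {
          String s = raw.trim().split("\\s+")[0].replace("-", "").toUpperCase();
          return s.matches("\\d{9}[\\dX]|\\d{13}") ? s : null;
      }

      // Bucket records that share a normalized ISBN (020$a).
      public static Map<String, List<Record>> groupByIsbn(Iterable<Record> records) {
          Map<String, List<Record>> groups = new HashMap<String, List<Record>>();
          for (Record rec : records) {
              for (Object o : rec.getVariableFields("020")) {
                  DataField df = (DataField) o;
                  if (df.getSubfield('a') == null) continue;
                  String isbn = normalizeIsbn(df.getSubfield('a').getData());
                  if (isbn == null) continue;
                  List<Record> bucket = groups.get(isbn);
                  if (bucket == null) {
                      bucket = new ArrayList<Record>();
                      groups.put(isbn, bucket);
                  }
                  bucket.add(rec);
              }
          }
          return groups;
      }
  }
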
- Naomi
On Oct 20, 2008, at 12:21 PM, Jonathan Rochkind wrote:
> To me, "de-duplication" means throwing out some records as
> duplicates. Are we talking about that, or are we talking about what
> I call "work set grouping" and others (erroneously in my opinion)
> call "FRBRization"?
>
> If the latter, I don't think there is any mature open source
> software that addresses that yet. Or for that matter, any
> proprietary for-purchase software that you could use as a component
> in your own tools. Various proprietary software includes a work set
> grouping feature in its "black box" (AquaBrowser, Primo, I believe
> the VTLS ILS). But I don't know of anything available to do it for
> you in your own tool.
>
> I've been just starting to give some thought to how to accomplish
> this, and it's a bit of a tricky problem on several grounds,
> including computationally (doing it in a way that performs
> efficiently). One choice is whether you group records at the
> indexing stage, or on-demand at the retrieval stage. Both have
> performance implications--we really don't want to slow down
> retrieval OR indexing. Usually if you have the choice, you put the
> slowdown at indexing, since it only happens "once" in abstract
> theory. But in fact, with what we do, indexing that's already been
> optimized and does not have this feature can take hours or even days
> with some of our corpora, and we do re-index from time to time
> (including 'incremental' addition of new and changed records to the
> index), so we really don't want to slow down indexing either.
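>
> To make the index-time option concrete, here's the kind of thing I
> mean -- a deliberately naive, untested sketch (not code from any of
> our tools; real grouping needs much smarter normalization than this):
>
>   import org.marc4j.marc.DataField;
>   import org.marc4j.marc.Record;
>
>   public class WorkKey {
>
>       static String subfieldA(Record rec, String tag) {
>           DataField df = (DataField) rec.getVariableField(tag);
>           if (df == null || df.getSubfield('a') == null) return "";
>           return df.getSubfield('a').getData();
>       }
>
>       static String normalize(String s) {
>           return s.toLowerCase().replaceAll("[^a-z0-9]+", " ").trim();
>       }
>
>       // A cheap "work key" from main entry (100$a) plus title (245$a).
>       // Stored in its own index field, grouping at retrieval becomes
>       // a cheap collapse on this key rather than record comparison.
>       public static String workKey(Record rec) {
>           return normalize(subfieldA(rec, "100")) + "|"
>                   + normalize(subfieldA(rec, "245"));
>       }
>   }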
>
> Jonathan
>
> Bess Sadler wrote:
>> Hi, Mike.
>>
>> I don't know of any off-the-shelf software that does de-duplication
>> of the kind you're describing, but it would be pretty useful. That
>> would be awesome if someone wanted to build something like that
>> into marc4j. Has anyone published any good algorithms for
>> de-duping? As I understand it, if you have two records that are
>> 100% identical except for holdings information, that's pretty easy.
>> It gets harder when one record is more complete than the other, and
>> very hard when the records differ even slightly: then you have to
>> tell whether they describe the same item and decide whose
>> information to privilege. Are there any good de-duping
>> guidelines out there? When a library contracts out the de-duping of
>> their catalog, what kind of specific guidelines are they expected
>> to provide? Anyone know?
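>>
>> For the easy case, I'd imagine something like this untested sketch
>> (which tags count as "local" varies by site, so treating 852 and 9xx
>> as local here is just an assumption):
>>
>>   import java.util.*;
>>   import org.marc4j.marc.Record;
>>   import org.marc4j.marc.VariableField;
>>
>>   public class DupCheck {
>>
>>       // Holdings/local fields to ignore when comparing (assumed tags).
>>       static boolean isLocal(String tag) {
>>           return tag.equals("852") || tag.startsWith("9");
>>       }
>>
>>       static List<String> comparableFields(Record rec) {
>>           List<String> out = new ArrayList<String>();
>>           for (Object o : rec.getVariableFields()) {
>>               VariableField vf = (VariableField) o;
>>               if (!isLocal(vf.getTag())) out.add(vf.toString());
>>           }
>>           Collections.sort(out);
>>           return out;
>>       }
>>
>>       // True if the records match 100% apart from holdings/local fields.
>>       public static boolean sameExceptHoldings(Record a, Record b) {
>>           return comparableFields(a).equals(comparableFields(b));
>>       }
>>   }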
>>
>> I remember the Open Library folks were very interested in this
>> question. Any Open Library folks on this list? Did that effort to
>> de-dupe all those contributed MARC records ever go anywhere?
>>
>> Bess
>>
>> On Oct 20, 2008, at 1:12 PM, Michael Beccaria wrote:
>>
>>> Very cool! I noticed that a feature, MarcDirStreamReader, is capable
>>> of iterating over all MARC record files in a given directory. Does
>>> anyone know of any de-duplicating efforts done with marc4j? For
>>> example, libraries that have similar holdings would have their
>>> records merged into one record with a location tag somewhere. I know
>>> places do it (consortia, etc.) but I haven't been able to find a good
>>> open program that handles stuff like that.
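>>>
>>> Roughly what I'm picturing, as an untested sketch (MarcDirStreamReader
>>> is real, but the match key here -- just the 001 control number -- is a
>>> crude stand-in, and the merge itself is left as a comment):
>>>
>>>   import java.io.File;
>>>   import java.util.*;
>>>   import org.marc4j.MarcDirStreamReader;
>>>   import org.marc4j.MarcReader;
>>>   import org.marc4j.marc.Record;
>>>
>>>   public class DirDedup {
>>>       public static void main(String[] args) {
>>>           // Iterate over every MARC file in the directory args[0].
>>>           MarcReader reader = new MarcDirStreamReader(new File(args[0]));
>>>           Map<String, List<Record>> buckets = new HashMap<String, List<Record>>();
>>>           while (reader.hasNext()) {
>>>               Record rec = reader.next();
>>>               String key = rec.getControlNumber();  // stand-in match key
>>>               if (key == null) continue;
>>>               List<Record> bucket = buckets.get(key);
>>>               if (bucket == null) {
>>>                   bucket = new ArrayList<Record>();
>>>                   buckets.put(key, bucket);
>>>               }
>>>               bucket.add(rec);
>>>           }
>>>           for (Map.Entry<String, List<Record>> e : buckets.entrySet()) {
>>>               if (e.getValue().size() > 1) {
>>>                   // A real merger would keep the fullest record and copy
>>>                   // the others' location fields (e.g. 852) into it.
>>>                   System.out.println(e.getKey() + ": "
>>>                           + e.getValue().size() + " duplicate candidates");
>>>               }
>>>           }
>>>       }
>>>   }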
>>>
>>> Mike Beccaria
>>> Systems Librarian
>>> Head of Digital Initiatives
>>> Paul Smith's College
>>> 518.327.6376
>>>
>>>
> --
> Jonathan Rochkind
> Digital Services Software Engineer
> The Sheridan Libraries
> Johns Hopkins University
> 410.516.8886 rochkind (at) jhu.edu
Naomi Dushay