Hi all:
My student, Yee Fan Tan, and I published a short technical column on
record linkage tasks (very similar to the de-dup task discussed here)
in February in the Communications of the ACM.
Min-Yen Kan and Yee Fan Tan (2008) "Record matching in digital library
metadata". In Communications of the ACM, Technical opinion column,
pp. 91-94, February.
http://doi.acm.org/10.1145/1314215.1314231
We're in the process of releasing a tool/demo for de-dup tasks, as a
Java library (jar). If there's sufficient interest, we might try to
tailor some of our string similarity metrics to MARC or other catalog
data.
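For anyone new to the idea of a string similarity metric, here is a minimal sketch in Python (standard library only). It is not the library mentioned above: the function names, the two measures chosen (word-set Jaccard overlap and difflib's character ratio), and the 0.85 threshold are all assumptions made for illustration, and real catalog data would likely need per-field tuning.

```python
import re
import unicodedata
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase, strip accents and punctuation, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", text)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    cleaned = re.sub(r"[^a-z0-9 ]+", " ", no_accents.lower())
    return " ".join(cleaned.split())


def token_jaccard(a, b):
    """Jaccard overlap between the word sets of two normalized strings."""
    set_a, set_b = set(normalize(a).split()), set(normalize(b).split())
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def char_ratio(a, b):
    """Character-level similarity in [0, 1] from difflib's SequenceMatcher."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def likely_same_title(a, b, threshold=0.85):
    """Hypothetical decision rule: either measure reaching the threshold counts."""
    return max(token_jaccard(a, b), char_ratio(a, b)) >= threshold
```

Taking the maximum of the two measures is a deliberately permissive choice: word overlap tolerates reordering, while the character ratio tolerates small typos. A stricter matcher might require both to pass, or combine title similarity with author and date fields.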
Cheers,
Min
--
Min-Yen KAN (Dr) :: Assistant Professor :: National University of
Singapore :: School of Computing, AS6 05-12, Law Link, Singapore
117590 :: 65-6516 1885(DID) :: 65-6779 4580 (Fax) ::
[log in to unmask] (E) :: www.comp.nus.edu.sg/~kanmy (W)
On Tue, Oct 21, 2008 at 8:03 AM, Naomi Dushay <[log in to unmask]> wrote:
> I've wondered if standard number matching (ISBN, LCCN, OCLC, ISSN ...)
> would be a big piece. Isn't there such a service from OCLC, and another
> flavor of something-or-other from LibraryThing?
>
> - Naomi
>
> On Oct 20, 2008, at 12:21 PM, Jonathan Rochkind wrote:
>
>> To me, "de-duplication" means throwing out some records as duplicates. Are
>> we talking about that, or are we talking about what I call "work set
>> grouping" and others (erroneously in my opinion) call "FRBRization"?
>>
>> If the latter, I don't think there is any mature open source software that
>> addresses that yet. Or for that matter, any proprietary for-purchase
>> software that you could use as a component in your own tools. Various
>> proprietary software includes a work set grouping feature in its "black
>> box" (AquaBrowser, Primo, and I believe the VTLS ILS). But I don't know of
>> anything available to do it for you in your own tool.
>>
>> I've just started to give some thought to how to accomplish this,
>> and it's a bit of a tricky problem on several grounds, including
>> computationally (doing it in a way that performs efficiently). One choice is
>> whether you group records at the indexing stage, or on-demand at the
>> retrieval stage. Both have performance implications--we really don't want to
>> slow down retrieval OR indexing. Usually, if you have the choice, you put
>> the slowdown at indexing, since in theory it only happens "once".
>> But in practice, indexing that has already been optimized and
>> does not have this feature can take hours or even days with some of our
>> corpora, and we do re-index from time to time (including
>> 'incremental' addition of new and changed records to the index), so we really
>> don't want to slow down indexing either.
>>
>> Jonathan
>>
>> Bess Sadler wrote:
>>>
>>> Hi, Mike.
>>>
>>> I don't know of any off-the-shelf software that does de-duplication of
>>> the kind you're describing, but it would be pretty useful. It would be
>>> awesome if someone wanted to build something like that into marc4j. Has
>>> anyone published any good algorithms for de-duping? As I understand it, if
>>> you have two records that are 100% identical except for holdings
>>> information, that's pretty easy. It gets harder when one record is more
>>> complete than the other, and very hard when one record has even slightly
>>> different information than the other, to tell whether they are the same
>>> record and decide whose information to privilege. Are there any good
>>> de-duping guidelines out there? When a library contracts out the de-duping
>>> of their catalog, what kind of specific guidelines are they expected to
>>> provide? Anyone know?
>>>
>>> I remember the open library folks were very interested in this question.
>>> Any open library folks on this list? Did that effort to de-dupe all those
>>> contributed marc records ever go anywhere?
>>>
>>> Bess
>>>
>>> On Oct 20, 2008, at 1:12 PM, Michael Beccaria wrote:
>>>
>>>> Very cool! I noticed that a feature, MarcDirStreamReader, is capable of
>>>> iterating over all marc record files in a given directory. Does anyone
>>>> know of any de-duplicating efforts done with marc4j? For example,
>>>> libraries that have similar holdings would have their records merged
>>>> into one record with a location tag somewhere. I know places do it
>>>> (consortia etc.) but I haven't been able to find a good open program
>>>> that handles stuff like that.
>>>>
>>>> Mike Beccaria
>>>> Systems Librarian
>>>> Head of Digital Initiatives
>>>> Paul Smith's College
>>>> 518.327.6376
>>>> [log in to unmask]
>>>>
>>>>
>> --
>> Jonathan Rochkind
>> Digital Services Software Engineer
>> The Sheridan Libraries
>> Johns Hopkins University
>> 410.516.8886 rochkind (at) jhu.edu
>
> Naomi Dushay
> [log in to unmask]
>
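On the standard number matching that Naomi raises in the quoted thread: a minimal Python sketch of using ISBNs as a match key is below. It is only an illustration under stated assumptions. The function names and the (record id, raw ISBN string) input shape are hypothetical, it does not call any OCLC or LibraryThing service, and it does not validate the check digit of the incoming number. It converts ISBN-10 to ISBN-13 so that both forms of the same number land on the same key.

```python
import re
from collections import defaultdict


def isbn13_check_digit(first12):
    """Compute the ISBN-13 check digit from the first 12 digits."""
    total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(first12))
    return str((10 - total % 10) % 10)


def normalize_isbn(raw):
    """Return a 13-digit key, or None if raw does not start with a plausible ISBN.

    Trailing qualifiers such as "(pbk.)" are ignored. Check digits of the
    input are NOT verified in this sketch.
    """
    match = re.match(r"[0-9Xx\- ]+", raw.strip())
    if match is None:
        return None
    chars = re.sub(r"[\- ]", "", match.group(0)).upper()
    if re.fullmatch(r"\d{9}[\dX]", chars):
        core = "978" + chars[:9]
        return core + isbn13_check_digit(core)
    if re.fullmatch(r"\d{13}", chars):
        return chars
    return None


def group_by_isbn(records):
    """Group record ids sharing a normalized ISBN; keep groups of two or more."""
    groups = defaultdict(list)
    for record_id, raw_isbn in records:
        key = normalize_isbn(raw_isbn)
        if key is not None:
            groups[key].append(record_id)
    return {key: ids for key, ids in groups.items() if len(ids) > 1}
```

A shared ISBN only makes two records candidates for merging. Publishers reuse and misprint ISBNs, so a real workflow would probably confirm candidates with title or author comparison before treating them as duplicates.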