Hi all,

My student, Yee Fan Tan, and I published a short technical column on record linkage tasks (very similar to the de-dup task discussed here) in the February issue of the Communications of the ACM:

Min-Yen Kan and Yee Fan Tan (2008). "Record matching in digital library metadata". Communications of the ACM, Technical Opinion column, pp. 91-94, February.
http://doi.acm.org/10.1145/1314215.1314231

We're in the process of releasing a tool/demo for de-dup tasks as a Java library (jar). If there's sufficient interest, we might try to tailor some of our string similarity metrics to MARC or other catalog data.

Cheers,
Min

--
Min-Yen KAN (Dr) :: Assistant Professor :: National University of Singapore
School of Computing, AS6 05-12, Law Link, Singapore 117590
65-6516 1885 (DID) :: 65-6779 4580 (Fax)
[log in to unmask] (E) :: www.comp.nus.edu.sg/~kanmy (W)

On Tue, Oct 21, 2008 at 8:03 AM, Naomi Dushay <[log in to unmask]> wrote:

> I've wondered if standard number matching (ISBN, LCCN, OCLC, ISSN ...)
> would be a big piece. Isn't there such a service from OCLC, and another
> flavor of something-or-other from LibraryThing?
>
> - Naomi
>
> On Oct 20, 2008, at 12:21 PM, Jonathan Rochkind wrote:
>
>> To me, "de-duplication" means throwing out some records as duplicates.
>> Are we talking about that, or are we talking about what I call "work set
>> grouping" and others (erroneously, in my opinion) call "FRBRization"?
>>
>> If the latter, I don't think there is any mature open source software
>> that addresses that yet -- or, for that matter, any proprietary
>> for-purchase software that you could use as a component in your own tools.
>> Various proprietary products include a work set grouping feature in their
>> "black box" (AquaBrowser, Primo, and I believe the VTLS ILS). But I don't
>> know of anything available to do it for you in your own tool.
>>
>> I've just started to give some thought to how to accomplish this, and
>> it's a bit of a tricky problem on several grounds, including
>> computationally (doing it in a way that performs efficiently). One choice
>> is whether you group records at the indexing stage or on demand at the
>> retrieval stage. Both have performance implications -- we really don't
>> want to slow down retrieval OR indexing. Usually, if you have the choice,
>> you put the slowdown at indexing, since in abstract theory it only
>> happens "once". But in fact, indexing that's already been optimized and
>> does not have this feature can take hours or even days with some of our
>> corpora, and we do re-index from time to time (including 'incremental'
>> addition of new and changed records to the index) -- so we really don't
>> want to slow down indexing either.
>>
>> Jonathan
>>
>> Bess Sadler wrote:
>>>
>>> Hi, Mike.
>>>
>>> I don't know of any off-the-shelf software that does de-duplication of
>>> the kind you're describing, but it would be pretty useful. It would be
>>> awesome if someone wanted to build something like that into marc4j. Has
>>> anyone published any good algorithms for de-duping? As I understand it,
>>> if you have two records that are 100% identical except for holdings
>>> information, that's pretty easy. It gets harder when one record is more
>>> complete than the other, and very hard when the records differ even
>>> slightly: you have to tell whether they describe the same item, and
>>> decide whose information to privilege. Are there any good de-duping
>>> guidelines out there?
>>> When a library contracts out the de-duping of its catalog, what kind of
>>> specific guidelines is it expected to provide? Anyone know?
>>>
>>> I remember the Open Library folks were very interested in this question.
>>> Any Open Library folks on this list? Did that effort to de-dupe all
>>> those contributed MARC records ever go anywhere?
>>>
>>> Bess
>>>
>>> On Oct 20, 2008, at 1:12 PM, Michael Beccaria wrote:
>>>
>>>> Very cool! I noticed that a feature, MarcDirStreamReader, is capable
>>>> of iterating over all MARC record files in a given directory. Does
>>>> anyone know of any de-duplicating efforts done with marc4j? For
>>>> example, libraries that have similar holdings would have their records
>>>> merged into one record with a location tag somewhere. I know places do
>>>> it (consortia, etc.), but I haven't been able to find a good open
>>>> program that handles stuff like that.
>>>>
>>>> Mike Beccaria
>>>> Systems Librarian
>>>> Head of Digital Initiatives
>>>> Paul Smith's College
>>>> 518.327.6376
>>>> [log in to unmask]
>>
>> --
>> Jonathan Rochkind
>> Digital Services Software Engineer
>> The Sheridan Libraries
>> Johns Hopkins University
>> 410.516.8886 rochkind (at) jhu.edu
>
> Naomi Dushay
> [log in to unmask]
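[Editor's note on Naomi's standard-number point: a large share of duplicate pairs can be caught just by normalizing identifiers before comparing them, since the same ISBN may appear hyphenated, as ISBN-10 in one record and ISBN-13 in another, or with trailing text like "(pbk.)". A minimal sketch in plain Java, with no marc4j dependency; the class and method names are illustrative, not from any released tool:]

```java
// Sketch: normalize ISBNs to a canonical ISBN-13 string so that
// hyphenated, ISBN-10, and ISBN-13 variants of the same number match.
public class IsbnKey {

    /** Strip everything except digits and a possible check character X. */
    static String clean(String raw) {
        return raw.toUpperCase().replaceAll("[^0-9X]", "");
    }

    /** Canonical ISBN-13 form, or null if the input is not an ISBN. */
    public static String normalize(String raw) {
        String s = clean(raw);
        if (s.length() == 13) return s;
        if (s.length() != 10) return null;
        // Convert ISBN-10 to ISBN-13: prefix "978" to the first nine
        // digits, then recompute the check digit (weights 1,3,1,3,...).
        String core = "978" + s.substring(0, 9);
        int sum = 0;
        for (int i = 0; i < 12; i++) {
            int digit = core.charAt(i) - '0';
            sum += (i % 2 == 0) ? digit : 3 * digit;
        }
        return core + (10 - sum % 10) % 10;
    }
}
```

[With keys like this, records can be matched by a simple hash join on the normalized identifier instead of pairwise string comparison; fuzzier metrics are only needed for the records left unmatched.]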
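[Editor's note on the "slightly different information" case Bess describes: this is where string similarity metrics of the kind discussed in the Kan and Tan column come in. One common, simple choice is Jaccard similarity over title tokens. The sketch below is illustrative only and is not the metric from their tool:]

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Sketch: Jaccard similarity over normalized title tokens, a simple
// string metric often used as one signal in record-matching pipelines.
public class TitleSimilarity {

    /** Lowercase, replace punctuation with spaces, split on whitespace. */
    static Set<String> tokens(String title) {
        String norm = title.toLowerCase().replaceAll("[^a-z0-9 ]", " ");
        Set<String> set = new HashSet<>(Arrays.asList(norm.trim().split("\\s+")));
        set.remove("");
        return set;
    }

    /** Intersection size over union size, in [0, 1]. */
    public static double jaccard(String a, String b) {
        Set<String> ta = tokens(a), tb = tokens(b);
        Set<String> union = new HashSet<>(ta);
        union.addAll(tb);
        if (union.isEmpty()) return 0.0;
        Set<String> inter = new HashSet<>(ta);
        inter.retainAll(tb);
        return (double) inter.size() / union.size();
    }
}
```

[In practice such a score would be combined with agreement on author and date fields and thresholded at a value tuned on labeled duplicate pairs; the threshold is where contracted de-duping guidelines of the kind Bess asks about would matter.]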
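[Editor's note on Jonathan's indexing-versus-retrieval choice: one cheap way to group at indexing time without slowing indexing much is to compute a deterministic "work key" per record (for example, normalized title plus primary author) and store it as an indexed field; grouping then costs one string normalization per record at index time plus a collapse on the key at retrieval. A sketch under those assumptions; the key recipe is hypothetical, not a published algorithm:]

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: bucket records into candidate "work sets" by a cheap
// deterministic key computed once per record at indexing time.
public class WorkKeyGrouper {

    /** Illustrative key: normalized title + "/" + normalized author. */
    public static String workKey(String title, String author) {
        return squash(title) + "/" + squash(author);
    }

    static String squash(String s) {
        // Drop punctuation and spaces so minor transcription
        // differences ("Hamlet." vs "Hamlet") map to the same key.
        return s.toLowerCase().replaceAll("[^a-z0-9]", "");
    }

    /** Group record ids by work key; each record is {id, title, author}. */
    public static Map<String, List<String>> group(List<String[]> records) {
        Map<String, List<String>> sets = new HashMap<>();
        for (String[] r : records) {
            String key = workKey(r[1], r[2]);
            sets.computeIfAbsent(key, k -> new ArrayList<>()).add(r[0]);
        }
        return sets;
    }
}
```

[An exact key trades recall for speed: it misses variant titles that a fuzzy metric would catch, but it adds only constant work per record, which matters when re-indexing already takes hours or days.]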