I've wondered if standard number matching (ISBN, LCCN, OCLC, ISSN ...) would be a big piece. Isn't there such a service from OCLC, and another flavor of something-or-other from LibraryThing?

- Naomi

On Oct 20, 2008, at 12:21 PM, Jonathan Rochkind wrote:

> To me, "de-duplication" means throwing out some records as duplicates. Are we talking about that, or are we talking about what I call "work set grouping" and others (erroneously, in my opinion) call "FRBRization"?
>
> If the latter, I don't think there is any mature open source software that addresses that yet. Or, for that matter, any proprietary for-purchase software that you could use as a component in your own tools. Various proprietary products include a work set grouping feature in their "black box" (AquaBrowser, Primo, and I believe the VTLS ILS), but I don't know of anything available to do it for you in your own tool.
>
> I've just started to give some thought to how to accomplish this, and it's a bit of a tricky problem on several grounds, including computationally (doing it in a way that performs efficiently). One choice is whether you group records at the indexing stage or on demand at the retrieval stage. Both have performance implications--we really don't want to slow down retrieval OR indexing. Usually, if you have the choice, you put the slowdown at indexing, since it only happens "once" in abstract theory. But in fact, with what we do, indexing that's already been optimized and does not have this feature can take hours or even days with some of our corpuses, and we do re-index from time to time (including 'incremental' addition of new and changed records to the index)--so we really don't want to slow down indexing either.
>
> Jonathan
>
> Bess Sadler wrote:
>> Hi, Mike.
>>
>> I don't know of any off-the-shelf software that does de-duplication of the kind you're describing, but it would be pretty useful. It would be awesome if someone wanted to build something like that into marc4j. Has anyone published any good algorithms for de-duping? As I understand it, if you have two records that are 100% identical except for holdings information, that's pretty easy. It gets harder when one record is more complete than the other, and very hard, when one record has even slightly different information than the other, to tell whether they really are the same record and to decide whose information to privilege. Are there any good de-duping guidelines out there? When a library contracts out the de-duping of their catalog, what kind of specific guidelines are they expected to provide? Anyone know?
>>
>> I remember the Open Library folks were very interested in this question. Any Open Library folks on this list? Did that effort to de-dupe all those contributed MARC records ever go anywhere?
>>
>> Bess
>>
>> On Oct 20, 2008, at 1:12 PM, Michael Beccaria wrote:
>>
>>> Very cool! I noticed that a feature, MarcDirStreamReader, is capable of iterating over all MARC record files in a given directory. Does anyone know of any de-duplicating efforts done with marc4j? For example, libraries that have similar holdings would have their records merged into one record with a location tag somewhere. I know places do it (consortia, etc.) but I haven't been able to find a good open program that handles stuff like that.
>>>
>>> Mike Beccaria
>>> Systems Librarian
>>> Head of Digital Initiatives
>>> Paul Smith's College
>>> 518.327.6376
>>> [log in to unmask]
>>>
>
> --
> Jonathan Rochkind
> Digital Services Software Engineer
> The Sheridan Libraries
> Johns Hopkins University
> 410.516.8886 rochkind (at) jhu.edu

Naomi Dushay
[log in to unmask]
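
For what it's worth, a minimal sketch of the standard-number matching idea with marc4j might look like the code below. It is only an illustration, not anyone's actual de-dup service: the class and helper names (StandardNumberDedup, matchKey, mergeHoldings) are made up, it assumes a reasonably current marc4j (MarcStreamReader, MarcStreamWriter), and the matching policy -- key on an 035 OCLC number, fall back to a normalized 020 ISBN, copy 852 location fields from duplicates onto the record kept -- is one deliberately simplistic choice. It does nothing about the harder cases Bess raises (records that are almost but not quite identical, or deciding whose data to privilege).

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.util.LinkedHashMap;
    import java.util.Map;

    import org.marc4j.MarcStreamReader;
    import org.marc4j.MarcStreamWriter;
    import org.marc4j.marc.DataField;
    import org.marc4j.marc.Record;
    import org.marc4j.marc.Subfield;
    import org.marc4j.marc.VariableField;

    /*
     * Hypothetical sketch: read every .mrc file in a directory, group records
     * by a normalized standard-number key, keep the first record seen for each
     * key, and copy 852 holdings/location fields from later duplicates onto it.
     */
    public class StandardNumberDedup {

        public static void main(String[] args) throws IOException {
            File dir = new File(args[0]);
            Map<String, Record> byKey = new LinkedHashMap<String, Record>();
            int noKeyCount = 0;

            for (File f : dir.listFiles((d, name) -> name.endsWith(".mrc"))) {
                InputStream in = new FileInputStream(f);
                MarcStreamReader reader = new MarcStreamReader(in);
                while (reader.hasNext()) {
                    Record rec = reader.next();
                    String key = matchKey(rec);
                    if (key == null) {
                        // no usable standard number: pass the record through untouched
                        byKey.put("nokey:" + (noKeyCount++), rec);
                    } else if (byKey.containsKey(key)) {
                        mergeHoldings(byKey.get(key), rec);
                    } else {
                        byKey.put(key, rec);
                    }
                }
                in.close();
            }

            // write the merged set back out as binary MARC
            MarcStreamWriter writer = new MarcStreamWriter(System.out);
            for (Record rec : byKey.values()) {
                writer.write(rec);
            }
            writer.close();
        }

        /* Prefer an OCLC number from 035 $a; otherwise normalize the first plausible 020 $a ISBN. */
        static String matchKey(Record rec) {
            for (Object o : rec.getVariableFields("035")) {
                Subfield sf = ((DataField) o).getSubfield('a');
                if (sf != null && sf.getData().contains("OCoLC")) {
                    return "oclc:" + sf.getData().replaceAll("[^0-9]", "");
                }
            }
            for (Object o : rec.getVariableFields("020")) {
                Subfield sf = ((DataField) o).getSubfield('a');
                if (sf != null) {
                    // strip hyphens and qualifiers like "(pbk.)"
                    String isbn = sf.getData().replaceAll("[^0-9Xx]", "");
                    if (isbn.length() == 10 || isbn.length() == 13) {
                        return "isbn:" + isbn;
                    }
                }
            }
            return null;
        }

        /* Copy 852 holdings/location fields from the duplicate onto the record we keep. */
        static void mergeHoldings(Record keep, Record dup) {
            for (Object o : dup.getVariableFields("852")) {
                keep.addVariableField((VariableField) o);
            }
        }
    }

Something along these lines would handle the easy consortial case Mike describes (same title held at several libraries, merged into one record with multiple locations), but the match key is the whole game: an ISBN or OCLC number match is cheap and precise, while anything fuzzier quickly runs into the record-comparison problems already mentioned in this thread.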