I'm sure there are plenty of us (say, me for example) who would love to
know what algorithms people are using to de-dupe MARC records (using
"duplicate" to mean "completely different record describing the exact same
resource"). Anyone have something that works well enough for them?

On Fri, Mar 30, 2012 at 4:23 AM, Michael Hopwood <[log in to unmask]> wrote:

> Hi Peter, Graham,
>
> I'm interested! Myself and a few friends and colleagues in the UK here are
> looking at this and similar questions as part of an ongoing interest in
> linked library data, and the schema/data model issues that underlie how
> useful it could be.
>
> What we have here is certainly related to the data model (mostly implicit)
> behind MaRC records, which is an interesting blend of expression,
> manifestation and maybe item-level data.
>
> Options for merging, linking and otherwise combining various records are of
> interest to us both "theoretically" and practically i.e. technically, so we
> would be pleased to contribute to and benefit from the conversation on this.
>
> Best wishes,
>
> Michael Hopwood
>
> -----Original Message-----
> From: Code for Libraries [mailto:[log in to unmask]] On Behalf Of Peter Noerr
> Sent: 30 March 2012 01:09
> To: [log in to unmask]
> Subject: Re: [CODE4LIB] presenting merged records?
>
> Hi Graham,
>
> What we do in our federated search system, and have been doing for some
> few years, is basically give the "designer" a choice of what options the
> user gets for "de-duped" records.
>
> Firstly de-duping can be of a number of levels of sophistication, and
> many of them lead to the situation you have - records which are "similar"
> rather than identical. On the web search side of things there are a
> surprising number of real duplicates (well maybe not surprising if you
> study more than one page of web search engine results), and on Twitter the
> duplicates well outnumber the original posts (many thanks 're-tweet').
>
> Where we get duplicate records the usual options are: 1) keep the first
> and just drop all the rest. 2) keep the largest (assumed to have the most
> information) and drop the rest. These work well for WSE results where they
> are all almost identical (the differences often are just in the advertising
> attached to the pages and the results), but not for bibliographic records.
>
> Less draconian is 3) Mark all the duplicates and keep them in the list (so
> you get 1, 2, 3, 4, 5, 5.1, 5.2, 5.3, 6, ...). This groups all the similar
> records together under the sort key of the first one, and does enable the
> user to easily skip them.
>
> More user friendly is 4) Mark all duplicates and hide them in a sub-list
> attached to the "head" record. This gets them out of the main display, but
> allows the user who is interested in that "record" to expand the list and
> see the variants. This could be of use to you.
>
> After that we planned to do what you are proposing and actually merge
> record content into a single virtual record, and worked on algorithms to do
> it. But nobody was interested. All our partners (who provide systems to
> lots of libraries, both public, academic, and special) decided that it
> would confuse their users more than it would help. I have my doubts, but
> they spoke and we put the development on ice.
>
> I'm not sure this will help, but it has stood the test of time, and is
> well used in its various guises. Since no-one else seems interested in this
> topic, you could email me off list and we could discuss what we worked
> through in the way of algorithms, etc.
>
> Peter
>
>
> > -----Original Message-----
> > From: Code for Libraries [mailto:[log in to unmask]] On Behalf Of graham
> > Sent: Wednesday, March 28, 2012 8:05 AM
> > To: [log in to unmask]
> > Subject: Re: [CODE4LIB] presenting merged records?
> >
> > Hi Michael
> >
> > On 03/27/12 11:50, Michael Hopwood wrote:
> > > Hi Graham, do I know you from RHUL?
> > >
> > Yes indeed :-)
> >
> > > My thoughts on "merged records" would be:
> > >
> > > 1. don't do it - use separate IDs and just present links between
> > > related manifestations; thus avoiding potential confusions.
> >
> > In my case, I can't avoid it as it's a specific requirement: I'm doing
> > a federated search across a large number of libraries, and if closely
> > similar items aren't merged, the results become excessively large and
> > repetitive. I'm merging all the similar items, displaying a summary of
> > the merged bibliographic data, and providing links to each of the
> > libraries with a copy. So it's not really FRBRization in the normal
> > sense, I just thought that FRBRization would lead to similar problems,
> > so that there might be some well-known discussion of the issues around...
> > The merger of the records does have advantages, especially if some
> > libraries have very underpopulated records (eg subject fields).
> >
> > Cheers
> > Graham
> >
> > >
> > > http://www.bic.org.uk/files/pdfs/identification-digibook.pdf
> > >
> > > possible relationships - see
> > > http://www.editeur.org/ONIX/book/codelists/current.html - lists 51
> > > (manifestation) and 164 (work).
> > >
> > > 2. c.f. the way Amazon displays rough and ready categories
> > > (paperback, hardback, audiobooks, *ahem* ebooks of some sort...)
> > >
> > > On dissection and reconstitution of records - there is a lot of talk
> > > going on about RDFizing MaRC records and re-using in various ways, e.g.:
> > >
> > > http://www.slideshare.net/JenniferBowen/moving-library-metadata-toward-linked-data-opportunities-provided-by-the-extensible-catalog
> > >
> > > Cheers,
> > >
> > > Michael
> > >
> > > -----Original Message-----
> > > From: Code for Libraries [mailto:[log in to unmask]] On Behalf Of graham
> > > Sent: 27 March 2012 11:06
> > > To: [log in to unmask]
> > > Subject: [CODE4LIB] presenting merged records?
> > >
> > > Hi
> > >
> > > There seems to be a general trend to presenting merged records to
> > > users, as part of the move towards FRBRization. If records need
> > > merging this generally means they weren't totally identical to start
> > > with, so you can end up with conflicting bibliographic data to display.
> > >
> > > Two examples I've come across with this: Summon can merge
> > > print/electronic versions of texts, so uses a new 'merged' material
> > > type of 'book/ebook' (it doesn't yet seem to have all the other
> > > possible permutations, eg book/audiobook). Pazpar2 (which I'm
> > > working with at the moment) has a merge option for publication dates
> > > which presents dates as a period eg 1997-2002.
> > >
> > > The problem is not with the underlying data (the original unmerged
> > > values can still be there in the background) but how to present them
> > > to the user in an intuitive way. With the date example, presenting
> > > dates in this format sometimes throws people as it looks too much like
> > > the author birth/death dates you might see with a record.
> > >
> > > I guess people must generally be starting to run into this kind of
> > > display problem, so it has maybe been discussed to death on ...
> > > wherever it is people talk about FRBRization. Any suggestions? Any
> > > mailing lists, blogs etc anyone can recommend for me to look at?
> > >
> > > Thanks for any ideas
> > > Graham
>

-- 
Bill Dueber
Library Systems Programmer
University of Michigan Library
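
To make the options in the quoted thread concrete, here is a minimal Python sketch of Peter's options 1), 2) and 4), plus the publication-date range graham describes. Everything in it is an assumption for illustration only: records are plain dicts rather than real MARC, "largest" is guessed to mean the most field text, and match_key, group_duplicates, keep_first, keep_largest, head_with_variants and date_range are hypothetical names, not anyone's actual system or API. The match key (normalized title plus author) is deliberately naive and would need far more care for real bibliographic data.

```python
import re
from collections import OrderedDict


def _norm(text):
    """Lowercase and collapse punctuation/whitespace so near-identical strings match."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def match_key(record):
    """Hypothetical match key: normalized title plus normalized author."""
    return (_norm(record.get("title", "")), _norm(record.get("author", "")))


def group_duplicates(records):
    """Group records that share a match key, preserving first-seen order."""
    groups = OrderedDict()
    for rec in records:
        groups.setdefault(match_key(rec), []).append(rec)
    return list(groups.values())


def _size(record):
    """Assumed meaning of 'largest': total length of all field values."""
    return sum(len(str(value)) for value in record.values())


def keep_first(groups):
    """Option 1: keep the first record of each group, drop the rest."""
    return [group[0] for group in groups]


def keep_largest(groups):
    """Option 2: keep the record with the most data in each group."""
    return [max(group, key=_size) for group in groups]


def head_with_variants(groups):
    """Option 4: show a head record, with the other duplicates in a sub-list."""
    return [{"head": group[0], "variants": group[1:]} for group in groups]


def date_range(group):
    """Present the publication years of a group as a single year or a period."""
    years = sorted({rec["year"] for rec in group if rec.get("year")})
    if not years:
        return ""
    if len(years) == 1:
        return str(years[0])
    return f"{years[0]}-{years[-1]}"
```

Calling group_duplicates on a result list and then passing the groups to one of the option functions gives the corresponding display list; date_range on a single group yields a string such as "1997-2002", which is exactly the format graham notes can be mistaken for author birth/death dates.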