I'm sure there are plenty of us (say, me for example) who would love to
know what algorithms people are using to de-dupe MARC records (using
"duplicate" to mean "a completely different record describing the exact
same resource"). Does anyone have something that works well enough for them?

On Fri, Mar 30, 2012 at 4:23 AM, Michael Hopwood <[log in to unmask]> wrote:

> Hi Peter, Graham,
>
> I'm interested! Myself and a few friends and colleagues in the UK here are
> looking at this and similar questions as part of an ongoing interest in
> linked library data, and the schema/data model issues that underlie how
> useful it could be.
>
> What we have here is certainly related to the data model (mostly implicit)
> behind MaRC records, which is an interesting blend of expression,
> manifestation and maybe item-level data.
>
> Options for merging, linking and otherwise combining various records are
> of interest to us both "theoretically" and practically, i.e. technically,
> so we would be pleased to contribute to and benefit from the conversation
> on this.
>
> Best wishes,
>
> Michael Hopwood
>
> -----Original Message-----
> From: Code for Libraries [mailto:[log in to unmask]] On Behalf Of
> Peter Noerr
> Sent: 30 March 2012 01:09
> To: [log in to unmask]
> Subject: Re: [CODE4LIB] presenting merged records?
>
> Hi Graham,
>
> What we do in our federated search system, and have been doing for some
> few years, is basically give the "designer" a choice of what options the
> user gets for "de-duped" records.
>
> Firstly, de-duping can be done at a number of levels of sophistication,
> and many of them lead to the situation you have: records which are
> "similar" rather than identical. On the web search side of things there
> are a surprising number of real duplicates (well, maybe not surprising if
> you study more than one page of web search engine results), and on Twitter
> the duplicates well outnumber the original posts (many thanks,
> 're-tweet').
>
> Where we get duplicate records the usual options are: 1) keep the first
> and just drop all the rest. 2) keep the largest (assumed to have the most
> information) and drop the rest. These work well for web search engine
> (WSE) results, where they are all almost identical (the differences often
> are just in the advertising attached to the pages and the results), but
> not for bibliographic records.
>
> Less draconian is 3) Mark all the duplicates and keep them in the list (so
> you get 1, 2, 3, 4, 5, 5.1, 5.2, 5.3, 6, ...). This groups all the similar
> records together under the sort key of the first one, and does enable the
> user to easily skip them.
>
> More user friendly is 4) Mark all duplicates and hide them in a sub-list
> attached to the "head" record. This gets them out of the main display, but
> allows the user who is interested in that "record" to expand the list and
> see the variants. This could be of use to you.
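
For anyone who wants something concrete to poke at, options 1 to 4 above
could be sketched roughly as below. This is only an illustration, not
Peter's implementation: the records are plain dicts rather than MARC, and
the match key (normalized title plus year), the "largest" measure and all
function names are assumptions made up for the example.

```python
from collections import OrderedDict


def match_key(record):
    # Hypothetical match key: whitespace- and case-normalized title plus year.
    title = " ".join(str(record.get("title", "")).lower().split())
    return (title, record.get("year"))


def record_size(record):
    # Crude stand-in for "largest": total length of all field values.
    return sum(len(str(value)) for value in record.values())


def group_duplicates(records):
    # Group records by match key, keeping groups in first-seen order.
    groups = OrderedDict()
    for record in records:
        groups.setdefault(match_key(record), []).append(record)
    return list(groups.values())


def keep_first(records):
    # Option 1: keep the first record of each group, drop the rest.
    return [group[0] for group in group_duplicates(records)]


def keep_largest(records):
    # Option 2: keep the largest record of each group, drop the rest.
    return [max(group, key=record_size) for group in group_duplicates(records)]


def numbered_list(records):
    # Option 3: keep everything, labelled 1, 2, 2.1, 2.2, 3, ...
    labelled = []
    for i, group in enumerate(group_duplicates(records), start=1):
        labelled.append((str(i), group[0]))
        for j, duplicate in enumerate(group[1:], start=1):
            labelled.append((f"{i}.{j}", duplicate))
    return labelled


def head_with_variants(records):
    # Option 4: one "head" record per group, duplicates in a sub-list.
    return [{"head": group[0], "variants": group[1:]}
            for group in group_duplicates(records)]
```

The hard part, of course, is the match key. An exact-match key like this
one will miss most of the "similar but not identical" bibliographic
records the thread is about.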
>
> After that we planned to do what you are proposing and actually merge
> record content into a single virtual record, and worked on algorithms to do
> it. But nobody was interested. All our partners (who provide systems to
> lots of libraries, both public, academic, and special) decided that it
> would confuse their users more than it would help. I have my doubts, but
> they spoke and we put the development on ice.
>
> I'm not sure this will help, but it has stood the test of time, and is
> well used in its various guises. Since no-one else seems interested in this
> topic, you could email me off list and we could discuss what we worked
> through in the way of algorithms, etc.
>
> Peter
>
>
> > -----Original Message-----
> > From: Code for Libraries [mailto:[log in to unmask]] On Behalf
> > Of graham
> > Sent: Wednesday, March 28, 2012 8:05 AM
> > To: [log in to unmask]
> > Subject: Re: [CODE4LIB] presenting merged records?
> >
> > Hi Michael
> >
> > On 03/27/12 11:50, Michael Hopwood wrote:
> > > Hi Graham, do I know you from RHUL?
> > >
> > Yes indeed :-)
> >
> > > My thoughts on "merged records" would be:
> > >
> > > 1. don't do it - use separate IDs and just present links between
> > > related manifestations; thus avoiding potential confusions.
> >
> > In my case, I can't avoid it as it's a specific requirement: I'm doing
> > a federated search across a large number of libraries, and if closely
> > similar items aren't merged, the results become excessively large and
> > repetitive. I'm merging all the similar items, displaying a summary of
> > the merged bibliographic data, and providing links to each of the
> > libraries with a copy.  So it's not really FRBRization in the normal
> > sense, I just thought that FRBRization would lead to similar problems,
> > so that there might be some well-known discussion of the issues around...
> > The merger of the records does have advantages, especially if some
> > libraries have very underpopulated records (eg subject fields).
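
A merged summary of the kind Graham describes, with sparse records filled
in from fuller ones and a list of holding libraries, might be built along
these lines. This is a sketch under assumptions: the records are plain
dicts, the field names ("library", "subject", "holdings") are hypothetical,
and the union-of-values policy is just one possible choice.

```python
def merge_group(records, holding_field="library"):
    # Union the values of every field across a group of similar records,
    # and collect the holding libraries separately for linking.
    merged = {}
    holdings = []
    for record in records:
        for field, value in record.items():
            if field == holding_field:
                if value not in holdings:
                    holdings.append(value)
                continue
            values = value if isinstance(value, list) else [value]
            bucket = merged.setdefault(field, [])
            for item in values:
                if item not in bucket:
                    bucket.append(item)
    merged["holdings"] = holdings
    return merged
```

Fields that end up with more than one value are exactly the conflicting
data the original question is about: the merge itself is easy, and
deciding how to display the result is the open problem.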
> >
> > Cheers
> > Graham
> >
> > >
> > > http://www.bic.org.uk/files/pdfs/identification-digibook.pdf
> > >
> > > possible relationships - see
> > > http://www.editeur.org/ONIX/book/codelists/current.html - lists 51
> > > (manifestation) and 164 (work).
> > >
> > > 2. c.f. the way Amazon displays rough and ready categories
> > > (paperback, hardback, audiobooks, *ahem* ebooks of some sort...)
> > >
> > > On dissection and reconstitution of records - there is a lot of talk
> > > going on about RDFizing MaRC records and re-using in various ways, e.g.:
> > >
> > > http://www.slideshare.net/JenniferBowen/moving-library-metadata-toward-linked-data-opportunities-provided-by-the-extensible-catalog
> > >
> > > Cheers,
> > >
> > > Michael
> > >
> > > -----Original Message-----
> > > From: Code for Libraries [mailto:[log in to unmask]] On Behalf
> > > Of graham
> > > Sent: 27 March 2012 11:06
> > > To: [log in to unmask]
> > > Subject: [CODE4LIB] presenting merged records?
> > >
> > > Hi
> > >
> > > There seems to be a general trend to presenting merged records to
> > > users, as part of the move towards FRBRization. If records need
> > > merging this generally means they weren't totally identical to start
> > > with, so you can end up with conflicting bibliographic data to display.
> > >
> > > Two examples I've come across with this: Summon can merge
> > > print/electronic versions of texts, so uses a new 'merged' material
> > > type of 'book/ebook' (it doesn't yet seem to have all the other
> > > possible permutations, eg book/audiobook). Pazpar2 (which I'm
> > > working with at the moment) has a merge option for publication dates
> > > which presents dates as a period eg 1997-2002.
> > >
> > > The problem is not with the underlying data (the original unmerged
> > > values can still be there in the background) but how to present them
> > > to the user in an intuitive way. With the date example, presenting
> > > dates in this format sometimes throws people as it looks too much like
> > > the author birth/death dates you might see with a record.
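
To make the date example concrete: the range display, and one alternative
that lists the distinct years so it cannot be mistaken for birth/death
dates, might look like this. This is not Pazpar2's code; both functions and
their names are hypothetical, and the list form is only one possible
answer to the display question.

```python
def merge_years_as_range(years):
    # Collapse the known publication years into "earliest-latest".
    known = sorted({year for year in years if year is not None})
    if not known:
        return ""
    if len(known) == 1:
        return str(known[0])
    return f"{known[0]}-{known[-1]}"


def merge_years_as_list(years):
    # Alternative display: list each distinct year explicitly.
    known = sorted({year for year in years if year is not None})
    return ", ".join(str(year) for year in known)
```

The list form gets long when many editions are merged, so a real display
would probably need a cut-off beyond which it falls back to the range.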
> > >
> > > I guess people must generally be starting to run into this kind of
> > > display problem, so it has maybe been discussed to death on ...
> > > wherever it is people talk about FRBRization. Any suggestions? Any
> > > mailing lists, blogs etc anyone can recommend for me to look at?
> > >
> > > Thanks for any ideas
> > > Graham
>



-- 
Bill Dueber
Library Systems Programmer
University of Michigan Library