The question of intentional addition of promotional, misleading, or
low-value information to catalog records is an interesting one.
It makes me wonder how often DOIs or shortened URLs resolve to paywalls or,
even worse, affiliate links.
On Thu, Sep 19, 2019, 11:30 McDonald, Stephen <[log in to unmask]> wrote:
> I think you might be able to create an analysis to gauge whether a summary is
> likely to be neutral. But I would resist labeling a summary source as
> "publisher" or "librarian" without actual evidence of the summary author.
> As an aside, I will note that RDA Beta gives the ability to provide a
> source for any piece of metadata within a record. But of course, MARC is
> not quite as flexible.
> Steve McDonald
> [log in to unmask]
> -----Original Message-----
> From: Code for Libraries <[log in to unmask]> On Behalf Of Tim
> Sent: Thursday, September 19, 2019 12:49 PM
> To: [log in to unmask]
> Subject: [CODE4LIB] Identifying description sources across a large corpus
> of MARC records
> [I also put this on AUTOCAT. Apologies if you also follow that. This falls
> at the intersection of hand-cataloging, data processing and simple AI.]
> I wonder if anyone has thoughts on the best way to identify the source of
> summary/description data (520s) across a large corpus of MARC records?
> My primary goal is to distinguish between more neutral, librarian-written
> summaries, and the more promotional summaries derived from publisher
> sources, whether typed in from flap copy or produced by ONIX-MARC
> conversion. I can see a number of uses for this distinction; one is that
> members of LibraryThing much prefer short, neutral descriptions, and abhor
> the lengthy purple prose of many publisher descriptions.
> So far I have a few ideas, but I'd love your thoughts on more:
> The 520 $c (and $u, $2) ought to have source information. But it's rarely
> filled out. Are there any other "tells"?
> Some catalogers put "-- Publisher" or "-- Publisher description" at the
> end of a 520 that comes from a publisher. I'd be interested in hearing
> about other conventions you use or know.
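For the trailing-note convention, a minimal sketch in Python might look like the following. The pattern list and the function name `split_trailing_source` are my own assumptions, not an established convention, and the regex would need extending as other local practices turn up.

```python
import re

# Hypothetical pattern for a trailing source note such as "-- Publisher"
# or "-- Publisher description". Extend it as new conventions are found.
TRAILING_SOURCE = re.compile(
    r"\s*(?:--|[\u2013\u2014])\s*(?:provided by (?:the )?)?"
    r"(publisher(?:'s)?(?:\s+(?:description|summary))?)\s*\.?\s*$",
    re.IGNORECASE,
)

def split_trailing_source(summary):
    """Return (text, source_note); source_note is None if no note is found."""
    match = TRAILING_SOURCE.search(summary)
    if match is None:
        return summary.strip(), None
    # Everything before the note is the summary proper.
    return summary[:match.start()].strip(), match.group(1).lower()

if __name__ == "__main__":
    print(split_trailing_source("A history of Rome. -- Publisher description"))
    print(split_trailing_source("A quiet novel."))
```

This only looks at the text of the 520 $a. Records that do fill in $c could simply be trusted ahead of any pattern match.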
> I could compare the MARC descriptions I have with similar data from
> Ingram, Amazon and Bowker, which (mostly) come from publishers. If they
> match, it's probably publisher provided. (All this ignores summaries that
> come from non-library, non-publisher sources.)
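The comparison against vendor data probably needs to be fuzzy rather than exact, since descriptions get truncated or lightly edited along the way. One way to sketch it, assuming word-trigram Jaccard similarity is good enough, is below. The function names and the 0.6 threshold are illustrative guesses, not tested values.

```python
import re

def shingles(text, size=3):
    """Return the set of overlapping word n-grams in a lowercased text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if len(words) < size:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}

def jaccard(a, b):
    """Jaccard similarity of the two texts' shingle sets (0.0 to 1.0)."""
    sa, sb = shingles(a), shingles(b)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)

def looks_publisher_derived(marc_summary, vendor_descriptions, threshold=0.6):
    """True if the summary closely matches any vendor-supplied description."""
    return any(jaccard(marc_summary, v) >= threshold for v in vendor_descriptions)
```

For a large corpus, pairwise comparison would be too slow, so something like MinHash would likely be needed to find candidate pairs first. This sketch only shows the scoring step.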
> 100% accuracy will no doubt elude me, but if I can identify a large set of
> both publisher and non-publisher, I can perhaps use them as a training set
> for a Bayesian filter. It's probable that certain words mark something out
> as publisher-derived: "much-anticipated," "bestselling," "seminal," etc.
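A Bayesian filter of the kind described could be sketched as a small multinomial naive Bayes classifier with add-one smoothing. This is a toy under stated assumptions: the class name, the tokenizer, and the labels "publisher" and "librarian" are all hypothetical, and a real run would want a far larger training set and probably an existing library.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase words, keeping hyphenated forms like 'much-anticipated'."""
    return re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())

class NaiveBayes:
    def __init__(self):
        self.word_counts = {}        # label -> Counter of token frequencies
        self.doc_counts = Counter()  # label -> number of training summaries
        self.vocab = set()

    def train(self, text, label):
        tokens = tokenize(text)
        self.word_counts.setdefault(label, Counter()).update(tokens)
        self.doc_counts[label] += 1
        self.vocab.update(tokens)

    def classify(self, text):
        """Return the label with the highest log-probability score."""
        tokens = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for label, counts in self.word_counts.items():
            total_words = sum(counts.values())
            score = math.log(self.doc_counts[label] / total_docs)
            for tok in tokens:
                # Add-one (Laplace) smoothing so unseen words do not zero out.
                score += math.log((counts[tok] + 1) / (total_words + len(self.vocab)))
            scores[label] = score
        return max(scores, key=scores.get)
```

Training would call `train(summary, "publisher")` or `train(summary, "librarian")` on the confidently labeled records, then `classify()` on the rest.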
> There are legal or OCLC-annoying issues with re-sharing the descriptions
> outright, but we'd be more than willing to share what we conclude about summaries
> with the larger cataloging world, such as via hash.
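Sharing via hash might work roughly as follows: normalize each summary, hash it, and publish the hash alongside the conclusion, so others can look up their own summaries without the text being redistributed. The normalization rules and the name `summary_fingerprint` are assumptions on my part, and everyone involved would have to apply the same rules for the hashes to line up.

```python
import hashlib
import re
import unicodedata

def summary_fingerprint(summary):
    """SHA-256 hex digest of a summary after light normalization."""
    # Fold compatibility characters and case, then keep only word characters
    # so that punctuation and whitespace differences do not change the hash.
    text = unicodedata.normalize("NFKC", summary).lower()
    text = " ".join(re.findall(r"\w+", text))
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

if __name__ == "__main__":
    print(summary_fingerprint("A  history of Rome."))
    print(summary_fingerprint("a history of rome"))
```

A hash like this only matches summaries that are identical after normalization. An edited or truncated copy would get a different fingerprint.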
> What do you think?
> Tim Spalding
> PS: It's probably too far outside my goals, but it would be interesting to
> go farther. As a former classics scholar-in-training, I'm tempted to try to
> automate the creation of a text-critical stemma--a tree built by analyzing
> accumulated changes--and try to peg the versions to their contributing
> institutions. Could be cool, no?