I think you might be able to create an analysis to gauge whether a summary is likely to be neutral. But I would resist labeling a summary source as "publisher" or "librarian" without actual evidence of who wrote the summary.
As an aside, I will note that RDA Beta gives the ability to provide a source for any piece of metadata within a record. But of course, MARC is not quite as flexible.
From: Code for Libraries <[log in to unmask]> On Behalf Of Tim Spalding
Sent: Thursday, September 19, 2019 12:49 PM
To: [log in to unmask]
Subject: [CODE4LIB] Identifying description sources across a large corpus of MARC records
[I also put this on AUTOCAT. Apologies if you also follow that. This falls at the intersection of hand-cataloging, data processing and simple AI.]
I wonder if anyone has thoughts on the best way to identify the source of summary/description data (520s) across a large corpus of MARC records?
My primary goal is to distinguish between more neutral, librarian-written summaries and the more promotional summaries derived from publisher sources, whether typed in from flap copy or produced by ONIX-MARC conversion. I can see a number of uses for this distinction; one is that members of LibraryThing much prefer short, neutral descriptions, and abhor the lengthy purple prose of many publisher descriptions.
So far I have a few ideas, but I'd love your thoughts on more:
The 520 $c (and $u, $2) ought to have source information. But it's rarely filled out. Are there any other "tells"?
Some catalogers put "-- Publisher" or "-- Publisher description" at the end of a 520 that comes from a publisher. I'd be interested in hearing about other conventions you use or know.
I could compare the MARC descriptions I have with similar data from Ingram, Amazon and Bowker, which (mostly) comes from publishers. If they match, the description is probably publisher-provided. (All this ignores summaries that come from non-library, non-publisher sources.)
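The three ideas above could be combined into one evidence-gathering pass, sketched below in Python. Everything here is an assumption for illustration: the 520 field is represented as a plain dict of subfield code to value (not any particular MARC library's object), the trailing-attribution pattern covers only the "-- Publisher" style mentioned above, and the 0.9 similarity threshold is arbitrary. The function returns reasons rather than a verdict, so a summary with no evidence stays unlabeled.

```python
import re
from difflib import SequenceMatcher

# Trailing attributions such as "-- Publisher" or "--Publisher description."
# This pattern is an assumption; extend it as other conventions turn up.
TRAILING_PUBLISHER = re.compile(
    r"-{1,2}\s*publisher(?:'s)?(?:\s+description)?\.?\s*$",
    re.IGNORECASE,
)

def normalize(text):
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())

def publisher_evidence(subfields, vendor_texts, threshold=0.9):
    """Return a list of reasons to suspect a 520 is publisher-derived.

    subfields: dict of subfield code -> value for one 520 (hypothetical shape).
    vendor_texts: descriptions of the same title from vendor feeds.
    threshold: similarity cutoff, chosen arbitrarily for this sketch.
    """
    reasons = []
    # Idea 1: an explicit source statement in $c.
    if "publisher" in subfields.get("c", "").lower():
        reasons.append("520 $c names a publisher source")
    # Idea 2: a cataloger's trailing attribution in $a.
    summary = subfields.get("a", "")
    if TRAILING_PUBLISHER.search(summary):
        reasons.append("trailing publisher attribution")
    # Idea 3: near-identical text in a vendor feed.
    body = normalize(TRAILING_PUBLISHER.sub("", summary))
    for vendor in vendor_texts:
        ratio = SequenceMatcher(None, body, normalize(vendor)).ratio()
        if body and ratio >= threshold:
            reasons.append("matches vendor description")
            break
    return reasons
```

Calling `publisher_evidence({"a": "... -- Publisher description"}, vendor_texts)` on each record and keeping only records with at least one reason would give a candidate "publisher" set; records with an empty list are simply unknown, not proven librarian-written.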
100% accuracy will no doubt elude me, but if I can identify a large set of both publisher and non-publisher summaries, I can perhaps use them as a training set for a Bayesian filter. It's probable that certain words mark something out as publisher-derived: "much-anticipated," "bestselling," "seminal," etc.
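As a rough illustration of the Bayesian filter idea, here is a minimal multinomial naive Bayes classifier with add-one smoothing, written in Python with only the standard library. The class name, the tokenizer, and the label strings are all assumptions made for this sketch; a real run would train on thousands of labeled 520s rather than a handful of sentences.

```python
import math
import re
from collections import Counter

def tokens(text):
    """Split into lowercase words, keeping hyphenated words whole."""
    return re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())

class NaiveBayes:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = {}        # label -> Counter of word frequencies
        self.doc_counts = Counter()  # label -> number of training summaries

    def train(self, text, label):
        self.doc_counts[label] += 1
        self.word_counts.setdefault(label, Counter()).update(tokens(text))

    def scores(self, text):
        """Return the log-probability score of each label for this text."""
        vocab = set().union(*self.word_counts.values())
        total_docs = sum(self.doc_counts.values())
        result = {}
        for label, counts in self.word_counts.items():
            total_words = sum(counts.values())
            logp = math.log(self.doc_counts[label] / total_docs)
            for word in tokens(text):
                # Add-one smoothing so unseen words do not zero out a label.
                logp += math.log((counts[word] + 1) / (total_words + len(vocab)))
            result[label] = logp
        return result

    def classify(self, text):
        s = self.scores(text)
        return max(s, key=s.get)
```

After calling `train(summary, "publisher")` and `train(summary, "non-publisher")` over the labeled sets, `classify(new_summary)` returns the more likely label. Comparing the two values from `scores` also gives a crude confidence measure, which would allow leaving close calls unlabeled.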
There are legal or OCLC-annoying issues with re-sharing the descriptions outright, but we'd be more than willing to share what we conclude about summaries with the larger cataloging world, such as via hash.
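One way sharing "via hash" might work, sketched in Python: publish a fingerprint of each normalized summary alongside the conclusion, so others can look up their own 520s without the text itself being redistributed. The function name and the normalization rules are assumptions for this sketch; SHA-256 is used only as a convenient standard digest.

```python
import hashlib
import re

def summary_fingerprint(text):
    """Hash a summary after normalizing case, punctuation and whitespace."""
    normalized = " ".join(re.sub(r"[^\w\s]", " ", text.lower()).split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

# A shareable lookup table holds fingerprints and labels, never the text.
shared = {summary_fingerprint("A quiet novel about a lighthouse keeper."): "publisher"}

# Another library checks its own copy of the summary against the table.
label = shared.get(summary_fingerprint("A quiet novel about a  lighthouse keeper"))
```

An exact hash only matches summaries that are identical after normalization, so locally edited or truncated versions of the same description would be missed.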
What do you think?
PS: It's probably too far outside my goals, but it would be interesting to go farther. As a former classics scholar-in-training, I'm tempted to try to automate the creation of a text-critical stemma--a tree built by analyzing accumulated changes--and try to peg the versions to their contributing institutions. Could be cool, no?