I think you might be able to create an analysis to gauge whether a summary is likely to be neutral.  But I would resist labeling a summary source as "publisher" or "librarian" without actual evidence of the summary author.

As an aside, I will note that RDA Beta gives the ability to provide a source for any piece of metadata within a record.  But of course, MARC is not quite as flexible.

					Steve McDonald

-----Original Message-----
From: Code for Libraries <[log in to unmask]> On Behalf Of Tim Spalding
Sent: Thursday, September 19, 2019 12:49 PM
To: [log in to unmask]
Subject: [CODE4LIB] Identifying description sources across a large corpus of MARC records

[I also put this on AUTOCAT. Apologies if you also follow that. This falls at the intersection of hand-cataloging, data processing and simple AI.]

I wonder if anyone has thoughts on the best way to identify the source of summary/description data (520s) across a large corpus of MARC records?

My primary goal is to distinguish between more neutral, librarian-written summaries and the more promotional summaries derived from publisher sources, whether typed in from flap copy or produced by ONIX-MARC conversion. I can see a number of uses for this distinction; one is that members of LibraryThing much prefer short, neutral descriptions, and abhor the lengthy purple prose of many publisher descriptions.

So far I have a few ideas, but I'd love your thoughts on more:

The 520 $c (and $u, $2) ought to have source information. But it's rarely filled out. Are there any other "tells"?
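As a rough illustration of that first check, here is a minimal sketch that pulls any source-bearing subfields ($c, $u, $2) out of a 520. The dict-and-tuples field shape and the name `explicit_source` are assumptions made for the example; a real MARC library will expose fields differently.

```python
# Hypothetical in-memory shape for a parsed MARC field: a tag plus an
# ordered list of (subfield code, value) pairs.
SOURCE_CODES = ("c", "u", "2")

def explicit_source(field):
    """Return the source-bearing subfields of a 520, or {} if there are none."""
    if field["tag"] != "520":
        return {}
    found = {}
    for code, value in field["subfields"]:
        # Keep the first non-empty occurrence of each source subfield.
        if code in SOURCE_CODES and value.strip():
            found.setdefault(code, value.strip())
    return found

example = {
    "tag": "520",
    "subfields": [
        ("a", "A history of the spice trade."),
        ("c", "Provided by publisher."),
    ],
}
print(explicit_source(example))
```

Records where this returns an empty dict are the ones that need the other "tells".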

Some catalogers put "-- Publisher" or "-- Publisher description" at the end of a 520 that comes from a publisher. I'd be interested in hearing about other conventions you use or know.
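A trailing marker like that can be caught with a regular expression. The sketch below is only a starting point: the list of wordings in the pattern is an assumption and would need extending as other conventions turn up, and `split_attribution` is a name invented for the example.

```python
import re

# Matches a trailing attribution such as "-- Publisher",
# "--Publisher description" or "-- Provided by publisher."
# The accepted wordings are an assumed, incomplete list.
TRAILING_PUBLISHER = re.compile(
    r"\s*--\s*"
    r"(?:provided by (?:the )?publisher|publisher(?:'s)? description|publisher)"
    r"\s*\.?\s*$",
    re.IGNORECASE,
)

def split_attribution(summary):
    """Return (summary without the marker, True) if a publisher marker is found."""
    match = TRAILING_PUBLISHER.search(summary)
    if match is None:
        return summary, False
    return summary[: match.start()].rstrip(), True

print(split_attribution("A sweeping saga of love and war. -- Publisher description"))
print(split_attribution("Examines trade routes in the Indian Ocean."))
```

Stripping the marker before any later comparison keeps it from skewing text matching or word counts.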

I could compare the MARC descriptions I have with similar data from Ingram, Amazon and Bowker, which (mostly) come from publishers. If they match, it's probably publisher provided. (All this ignores summaries that come from non-library, non-publisher sources.)
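One way to sketch that comparison, assuming the vendor descriptions for the same title are already in hand as plain strings: normalize both sides and accept a near-identical match. The 0.9 threshold and the function names are assumptions, and `difflib` is slow on long texts, so this shows the idea rather than something to run across a whole corpus.

```python
import re
from difflib import SequenceMatcher

def normalize(text):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())

def matches_vendor_copy(marc_summary, vendor_descriptions, threshold=0.9):
    """True if the summary is near-identical to any vendor description."""
    target = normalize(marc_summary)
    for candidate in vendor_descriptions:
        matcher = SequenceMatcher(None, target, normalize(candidate), autojunk=False)
        if matcher.ratio() >= threshold:
            return True
    return False

vendor = ["The much-anticipated sequel to the bestselling novel!"]
print(matches_vendor_copy("The much-anticipated sequel to the bestselling novel.", vendor))
print(matches_vendor_copy("A field guide to the birds of Ohio.", vendor))
```

A threshold below 1.0 tolerates small retyping differences; where it should sit is something to tune against records whose source is already known.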

100% accuracy will no doubt elude me, but if I can identify a large set of both publisher and non-publisher summaries, I can perhaps use them as a training set for a Bayesian filter. It's probable that certain words mark something out as publisher-derived: "much-anticipated," "bestselling," "seminal," etc.

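To make the Bayesian filter idea concrete, here is a minimal multinomial naive Bayes sketch in plain Python. The four training summaries are invented for illustration, the labels "publisher" and "librarian" are placeholders, and a real training set would need to be far larger.

```python
import math
import re
from collections import Counter

def tokens(text):
    """Lowercase words, keeping hyphenated terms like 'much-anticipated' whole."""
    return re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())

class NaiveBayes:
    """Multinomial naive Bayes with add-one smoothing over word counts."""

    def __init__(self):
        self.word_counts = {}        # label -> Counter of words
        self.doc_counts = Counter()  # label -> number of training summaries

    def train(self, label, text):
        self.word_counts.setdefault(label, Counter()).update(tokens(text))
        self.doc_counts[label] += 1

    def classify(self, text):
        vocab = set()
        for counts in self.word_counts.values():
            vocab.update(counts)
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, counts in self.word_counts.items():
            # Log prior plus smoothed log likelihood of each known word.
            score = math.log(self.doc_counts[label] / total_docs)
            denom = sum(counts.values()) + len(vocab)
            for word in tokens(text):
                if word in vocab:  # words never seen in training are ignored
                    score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Invented toy examples, not real catalog data.
model = NaiveBayes()
model.train("publisher", "The much-anticipated new thriller from the bestselling author of a seminal series.")
model.train("publisher", "A stunning, unforgettable, bestselling debut.")
model.train("librarian", "Describes the history of the spice trade in the Indian Ocean.")
model.train("librarian", "Two sisters travel to Maine to find their father.")

print(model.classify("A bestselling and much-anticipated thriller"))
print(model.classify("Describes the history of trade in Maine"))
```

The output is a guess about style, not evidence of authorship, which is the caution raised in the reply above.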
There are legal or OCLC-annoying issues with re-sharing the descriptions outright, but we'd be more than willing to share what we conclude about summaries with the larger cataloging world, such as via hash.
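Sharing via hash could look something like the sketch below: publish a fingerprint of each normalized summary alongside the conclusion reached about it, so others can look up summaries they already hold without the text itself being redistributed. The normalization rules and function names are assumptions; both sides would have to agree on the exact normalization for lookups to work.

```python
import hashlib
import re

def summary_fingerprint(text):
    """SHA-256 hex digest of a normalized summary."""
    normalized = " ".join(re.sub(r"[^\w\s]", " ", text.lower()).split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def lookup(shared_labels, text):
    """Return the shared label for a locally held summary, or None."""
    return shared_labels.get(summary_fingerprint(text))

# What would be published: fingerprints and labels only, never the text.
shared_labels = {
    summary_fingerprint("The much-anticipated sequel to the bestselling novel!"): "publisher",
}

print(lookup(shared_labels, "the much-anticipated sequel to the bestselling novel"))
print(lookup(shared_labels, "A field guide to the birds of Ohio."))
```

An exact hash only matches summaries that are identical after normalization, so lightly edited copies would be missed.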

What do you think?

Tim Spalding

PS: It's probably too far outside my goals, but it would be interesting to go farther. As a former classics scholar-in-training, I'm tempted to try to automate the creation of a text-critical stemma--a tree built by analyzing accumulated changes--and try to peg the versions to their contributing institutions. Could be cool, no?