I like the concept and think the signals you're already looking at (length,
evaluative adjectives, etc.) are a solid way to go. If you haven't already
looked at several thousand 520s to see what jumps out, I'd definitely do that.
I would expect a combo of publisher and date of imprint (perhaps also
490/830 series) would correlate strongly with 520 source. 040 may or may
not contain hints.
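To make the text-level signals concrete, here is a rough Python sketch that pulls a few of them out of a 520 string: length, a count of promotional words, and a trailing "-- Publisher" style attribution. The function name, the word list, and the regular expression are all my own placeholders, not anything from an existing tool; a real list would come from eyeballing those several thousand 520s.

```python
import re

# Hypothetical starter list: only the three terms mentioned in this thread.
PROMO_TERMS = {"much-anticipated", "bestselling", "seminal"}

# Assumed pattern for a trailing attribution such as "-- Publisher" or
# "--Publisher description."; local conventions will certainly vary.
PUBLISHER_TAIL = re.compile(
    r"--\s*publisher(?:'s)?(?:\s+description)?\s*\.?\s*$", re.IGNORECASE
)


def summary_signals(text):
    """Return a few crude features for one 520 summary string."""
    # Lowercased words, keeping internal hyphens ("much-anticipated").
    words = re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())
    return {
        "word_count": len(words),
        "promo_terms": sum(1 for w in words if w in PROMO_TERMS),
        "publisher_tail": bool(PUBLISHER_TAIL.search(text)),
    }
```

None of these is decisive on its own, but they are cheap to compute across a whole corpus and could be tabulated against publisher, imprint date, and 040 to see which combinations actually separate the two kinds of summary.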
<soapbox>That it's an issue to share/repurpose metadata created for the
express purpose of facilitating free dissemination of the resource it
describes seems schizophrenic. </soapbox>
On Thu, Sep 19, 2019, 09:50 Tim Spalding <[log in to unmask]> wrote:
> [I also put this on AUTOCAT. Apologies if you also follow that. This
> falls at the intersection of hand-cataloging, data processing and
> simple AI.]
> I wonder if anyone has thoughts on the best way to identify the source
> of summary/description data (520s) across a large corpus of MARC records.
> My primary goal is to distinguish between more neutral,
> librarian-written summaries, and the more promotional summaries
> derived from publisher sources, whether typed in from flap copy or
> produced by ONIX-MARC conversion. I can see a number of uses for this
> distinction; one is that members of LibraryThing much prefer short,
> neutral descriptions, and abhor the lengthy purple prose of many
> publisher descriptions.
> So far I have a few ideas, but I'd love your thoughts on more:
> The 520 $c (and $u, $2) ought to have source information. But it's
> rarely filled out. Are there any other "tells"?
> Some catalogers put "-- Publisher" or "-- Publisher description" at
> the end of a 520 that comes from a publisher. I'd be interested in
> hearing about other conventions you use or know.
> I could compare the MARC descriptions I have with similar data from
> Ingram, Amazon and Bowker, which (mostly) come from publishers. If
> they match, it's probably publisher provided. (All this ignores
> summaries that come from non-library, non-publisher sources.)
> 100% accuracy will no doubt elude me, but if I can identify a large
> set of both publisher and non-publisher, I can perhaps use them as a
> training set for a Bayesian filter. It's probable that certain words
> mark something out as publisher-derived—"much-anticipated,"
> "bestselling," "seminal," etc.
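For what it's worth, the Bayesian filter idea can be sketched in a few lines of plain Python. This is only a minimal multinomial naive Bayes with add-one smoothing, assuming the training set is a list of (summary text, label) pairs; the function names and the labels "publisher" and "library" are placeholders of mine, and a real run would more likely use an established library.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercased words, keeping internal hyphens ("much-anticipated")."""
    return re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())


def train(labeled):
    """Build per-label word counts from (text, label) pairs."""
    word_counts = {}          # label -> Counter of words
    doc_counts = Counter()    # label -> number of training summaries
    for text, label in labeled:
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(tokenize(text))
    vocab = set()
    for counts in word_counts.values():
        vocab.update(counts)
    return word_counts, doc_counts, vocab


def classify(text, model):
    """Return the label with the highest log-probability for this text."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    tokens = [w for w in tokenize(text) if w in vocab]  # skip unseen words
    scores = {}
    for label, counts in word_counts.items():
        total_words = sum(counts.values())
        score = math.log(doc_counts[label] / total_docs)  # class prior
        for w in tokens:
            # Add-one (Laplace) smoothing so a zero count does not veto a label.
            score += math.log((counts[w] + 1) / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)
```

Usage would be along the lines of `model = train(pairs)` followed by `classify(some_520_text, model)`. With a large enough labeled set, the per-label word counts should also show which words carry the most weight, which is one way to find further "tells".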
> There are legal or OCLC-annoying issues with re-sharing the
> descriptions outright, but we'd be more than willing to share what we
> conclude about summaries with the larger cataloging world, such as via
> What do you think?
> Tim Spalding
> PS: It's probably too far outside my goals, but it would be
> interesting to go farther. As a former classics scholar-in-training,
> I'm tempted to try to automate the creation of a text-critical
> stemma--a tree built by analyzing accumulated changes--and try to peg
> the versions to their contributing institutions. Could be cool, no?