[I also put this on AUTOCAT. Apologies if you also follow that. This
falls at the intersection of hand-cataloging, data processing and
I wonder if anyone has thoughts on the best way to identify the source
of summary/description data (520s) across a large corpus of MARC records.
My primary goal is to distinguish between more neutral,
librarian-written summaries, and the more promotional summaries
derived from publisher sources, whether typed in from flap copy or
produced by ONIX-MARC conversion. I can see a number of uses for this
distinction; one is that members of LibraryThing much prefer short,
neutral descriptions, and abhor the lengthy purple prose of many
So far I have a few ideas, but I'd love your thoughts on more:
The 520 $c (and $u, $2) ought to carry source information, but they're
rarely filled out. Are there any other "tells"?
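
The explicit source subfields are cheap to check before moving on to
fuzzier tells. A minimal sketch, assuming each 520 has already been
parsed into a list of (code, value) pairs; the function name
stated_source and that input shape are illustrative assumptions, not
part of any MARC library:

```python
# Subfields of the 520 that can carry source information.
SOURCE_CODES = ("c", "u", "2")


def stated_source(subfields):
    """Return {code: value} for the non-empty source subfields of one 520.

    `subfields` is assumed to be an iterable of (code, value) pairs,
    e.g. [("a", "Summary text."), ("c", "Publisher")].
    """
    return {
        code: value.strip()
        for code, value in subfields
        if code in SOURCE_CODES and value.strip()
    }


field = [("a", "A boy and his dog cross Alaska."), ("c", "Publisher")]
print(stated_source(field))
```

An empty result just means the record says nothing either way, which
will be the usual case.
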
Some catalogers put "-- Publisher" or "-- Publisher description" at
the end of a 520 that comes from a publisher. I'd be interested in
hearing about other conventions you use or know.
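
Trailing attributions like these can be peeled off with a regular
expression. The sketch below is only illustrative: the pattern covers
the two conventions mentioned above plus "-- Provided by publisher",
which is an assumed additional form that should be checked against
real records. split_attribution is a hypothetical helper name.

```python
import re

# Matches a trailing "-- <source>" note at the very end of a summary.
# The alternatives listed here are assumptions, to be extended as other
# conventions turn up.
TRAILING_ATTRIBUTION = re.compile(
    r"""\s*--\s*
        (?P<source>
            provided\ by\ (?:the\ )?publisher
          | publisher(?:'s)?(?:\ description)?
        )
        \s*\.?\s*$""",
    re.IGNORECASE | re.VERBOSE,
)


def split_attribution(summary):
    """Return (summary_without_note, source_note_or_None)."""
    match = TRAILING_ATTRIBUTION.search(summary)
    if match is None:
        return summary.strip(), None
    return summary[: match.start()].strip(), match.group("source")


print(split_attribution("A gripping tale. -- Publisher description"))
```

Requiring the double hyphen keeps ordinary words such as
"self-publisher" at the end of a summary from being mistaken for a note.
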
I could compare the MARC descriptions I have with similar data from
Ingram, Amazon and Bowker, which (mostly) come from publishers. If
they match, it's probably publisher-provided. (All this ignores
summaries that come from non-library, non-publisher sources.)
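
Since retyped or converted copy rarely matches character for
character, the comparison probably needs to be fuzzy. A rough sketch
using Python's standard difflib; the normalization, the function names
and the 0.9 threshold are all assumptions that would need tuning
against real data:

```python
import re
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase and reduce every run of non-alphanumerics to one space."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def similarity(a, b):
    """Similarity ratio in [0, 1] between two normalized descriptions."""
    return SequenceMatcher(None, normalize(a), normalize(b), autojunk=False).ratio()


def looks_like_vendor_copy(marc_summary, vendor_descriptions, threshold=0.9):
    """True if the summary closely matches any vendor description."""
    return any(
        similarity(marc_summary, vendor) >= threshold
        for vendor in vendor_descriptions
    )


vendor = ["The much anticipated sequel to the best-selling novel."]
print(looks_like_vendor_copy(
    "The much-anticipated sequel to the bestselling novel!", vendor))
```

Comparing every summary against every vendor description would be
slow on a large corpus, so in practice the candidates would presumably
be narrowed first, for example by ISBN.
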
100% accuracy will no doubt elude me, but if I can identify a large
set of both publisher and non-publisher summaries, I can perhaps use
them as a training set for a Bayesian filter. It's probable that
certain words mark a summary as publisher-derived: "much-anticipated,"
"bestselling," "seminal," etc.
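
To make the Bayesian-filter idea concrete, here is a minimal sketch of
a multinomial naive Bayes classifier with add-one smoothing, in plain
Python. It is one simple way such a filter could work, not a tested
design; the class and label names are hypothetical, and the four
training sentences are invented toy data rather than real records.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase words, keeping hyphenated forms like 'much-anticipated'."""
    return re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())


class NaiveBayes:
    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()              # label -> number of summaries
        self.vocab = set()

    def train(self, text, label):
        tokens = tokenize(text)
        self.word_counts[label].update(tokens)
        self.doc_counts[label] += 1
        self.vocab.update(tokens)

    def log_scores(self, text):
        """Log prior plus smoothed log likelihood for each label."""
        tokens = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for label, n_docs in self.doc_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)
            for tok in tokens:
                score += math.log((counts[tok] + 1) / denom)
            scores[label] = score
        return scores

    def classify(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Invented toy examples, for illustration only.
nb = NaiveBayes()
nb.train("The much-anticipated new thriller from the bestselling author", "publisher")
nb.train("A seminal, unforgettable masterpiece that will keep you riveted", "publisher")
nb.train("Describes the life cycle of frogs", "library")
nb.train("A boy and his dog travel across Alaska in 1925", "library")

print(nb.classify("a bestselling and seminal thriller"))
```

With a real training set, the summaries labeled by the source
subfields, the trailing notes and the vendor matches would take the
place of the toy sentences.
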
There are legal or OCLC-annoying issues with re-sharing the
descriptions outright, but we'd be more than willing to share what we
conclude about summaries with the larger cataloging world, such as via
What do you think?
PS: It's probably too far outside my goals, but it would be
interesting to go farther. As a former classics scholar-in-training,
I'm tempted to try to automate the creation of a text-critical
stemma--a tree built by analyzing accumulated changes--and try to peg
the versions to their contributing institutions. Could be cool, no?