[I also put this on AUTOCAT. Apologies if you also follow that. This
falls at the intersection of hand-cataloging, data processing and
simple AI.]

I wonder if anyone has thoughts on the best way to identify the source
of summary/description data (520s) across a large corpus of MARC
records?

My primary goal is to distinguish between more neutral,
librarian-written summaries and the more promotional summaries derived
from publisher sources, whether typed in from flap copy or produced by
ONIX-to-MARC conversion. I can see a number of uses for this
distinction; one is that members of LibraryThing much prefer short,
neutral descriptions, and abhor the lengthy purple prose of many
publisher descriptions.

So far I have a few ideas, but I'd love your thoughts on more:

The 520 $c (and $u, $2) ought to have source information. But it's
rarely filled out. Are there any other "tells"?
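
For concreteness, here's the sort of check I mean (a minimal sketch
using pymarc; "records.mrc" is a placeholder file name):

    from pymarc import MARCReader

    # Report whether each 520 carries any explicit source subfield.
    with open("records.mrc", "rb") as fh:
        for record in MARCReader(fh):
            if record is None:  # skip records pymarc couldn't parse
                continue
            for field in record.get_fields("520"):
                summary = " ".join(field.get_subfields("a", "b"))
                source = field.get_subfields("c") or field.get_subfields("2")
                if source:
                    print("explicit source:", source)
                elif field.get_subfields("u"):
                    print("source URI:", field.get_subfields("u"))
                else:
                    print("no source subfield:", summary[:60])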

Some catalogers put "-- Publisher" or "-- Publisher description" at
the end of a 520 that comes from a publisher. I'd be interested in
hearing about other conventions you use or know.
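
Those tails are easy to test for mechanically. A sketch; the suffix
list covers only the conventions I happen to know about:

    import re

    # Flag summaries whose tail credits the publisher or the packaging.
    # Additions to this list are exactly what I'm asking for.
    PUBLISHER_TAIL = re.compile(
        r"--\s*(provided by (the )?publisher|publisher('s)?"
        r"( description| summary)?|book jacket|container)\.?\s*$",
        re.IGNORECASE,
    )

    def looks_publisher_credited(summary):
        return bool(PUBLISHER_TAIL.search(summary.strip()))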

I could compare the MARC descriptions I have with similar data from
Ingram, Amazon and Bowker, which (mostly) come from publishers. If
they match, it's probably publisher provided. (All this ignores
summaries that come from non-library, non-publisher sources.)
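
Matching won't be exact (curly quotes, appended credits, light edits),
so I'd normalize and fuzzy-match rather than compare strings directly.
A sketch; the 0.9 threshold is a guess:

    from difflib import SequenceMatcher
    import re

    def normalize(text):
        # Strip a trailing publisher credit, then collapse case,
        # punctuation, and whitespace before comparing.
        text = re.sub(r"--\s*publisher.*$", "", text, flags=re.IGNORECASE)
        return " ".join(re.sub(r"[^a-z0-9 ]+", " ", text.lower()).split())

    def probably_publisher(marc_summary, vendor_summary, threshold=0.9):
        a, b = normalize(marc_summary), normalize(vendor_summary)
        return SequenceMatcher(None, a, b).ratio() >= threshold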

100% accuracy will no doubt elude me, but if I can identify a large
set of both publisher and non-publisher summaries, I can perhaps use
them as a training set for a Bayesian filter. It's probable that
certain words mark something out as publisher-derived:
"much-anticipated," "bestselling," "seminal," etc.

There are legal or OCLC-annoying issues with re-sharing the
descriptions outright, but we'd be more than willing to share what we
conclude about summaries with the larger cataloging world, such as via
hashes of the summary text.
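
For instance (assuming the same normalization used for matching
above):

    import hashlib

    def summary_key(summary):
        # A stable key for a summary that doesn't expose its text;
        # normalization should match whatever the matching step uses.
        normalized = " ".join(summary.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()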

What do you think?

Tim Spalding
LibraryThing

PS: It's probably too far outside my goals, but it would be
interesting to go further. As a former classics scholar-in-training,
I'm tempted to try to automate the creation of a text-critical
stemma--a tree built by analyzing accumulated changes--and to peg the
versions to their contributing institutions. Could be cool, no?
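
A quick-and-dirty cousin of that would be clustering variant summaries
by pairwise edit similarity (a real stemma method would be
directional; the variants and institution labels here are invented):

    from difflib import SequenceMatcher
    from itertools import combinations
    import numpy as np
    from scipy.cluster.hierarchy import linkage
    from scipy.spatial.distance import squareform

    # Invented variants of one summary, keyed by invented institutions.
    variants = {
        "LC": "A history of the port of Rotterdam.",
        "OCLC-A": "A history of the port of Rotterdam, 1600-1900.",
        "Vendor": "The bestselling history of Rotterdam's famous port!",
    }
    names = list(variants)
    dist = np.zeros((len(names), len(names)))
    for i, j in combinations(range(len(names)), 2):
        sim = SequenceMatcher(None, variants[names[i]],
                              variants[names[j]]).ratio()
        dist[i, j] = dist[j, i] = 1.0 - sim

    # Average-linkage tree over the distances; read it as a very
    # rough stemma.
    tree = linkage(squareform(dist), method="average")
    print(tree)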