I like the concept and think the signals you're already looking at (length, evaluative adjectives, etc.) are a solid way to go. If you haven't already looked at several thousand 520s to see what jumps out, I'd definitely do that.

I would expect that a combo of publisher and date of imprint (perhaps also 490/830 series) would correlate strongly with 520 source. 040 may or may not contain hints.

<soapbox>That it's an issue to share/repurpose metadata created for the express purpose of facilitating free dissemination of the resource it describes seems schizophrenic.</soapbox>

kyle

On Thu, Sep 19, 2019, 09:50 Tim Spalding wrote:

> [I also put this on AUTOCAT. Apologies if you also follow that. This
> falls at the intersection of hand-cataloging, data processing and
> simple AI.]
>
> I wonder if anyone has thoughts on the best way to identify the source
> of summary/description data (520s) across a large corpus of MARC
> records?
>
> My primary goal is to distinguish between more neutral,
> librarian-written summaries, and the more promotional summaries
> derived from publisher sources, whether typed in from flap copy or
> produced by ONIX-MARC conversion. I can see a number of uses for this
> distinction; one is that members of LibraryThing much prefer short,
> neutral descriptions, and abhor the lengthy purple prose of many
> publisher descriptions.
>
> So far I have a few ideas, but I'd love your thoughts on more:
>
> The 520 $c (and $u, $2) ought to have source information. But it's
> rarely filled out. Are there any other "tells"?
>
> Some catalogers put "-- Publisher" or "-- Publisher description" at
> the end of a 520 that comes from a publisher. I'd be interested in
> hearing about other conventions you use or know.
>
> I could compare the MARC descriptions I have with similar data from
> Ingram, Amazon and Bowker, which (mostly) come from publishers. If
> they match, it's probably publisher provided. (All this ignores
> summaries that come from non-library, non-publisher sources.)
>
> 100% accuracy will no doubt elude me, but if I can identify a large
> set of both publisher and non-publisher, I can perhaps use them as a
> training set for a Bayesian filter. It's probable that certain words
> mark something out as publisher-derived: "much-anticipated,"
> "bestselling," "seminal," etc.
>
> There are legal or OCLC-annoying issues with re-sharing the
> descriptions outright, but we'd be more than willing to share what we
> conclude about summaries with the larger cataloging world, such as via
> hash.
>
> What do you think?
>
> Tim Spalding
> LibraryThing
>
> PS: It's probably too far outside my goals, but it would be
> interesting to go farther. As a former classics scholar-in-training,
> I'm tempted to try to automate the creation of a text-critical
> stemma--a tree built by analyzing accumulated changes--and try to peg
> the versions to their contributing institutions. Could be cool, no?
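
For anyone who wants to experiment with the ideas in the quoted message, here is a rough Python sketch (standard library only) of three of them: spotting the trailing "-- Publisher" convention, training a small naive Bayes filter on labeled summaries, and hashing a normalized summary so conclusions can be shared without the text itself. Everything in it is an assumption made for illustration: the function and class names are hypothetical, the four sample summaries are invented, and the regex covers only the two marker spellings mentioned above. A real run would pull 520 $a strings from actual MARC records and need far more training data.

```python
import hashlib
import math
import re
from collections import Counter

# Trailing source notes such as "-- Publisher" or "-- Publisher description".
# Assumption: only these spellings; local conventions will vary.
PUBLISHER_MARKER = re.compile(
    r"\s*-{1,2}\s*publisher(?:'s)?(?:\s+description)?\.?\s*$", re.IGNORECASE
)

def split_marker(summary):
    """Return (text_without_marker, had_marker) for one 520 $a string."""
    stripped, n = PUBLISHER_MARKER.subn("", summary)
    return stripped.strip(), n > 0

def tokens(text):
    """Lowercase words, keeping hyphenated forms like 'much-anticipated'."""
    return re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())

class NaiveBayes:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = {}        # label -> Counter of word frequencies
        self.doc_counts = Counter()  # label -> number of training summaries
        self.vocab = set()

    def train(self, text, label):
        words = tokens(text)
        self.word_counts.setdefault(label, Counter()).update(words)
        self.doc_counts[label] += 1
        self.vocab.update(words)

    def log_scores(self, text):
        total_docs = sum(self.doc_counts.values())
        words = tokens(text)
        scores = {}
        for label, counts in self.word_counts.items():
            total_words = sum(counts.values())
            score = math.log(self.doc_counts[label] / total_docs)
            for w in words:
                score += math.log(
                    (counts[w] + 1) / (total_words + len(self.vocab))
                )
            scores[label] = score
        return scores

    def classify(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)

def summary_hash(summary):
    """SHA-256 of the normalized summary, for sharing without the text."""
    normalized = " ".join(tokens(summary))
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

# Invented sample data. Publisher examples are picked up via the marker;
# librarian examples are assumed to be hand-labeled, since a missing
# marker alone does not show that a librarian wrote the summary.
marked = [
    "The much-anticipated new thriller from the bestselling author. -- Publisher",
    "A seminal, bestselling masterpiece that will captivate readers. -- Publisher description",
]
hand_labeled_librarian = [
    "Describes the history of the canal from 1800 to 1900.",
    "A family moves to a farm and learns to raise sheep.",
]

model = NaiveBayes()
for raw in marked:
    text, had_marker = split_marker(raw)
    if had_marker:
        model.train(text, "publisher")
for text in hand_labeled_librarian:
    model.train(text, "librarian")
```

Stripping the marker before training matters: otherwise the filter would simply learn the word "publisher" rather than the promotional vocabulary. The hash is taken over the tokenized text so that differences in case, punctuation and spacing do not produce different digests, though any real edit to the wording will.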