And one more (tiny, compared to edsu's) data point. You can see the $3
values from the 10,000-plus records that had 856 fields, out of an original
1 million records from the UC Berkeley catalog, here:
<http://roytennant.com/proto/856/?string=%243>
in all of its, uh, gory detail. But I agree that there is some low-hanging
fruit here. It wouldn't take a rocket scientist (heck, even I can figure
this out) to do a case-insensitive string match on "table of contents", for
example. But Michael's point still stands -- this is an uncontrolled field,
so it can get messy pretty quickly. In the end, I think if we focus on the
20 percent that we can do something useful with, we might just get an 80
percent return. After all, in Ed's list, taking the first half-a-dozen items
and variations on "PDF" would probably cover 99% of the cases.
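
For what it's worth, here is a rough Python sketch of the kind of
case-insensitive matching I mean. The bucket labels, the patterns, and the
function names are all made up for illustration; they are not drawn from
either data set, and a real list would have to come from looking at the
actual values.

```python
import re
from collections import Counter

# Hypothetical buckets: (label, pattern) pairs, checked in order.
# These two are assumptions for illustration, not an agreed vocabulary.
PATTERNS = [
    ("table of contents", re.compile(r"table of contents", re.IGNORECASE)),
    ("pdf", re.compile(r"\bpdf\b", re.IGNORECASE)),
]


def classify(value):
    """Return the label of the first pattern that matches, else None."""
    for label, pattern in PATTERNS:
        if pattern.search(value):
            return label
    return None


def tally(values):
    """Count how many $3 values land in each bucket."""
    return Counter(classify(v) or "unmatched" for v in values)
```

Running tally() over the extracted $3 strings would show how much of the
field a handful of patterns really covers, and the "unmatched" count says
how long the messy tail is.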
Roy
On Wed, Jul 7, 2010 at 9:28 PM, Ed Summers <[log in to unmask]> wrote:
> On Wed, Jul 7, 2010 at 7:00 PM, Doran, Michael D <[log in to unmask]> wrote:
> > Of course, subfield $3 values are not any kind of controlled vocabulary,
> > so it's hard to do much with them programmatically.
>
> A few years ago I analyzed the subfield 3 values in the Library of
> Congress data up at the Internet Archive [1]. Of course it's really
> simple to extract, but I just pushed it up to GitHub, mainly to share
> the results [2].
>
> I extracted all the subfield 3 values from the 12M? records, and then
> counted them up to see how often they repeated [3]. As you can see
> it's hardly controlled, but it might be worthwhile coming up with some
> simple heuristics and properties for the familiar ones: you could
> imagine dcterms:description being used for "Publisher description",
> etc.
>
> Of course the $3 in your catalog data might be different from LC's, but
> maybe we could come up with a list of common ones on a wiki somewhere,
> and publish a little vocabulary that covered the important relations?
>
> //Ed
>
> [1] http://www.archive.org/details/marc_records_scriblio_net
> [2] http://github.com/edsu/beat
> [3] http://github.com/edsu/beat/raw/master/types.txt
>