On Jan 4, 2024, at 11:26 AM, Alison Clemens <[log in to unmask]> wrote:

> Has anyone here done text analysis-type work on MARC data, particularly on
> topical subject headings? I work closely with my library's digital
> collections, and I am interested in seeing what kinds of topics (as
> indicated in our descriptive data) are represented in our
> digital collections. So, I have the corresponding MARCXML for the
> materials and have extracted the 650s as a string (e.g., *650 $a World War,
> 1914-1918 $x Territorial questions $v Maps*), but I'm a little stuck on how
> to meaningfully analyze the data. I tried feeding the data into Voyant, but
> I think it's too large of a corpus to run properly there, and regardless,
> the MARC data is (of course) delimited in a specific way.
> 
> Any / all perspectives or experience would be welcome -- please do get in
> touch directly (at [log in to unmask]), if you'd like.
> 
> --
> Alison Clemens
> Beinecke Rare Book and Manuscript Library, Yale University


The amount of available content is kinda small, relative to the size of the values in 6xx; the number of records may be large, but the total number of words is small. That said, I can think of a number of ways such analysis can be done. The process boils down to four very broad steps:

  1) articulating more thoroughly what questions you want to ask of the MARC
  2) distilling the MARC into one or more formats amenable to a given modeling/analysis process
  3) modeling/analyzing the data
  4) evaluating the results

For example, suppose you simply wanted to know the frequency of each FAST subject heading. I would loop through each 6xx field in each MARC record, extract the given subjects, parse the values into FAST headings, and output the result to a file; a sketch of such a script appears after the sample below. You will then have a file looking something like this:

  United States
  World War, 1914-1918
  Directories
  Science, Ancient
  Maps
  Librarians
  Origami
  Science, Ancient
  Origami
  Maps
  Philosophy
  Dickens, Charles
  World War, 1914-1918
  Territorial questions
  Maps
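
Here is a minimal sketch of that extraction step, assuming Python with the pymarc library and MARCXML saved in a file named records.xml; the file names and the restriction to 650$a are my assumptions:

  # extract topical headings (650$a) from MARCXML and save them, one per line
  from pymarc import parse_xml_to_array

  records = parse_xml_to_array('records.xml')
  with open('headings.txt', 'w') as out:
      for record in records:
          for field in record.get_fields('650'):
              # each $a is one topical heading; trim trailing punctuation
              for heading in field.get_subfields('a'):
                  out.write(heading.strip(' .') + '\n')

Extend get_fields with '600', '610', '651', and friends if you want more than the topical headings.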

Suppose the file is named headings.txt. You can then sort the list, use the Linux uniq command to count each heading, and pipe the result back to the sort command; you will end up with a groovy frequency list. The command will look something like this:

  cat headings.txt | sort | uniq -c | sort -rn

Here is the result:

   3 Maps
   2 World War, 1914-1918
   2 Science, Ancient
   2 Origami
   1 United States
   1 Territorial questions
   1 Philosophy
   1 Librarians
   1 Directories
   1 Dickens, Charles

Such a process will give you one view of your data. Relatively quick and easy.

Suppose instead you wanted to extract latent themes from the content of MARC 6xx. This is sometimes called "topic modeling", and MALLET is the granddaddy of topic modeling tools. Loop through each 6xx field of your MARC records, extract the headings, and for each record create a plain text file containing the data. In the end you will have thousands of tiny plain text files. You can then turn MALLET against the files, and the result will be a set of weighted themes -- "topics". For extra credit, consider adding the values of 245, 1xx, and 5xx to your output. If each plain text file is associated with a metadata value (such as date, collection, format, etc.), then the resulting topic model can be pivoted, and you will be able to observe how the topics compare to the metadata values. For example, you could answer the question, "For items in these formats, what are the more frequent topics?" or "How have our subjects ebbed & flowed over time?"
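
To make this concrete, here is a sketch of the corpus-building step, again assuming pymarc and MARCXML in records.xml; the choice of 6xx fields and subfields, and the corpus directory name, are my assumptions:

  # write one tiny plain text file of subject headings per MARC record
  import os
  from pymarc import parse_xml_to_array

  os.makedirs('corpus', exist_ok=True)
  for i, record in enumerate(parse_xml_to_array('records.xml')):
      headings = []
      for field in record.get_fields('600', '610', '611', '630', '650', '651'):
          headings.extend(field.get_subfields('a', 'x', 'v', 'y', 'z'))
      with open(f'corpus/{i:06d}.txt', 'w') as out:
          out.write('\n'.join(headings))

MALLET can then import the directory and train a model with something like the following two commands; the number of topics and the output file names are illustrative:

  bin/mallet import-dir --input corpus --output corpus.mallet --keep-sequence
  bin/mallet train-topics --input corpus.mallet --num-topics 12 --output-topic-keys keys.txt --output-doc-topics composition.txt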

I do this sort of work all the time; what you are describing is a very large part of my job. Here in our scholarship center, people bring me lots o' content, and I use processes very much like the ones outlined above to help them use & understand it.

Fun!

--
Eric Morgan <[log in to unmask]>
Navari Family Center for Digital Scholarship
University of Notre Dame

574/631-8604