On May 7, 2019, at 11:59 AM, Junior Tidal <[log in to unmask]> wrote:

> The newest issue of code4Lib Journal is now available - https://journal.code4lib.org/issues/issues/issue44


For a good time, I fed all of the articles from the current Code4Lib Journal to my application/system called the Distant Reader. [1] The Reader outputs somewhat interesting results. For example, here is the alphabetized list of keywords:

  app; apps; archival; archives; archivesspace; audio; book;
  books; collection; collections; data; date; dates; develop;
  developed; developer; developers; developing; development;
  digital; digitization; digitized; display; displaying;
  displays; editor; editors; email; emailing; emails;
  equipment; field; fields; http; https; libraries; library;
  mobile; mobiles; mobilization; mobilizing; new; news;
  problem; problems; project; projects; record; recorded;
  recorder; recording; recordings; records; repositories;
  repository; search; searches; searching; software; space;
  spaces; studies; studio; study; technical; tei; title;
  titles; user; users; video; videos; widget; widgets

Not too surprising.

Here is a list of automatically generated textual summaries:

  * by Raffaele Viglianti, Marcus Emmanuel Barnes, Natkeeran
    Ledchumykanthan, Kirsta Stapelfeldt Introduction Early Modern
    Songscapes (EMS) is an interdisciplinary web project co-developed
    by the University of Toronto Scarborough’s Digital Scholarship
    Unit (DSU), the University of Maryland (including the Maryland
    Institute for technology in the Humanities), and the University
    of South Carolina.

  * Building on a method created by the Orbis Cascade Alliance, we
    built a Google form that allows users to report problems
    connecting to full text (or any other issue) and automatically
    includes the permalink in their response. We soon realized that
    we could improve the user experience by automatically forwarding
    these reports into our Ask a Librarian email service (LibAnswers)
    so we could offer alternative solutions while we worked on fixing
    the initial issue.

  * Without a practical solution for an academic library, East
    Tennessee State University developed an automated process to
    generate book widgets utilizing data from Alma Analytics. Our
    efforts of creating a book slider for each subject guide relied
    on separate Alma analytics reports and scripts for automation. In
    order to reduce the amount of repeat cURL calls every time the
    import process occurs, we started storing the cover image URL,
    current date, and the corresponding MMS_ID (Alma record ID) in a
    separate array file.

  * In the case of this study, search logs were analyzed to
    understand the popularity of mobile device types used (e.g., iOS
    and Android devices), the nature of search terms in mobile
    searches (the number of words per query), and VuFind search
    facets used and not used in mobile search of the library catalog.
    Logs of both native and responsive mobile apps can be compared
    for search terms used, search query length, and relative
    popularity of each access type across the state, e.g., baseline
    data about the use of each.

  * We began discussing what services we would actually need to
    support search, display, and data backup and integrity in a more
    distributed ecosystem, and began seriously considering a solution
    based on the microservices architectural model, where multiple
    single-purpose systems are integrated to provide the
    functionality of a larger piece of multi-purpose software. Our
    digital collections and preservation librarian explored and
    implemented FITS in her file characterization workflows to enable
    us to generate and store technical metadata.

  * The primary focus of this equipment request was to: Obtain
    equipment necessary to utilize Penn State’s One Button Studio
    software Allow for the use of a teleprompter Slim down the
    lighting set-up with compact LED light panels Provide some sound
    proofing of the space with sound treatment foam Bolster the audio
    recording capabilities with a mix board Add a backdrop to make
    video recordings more professional and visually appealing Table
    4.

  * When authors contribute to the journal, they are given an
    optional demographic survey that collects information including
    ethnicity/cultural identity, gender, location (country),
    disability, and institution type. Data that has been disclosed
    indicates that a majority of authors (52%) identify as white (see
    figure 1), with the second most largest group that responded to
    the survey indicated “no response.” Contributors’ self-reported
    gender identities show that this has slightly changed.

  * We first heard of the Timewalk date parsing plugin for
    ArchivesSpace from discussions with fellow members of the user
    community, and we considered ways we could use the plugin to work
    with legacy data.[8] We knew that the parsing function was
    triggered upon saving a record in the ArchivesSpace staff
    interface, so we attempted, in a test instance, to GET/POST a
    number of records via the API without making any changes.

What is discussed in the issue can be enumerated by looking at the lemmatized nouns:

  alma; app; archivesspace; article; book; collection; datum;
  device; display; equipment; experience; field; figure; format;
  information; interface; library; log; microphone; primo; problem;
  process; project; record; report; result; script; search;
  service; software; solution; source; space; staff; studio; study;
  system; tei; text; time; title; type; university; use; user;
  video; web; widget; work

The actions in articles are enumerated by the verbs:

  add; allow; analyze; base; be; begin; build; call; choose; come;
  consider; contain; create; develop; do; exist; find; follow;
  generate; get; have; help; identify; implement; improve; include;
  look; make; need; offer; present; provide; publish; receive;
  record; report; require; run; see; set; show; store; support;
  take; understand; use; want; work; write

Things and actions in the articles are described with adjectives:

  able; academic; additional; audio; available; creative; current;
  dedicated; different; digital; early; easy; electronic; few;
  first; free; full; future; general; good; great; initial; large;
  many; mobile; more; most; multiple; native; new; next; old; open;
  other; own; possible; primary; repository; responsive; same;
  several; similar; single; specific; subject; such; technical;
  top; useful

When I topic model the issue and request a single topic with a single dimension (one word), the resulting word is "library". Duh! Since there were eight articles, I then modeled eight topics with three dimensions (words) each, and this is what I got:

  * digital collections repository 
  * archivesspace records dates 
  * library users time 
  * tei project mei 
  * space equipment software 
  * mobile search native 
  * widget alma books 
  * editors journal data 

I then visualized the topics in the attached image, and you can observe two things:

  1. each article discusses a single topic; each article is dominated by a single color
  2. the topic of "library users time" (the green topic) is mentioned in every article

The Distant Reader is functional, but not necessarily usable. It does not crash, nor does it output bogus data. I can feed the Reader a single URL, such as the root of Planet Code4Lib [2], and it will then consume the whole thing (hundreds and hundreds of URLs), summarizing the results. I can query a local, full text index of Project Gutenberg, find all the writings of Longfellow, Emerson, Melville, or Thoreau, generate a list of URLs pointing to the found items, feed them to the Reader, and consume them too. [3]
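
The "feed it one URL and let it find the rest" idea can be sketched with nothing but the Python standard library. This is a minimal illustration and not the Reader's actual code: the class and function names are made up for the example, the sample HTML is invented, and a real harvester would also need to fetch pages, respect robots.txt, and limit its depth.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links and drop any #fragment.
                absolute, _fragment = urldefrag(urljoin(self.base_url, value))
                is_web = absolute.startswith(("http://", "https://"))
                if is_web and absolute not in self.links:
                    self.links.append(absolute)


def harvest_links(html, base_url):
    """Return the unique absolute http(s) URLs found in an HTML string."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links


# Hypothetical page content, standing in for a fetched root page.
sample = (
    '<a href="/post/1">one</a> '
    '<a href="https://example.org/a#top">two</a> '
    '<a href="/post/1">again</a> '
    '<a href="mailto:x@example.org">mail</a>'
)
urls = harvest_links(sample, "https://planet.code4lib.org")
```

The resulting list of URLs is the sort of thing that could then be handed to a reader-like tool, one item at a time.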

Again, the Distant Reader is functional but not necessarily usable. It creates a "study carrel" in the form of a .zip file, as well as a summary file. [4, 5] The study carrel includes bunches o' other files which can be analyzed in OpenRefine, queried through SQL, topic modeled, read "closely" because the original documents are included in the archive, etc. Learning how to use the "study carrel" can be a challenge, and a few things are in the pipeline: writing documentation, enhancing the job submission process, writing a Web interface to query the study carrel, and including tools in the study carrel that make it easier to use.
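
A gentle first step with any study carrel is simply to see what kinds of files are in the .zip. The sketch below assumes nothing about the carrel's real layout; the function names are hypothetical, and the demo archive and its file names are invented so the example runs without downloading anything.

```python
import io
import zipfile
from collections import Counter


def summarize_carrel(source):
    """Count the files in a .zip archive by extension.

    `source` may be a file path or a file-like object.
    """
    counts = Counter()
    with zipfile.ZipFile(source) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            name = info.filename.rsplit("/", 1)[-1]
            extension = name.rsplit(".", 1)[-1].lower() if "." in name else "(none)"
            counts[extension] += 1
    return dict(counts)


def build_demo_carrel():
    """Build a tiny in-memory archive; the file names are invented."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("demo/txt/article-01.txt", "plain text")
        archive.writestr("demo/txt/article-02.txt", "plain text")
        archive.writestr("demo/README", "notes")
    buffer.seek(0)
    return buffer


summary = summarize_carrel(build_demo_carrel())
```

To try it on a downloaded carrel, pass the path of the .zip file to summarize_carrel() instead of the demo archive.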

Finally, for the sysops and application programmers in the room, the Distant Reader employs high-performance computing techniques. Jobs are submitted to a head node, virtual machines are spun up, the jobs are executed, the machines are spun down, and the output is made available. Each virtual machine has 10 CPUs, plenty of RAM, and more disk space than I can use. The whole thing is embarrassingly parallel, and my cores are almost always maxed out. In total, the system has 144 cores at my disposal, and it is all sponsored by the good people at XSEDE.†
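
"Embarrassingly parallel" means each document can be processed with no knowledge of any other, so the work maps cleanly onto a pool of workers. The sketch below shows that shape on a single machine; it is an illustration only, not how the Reader schedules jobs. The function names and sample documents are hypothetical, and word counting stands in for the real analysis. A thread pool is used to keep the example simple; for CPU-bound work one would likely pass concurrent.futures.ProcessPoolExecutor as the executor class instead.

```python
from concurrent.futures import ThreadPoolExecutor


def count_words(text):
    """Stand-in for per-document analysis; depends on nothing but its input."""
    return len(text.split())


def process_all(documents, workers=4, executor_class=ThreadPoolExecutor):
    """Apply count_words to every document independently, keeping input order."""
    with executor_class(max_workers=workers) as pool:
        return list(pool.map(count_words, documents))


# Hypothetical documents standing in for harvested articles.
documents = ["fun with HPC", "to do NLP", ""]
results = process_all(documents)
```

Because no task waits on another, adding workers (or machines) is mostly a matter of handing out more documents at once.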

Fun with HPC to do NLP.

[Eric goes off to "read" the whole of a newspaper called the Catholic Worker. [6]]


† This work used the Extreme Science and Engineering Discovery Environment (XSEDE), which is supported by National Science Foundation grant number ACI-1548562. Development of Apache Airavata, which was used to develop the science gateway, is supported by NSF award #1339774. XSEDE resources used include Jetstream and ECSS support.

[1] The Distant Reader - https://distantreader.org/pages/about
[2] Planet Code4Lib - https://planet.code4lib.org
[3] Longfellow, Emerson, Melville, or Thoreau - https://ntrda.me/2H3RsE0
[4] Code4Lib Issue #44 "study carrel" - http://dh.crc.nd.edu/tmp/code4lib-issue44.zip
[5] Code4Lib Issue #44 summary - http://dh.crc.nd.edu/tmp/code4lib-issue44.txt
[6] Catholic Worker - http://bit.ly/2VnSjsY

--
Eric Lease Morgan
University of Notre Dame