On May 7, 2019, at 11:59 AM, Junior Tidal wrote:

> The newest issue of code4Lib Journal is now available - https://journal.code4lib.org/issues/issues/issue44

For a good time, I fed all of the articles from the current Code4Lib Journal to my application/system called the Distant Reader. [1] The Reader outputs somewhat interesting results. For example, here is the list of alphabetic keywords:

  app; apps; archival; archives; archivesspace; audio; book; books; collection; collections; data; date; dates; develop; developed; developer; developers; developing; development; digital; digitization; digitized; display; displaying; displays; editor; editors; email; emailing; emails; equipment; field; fields; http; https; libraries; library; mobile; mobiles; mobilization; mobilizing; new; news; problem; problems; project; projects; record; recorded; recorder; recording; recordings; records; repositories; repository; search; searches; searching; software; space; spaces; studies; studio; study; technical; tei; title; titles; user; users; video; videos; widget; widgets

Not too surprising. Here is a list of automatically generated textual summaries:

  * by Raffaele Viglianti, Marcus Emmanuel Barnes, Natkeeran Ledchumykanthan, Kirsta Stapelfeldt Introduction Early Modern Songscapes (EMS) is an interdisciplinary web project co-developed by the University of Toronto Scarborough’s Digital Scholarship Unit (DSU), the University of Maryland (including the Maryland Institute for technology in the Humanities), and the University of South Carolina.

  * Building on a method created by the Orbis Cascade Alliance, we built a Google form that allows users to report problems connecting to full text (or any other issue) and automatically includes the permalink in their response. We soon realized that we could improve the user experience by automatically forwarding these reports into our Ask a Librarian email service (LibAnswers) so we could offer alternative solutions while we worked on fixing the initial issue.

  * Without a practical solution for an academic library, East Tennessee State University developed an automated process to generate book widgets utilizing data from Alma Analytics. Our efforts of creating a book slider for each subject guide relied on separate Alma analytics reports and scripts for automation. In order to reduce the amount of repeat cURL calls every time the import process occurs, we started storing the cover image URL, current date, and the corresponding MMS_ID (Alma record ID) in a separate array file.

  * In the case of this study, search logs were analyzed to understand the popularity of mobile device types used (e.g., iOS and Android devices), the nature of search terms in mobile searches (the number of words per query), and VuFind search facets used and not used in mobile search of the library catalog. Logs of both native and responsive mobile apps can be compared for search terms used, search query length, and relative popularity of each access type across the state, e.g., baseline data about the use of each.

  * We began discussing what services we would actually need to support search, display, and data backup and integrity in a more distributed ecosystem, and began seriously considering a solution based on the microservices architectural model, where multiple single-purpose systems are integrated to provide the functionality of a larger piece of multi-purpose software. Our digital collections and preservation librarian explored and implemented FITS in her file characterization workflows to enable us to generate and store technical metadata.

  * The primary focus of this equipment request was to: Obtain equipment necessary to utilize Penn State’s One Button Studio software Allow for the use of a teleprompter Slim down the lighting set-up with compact LED light panels Provide some sound proofing of the space with sound treatment foam Bolster the audio recording capabilities with a mix board Add a backdrop to make video recordings more professional and visually appealing Table 4.

  * When authors contribute to the journal, they are given an optional demographic survey that collects information including ethnicity/cultural identity, gender, location (country), disability, and institution type. Data that has been disclosed indicates that a majority of authors (52%) identify as white (see figure 1), with the second most largest group that responded to the survey indicated “no response.” Contributors’ self-reported gender identities show that this has slightly changed.

  * We first heard of the Timewalk date parsing plugin for ArchivesSpace from discussions with fellow members of the user community, and we considered ways we could use the plugin to work with legacy data.[8] We knew that the parsing function was triggered upon saving a record in the ArchivesSpace staff interface, so we attempted, in a test instance, to GET/POST a number of records via the API without making any changes.
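
This message does not say how the Reader generates summaries like those above. Purely as an illustration, and assuming a simple frequency-based extractive approach (which may not be what the Reader does), the idea of picking a few representative sentences from an article can be sketched as follows. The names `split_sentences`, `tokenize`, and `summarize`, as well as the tiny stopword list, are hypothetical and exist only for this example.

```python
import re
from collections import Counter

# A tiny, illustrative stopword list (an assumption); a real one would be much longer.
STOPWORDS = frozenset(
    "a an and are as at be by for from in is it of on or that the this to was we were with".split()
)

def split_sentences(text):
    """Naively split text into sentences on ., !, or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def tokenize(text):
    """Lowercase the text and return its alphabetic words, minus stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]

def summarize(text, count=2):
    """Return the `count` highest-scoring sentences, in their original order."""
    sentences = split_sentences(text)
    frequencies = Counter(tokenize(text))

    def score(sentence):
        words = tokenize(sentence)
        # Average word frequency, so long sentences are not favored automatically.
        return sum(frequencies[w] for w in words) / len(words) if words else 0.0

    # Rank sentence indexes by score, keep the best, then restore document order.
    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:count])
    return " ".join(sentences[i] for i in keep)

if __name__ == "__main__":
    sample = (
        "Libraries collect books. "
        "Libraries lend books to users. "
        "The weather was nice yesterday."
    )
    print(summarize(sample, count=2))
```

Scoring by average rather than total word frequency is a design choice made here to keep long sentences from winning by length alone; other weightings would be equally defensible.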

What is discussed in the issue can be enumerated by looking at the lemmatized nouns:

  alma; app; archivesspace; article; book; collection; datum; device; display; equipment; experience; field; figure; format; information; interface; library; log; microphone; primo; problem; process; project; record; report; result; script; search; service; software; solution; source; space; staff; studio; study; system; tei; text; time; title; type; university; use; user; video; web; widget; work

The actions in the articles are enumerated by the verbs:

  add; allow; analyze; base; be; begin; build; call; choose; come; consider; contain; create; develop; do; exist; find; follow; generate; get; have; help; identify; implement; improve; include; look; make; need; offer; present; provide; publish; receive; record; report; require; run; see; set; show; store; support; take; understand; use; want; work; write

Things and actions in the articles are described with adjectives:

  able; academic; additional; audio; available; creative; current; dedicated; different; digital; early; easy; electronic; few; first; free; full; future; general; good; great; initial; large; many; mobile; more; most; multiple; native; new; next; old; open; other; own; possible; primary; repository; responsive; same; several; similar; single; specific; subject; such; technical; top; useful

When I topic model the issue and request a single topic with a single dimension, the resulting word is "library". Duh! Since there are eight articles, I topic modeled on eight topics with three dimensions each, and this is what I got:

  * digital collections repository
  * archivesspace records dates
  * library users time
  * tei project mei
  * space equipment software
  * mobile search native
  * widget alma books
  * editors journal data

I then visualized the topics in the attached image, and you can observe two things:

  1. each article is dominated by a single topic (a single color)
  2. the topic of "library users time" (the green topic) is mentioned by each article

The Distant Reader is functional, but not necessarily usable. It does not crash, nor does it output bogus data. I can feed the Reader a single URL, such as the root of Planet Code4Lib [2], and it will then consume the whole thing (hundreds and hundreds of URLs), summarizing the results. I can query a local, full text index of Project Gutenberg, find all the writings of Longfellow, Emerson, Melville, or Thoreau, generate a list of URLs pointing to the found items, feed them to the Reader, and consume them too. [3]

Again, the Distant Reader is functional but not necessarily usable. It creates a "study carrel" in the form of a .zip file, as well as a summary file. [4, 5] The study carrel includes bunches o' other files which can be analyzed in OpenRefine, queried through SQL, topic modeled, read "closely" because the original documents are included in the archive, etc. Learning how to use the "study carrel" can be a challenge, and a few things are in the pipeline: writing documentation, enhancing the job submission process, writing a Web interface to query the study carrel, and including tools in the study carrel for using it accordingly.

Finally, for the sysops and application programmers in the room, the Distant Reader employs high performance computing techniques. Jobs are submitted to a head node, virtual machines are spun up, the jobs are executed, the machines are spun down, and the output is made available. Each virtual machine has 10 CPUs, plenty of RAM, and more disk space than I can use. The whole thing is embarrassingly parallel, and my cores are almost always max'ed out. In total, the system has 144 cores at my disposal, and it is all sponsored by the good people at XSEDE.†

Fun with HPC to do NLP.

[Eric goes off to "read" the whole of a newspaper called the Catholic Worker. [6]]

† This work used the Extreme Science and Engineering Discovery Environment (XSEDE), which is supported by National Science Foundation grant number ACI-1548562. Development of Apache Airavata, used to develop the science gateway, is supported by NSF award #1339774. XSEDE resources used include JetStream and ECSS support.

[1] The Distant Reader - https://distantreader.org/pages/about
[2] Planet Code4Lib - https://planet.code4lib.org
[3] Longfellow, Emerson, Melville, or Thoreau - https://ntrda.me/2H3RsE0
[4] Code4Lib Issue #44 "study carrel" - http://dh.crc.nd.edu/tmp/code4lib-issue44.zip
[5] Code4Lib Issue #44 summary - http://dh.crc.nd.edu/tmp/code4lib-issue44.txt
[6] Catholic Worker - http://bit.ly/2VnSjsY

--
Eric Lease Morgan
University of Notre Dame