
On Aug 9, 2018, at 3:06 PM, Ron PETERSON <[log in to unmask]> wrote:

> The Code4Lib Journal, Issue 41 is now available!... https://journal.code4lib.org/issues/issue41


I love our journal. code4lib++

For a number of reasons I have been working on a system I call the Distant Reader. [1] Given a URL, a set of URLs, a file, or a set of files, the Distant Reader will create a corpus, transform it into plain text, apply a number of text mining & natural language processing techniques to the corpus, and save the results in a number of different formats (semantic index, tab-delimited text files, and relational database). The purpose of the Distant Reader is akin to the purpose of a book's table-of-contents or back-of-the-book index; the Distant Reader enables a person to get a feel for a relatively large corpus -- to do "distant reading".

For a good time, I fed the Distant Reader the root URL of the Journal's current issue. It harvested the home page, all the articles, and a few ancillary links. It did its good works, and I've made the results temporarily available on a different computer. [2] From there I get a number of things:

  * a summary
  * a cache of the corpus
  * a set of plain text files created from the cache
  * all the words in all the documents, plus their parts-of-speech
  * a set of all the named-entities from all the documents
  * a list of all the URLs found in the documents
  * email addresses
  * keywords
  * the most rudimentary of bibliographics
  * a Word2Vec semantic index
  * a relational database of the whole thing
  * a zip file of the whole thing

From the results, I believe the keywords of the issue include: libraries, data, metadata, digital, repositories, collections, digitized, images, projects, access, code, imaging, issues, and management. The most frequent lemmatized nouns are very similar: library, metadata, issue, project, system, access, codelib, image, datum, and collection.
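
A list of frequent lemmas like this can be approximated from a tab-delimited parts-of-speech file. What follows is only a minimal sketch, not the Distant Reader's own code; the column names (token, lemma, pos), the "NN" tag prefix for nouns, and the sample rows are all assumptions made for illustration:

```python
import csv
import io
from collections import Counter


def frequent_lemmas(tsv_text, pos_prefix="NN", n=10):
    """Return the n most frequent lemmas whose POS tag starts with pos_prefix.

    Assumes a header row with hypothetical columns: token, lemma, pos.
    """
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    counts = Counter(
        row["lemma"].lower()
        for row in reader
        if row["pos"].startswith(pos_prefix)
    )
    return counts.most_common(n)


# Made-up sample rows, for illustration only
SAMPLE = (
    "token\tlemma\tpos\n"
    "Libraries\tlibrary\tNNS\n"
    "use\tuse\tVBP\n"
    "metadata\tmetadata\tNN\n"
    "library\tlibrary\tNN\n"
)

if __name__ == "__main__":
    print(frequent_lemmas(SAMPLE))
```

Changing pos_prefix to "VB" would, under the same assumptions, tabulate verbs instead of nouns.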

Frequent domains include: www.loc.gov, github.com, doi.org, rightsstatements.org, worldcat.org, and en.wikipedia.org.
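
Counting domains from a list of harvested URLs takes only a few lines. This is a hedged sketch using the Python standard library, not necessarily how the Distant Reader does it; the function name and the sample list are assumptions:

```python
from collections import Counter
from urllib.parse import urlparse


def frequent_domains(urls, n=6):
    """Return the n most common host names found in an iterable of URLs."""
    hosts = (urlparse(url).hostname for url in urls)
    # Strings that do not parse to a host (hostname is None) are skipped
    return Counter(host for host in hosts if host).most_common(n)


# Illustrative input only; a real list would come from the harvested corpus
SAMPLE_URLS = [
    "https://github.com/ericleasemorgan/reader",
    "https://journal.code4lib.org/issues/issue41",
    "https://github.com/ericleasemorgan/reader",
    "not a url",
]

if __name__ == "__main__":
    for host, count in frequent_domains(SAMPLE_URLS):
        print(count, host)
```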

What do we do in our journal? We be, have, use, do, make, create, provide, work, include, and require.

Software is never done. If it were, then it would be hardware. That said, the Distant Reader is fun. It enables me to "read" the whole of a book, all the things authored by a given individual, all the things in my Zotero database, a whole blog, etc. IMHO, the problem to solve nowadays is not necessarily find, but how to make the found stuff usable & understandable.


[1] Distant Reader - https://github.com/ericleasemorgan/reader
[2] results - http://dh.crc.nd.edu/tmp/issue-041/

-- 
Eric Lease Morgan
Digital Initiatives Librarian, Navari Family Center for Digital Scholarship
Hesburgh Libraries

University of Notre Dame
250E Hesburgh Library
Notre Dame, IN 46556
o: 574-631-8604
w: cds.library.nd.edu