LISTSERV 16.5 - CODE4LIB Archives

This is an unsolicited shout-out to the HathiTrust Research Center.

I've played with the tools at the HathiTrust Research Center many times and for a long time, and I believe the latest set of features is the most user-friendly (https://analytics.hathitrust.org/algorithms). There used to be many more than three algorithms, but the algorithms the Center does supply work quite well:

1. Extracted Features - Given a collection, get an rsync file allowing you to download JSON files describing each item in the collection. The only thing "wrong" with the resulting JSON is that it is a bag-of-words, which means one can't analyze things at the sentence level. I have, though, written a script to reformat the JSON as if it were a book, where each "paragraph" in the resulting "book" is really a page of data.

2. Named Entity Recognizer - Given a collection, return a list of books, page numbers, entities, and entity types for each item in the collection. In past incarnations of this script it was difficult to download the resulting data set. Downloading is now trivial, and since it is in a CSV format, it plays VERY WELL with OpenRefine.

3. Topic Model Explorer - Given a collection, creates 20, 40, 60, and 80 topics by default, and then provides a means to visualize the result. It also offers a set of downloads (in a strange format, but it "unzips" anyway) for further analysis.
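
As a rough illustration of the reformatting idea in item 1, here is a minimal Python sketch. It is not the script mentioned above. It assumes each Extracted Features file has a "features" object holding a list of "pages", and that each page has a "body" with a "tokenPosCount" mapping of token to part-of-speech counts; check those field names against the files you actually download. The function names and the sample data are made up for the example. Since the source is a bag-of-words, the original word order cannot be recovered, so tokens are simply repeated by count and sorted alphabetically.

```python
def page_to_paragraph(page):
    """Expand one page's bag-of-words into a space-separated string of tokens."""
    # Assumed layout: page["body"]["tokenPosCount"] = {token: {pos_tag: count}}
    body = page.get("body") or {}
    counts = body.get("tokenPosCount") or {}
    words = []
    for token, pos_counts in sorted(counts.items()):
        # Sum the counts over every part-of-speech tag for this token.
        words.extend([token] * sum(pos_counts.values()))
    return " ".join(words)


def features_to_book(ef):
    """Return a plain-text 'book' in which each paragraph is one page of tokens."""
    # Assumed layout: ef["features"]["pages"] is a list of page objects.
    pages = ef.get("features", {}).get("pages", [])
    paragraphs = [page_to_paragraph(page) for page in pages]
    # Skip pages with no body tokens, and separate pages with a blank line.
    return "\n\n".join(p for p in paragraphs if p)


# Hypothetical, hand-made data shaped like the assumed layout above.
sample = {
    "features": {
        "pages": [
            {"body": {"tokenPosCount": {"whale": {"NN": 2}, "the": {"DT": 1}}}},
            {"body": {"tokenPosCount": {"sea": {"NN": 1}}}},
        ]
    }
}

print(features_to_book(sample))
```

To use it on a real file, load the JSON with the standard json module (decompressing first if the file is compressed) and pass the resulting dictionary to features_to_book.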

I haven't played with the Data Capsules for a while, but as you may or may not know, they provide you with a virtual machine where you can work on texts -- much akin to a "study carrel" in a library's special collections / rare book room. You can look, but you cannot take the original things away.

--
Eric Lease Morgan
University of Notre Dame