Taking a look at the Distant Reader (which I don't believe I've looked at before):

(a) It would be great to sanity-check the corpus by running language
identification on each of the files

(b) There is a whole flotilla of technical identifiers that could
usefully be extracted from the text files (DOIs, ISBNs, ISSNs, etc.)

(c) A little webification of the texts would go a long way

(d) Has thought been put into making them data archive-friendly?
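
To make (b) concrete, here is a rough sketch in Python of the kind of extraction I have in mind. It is not anything the Distant Reader does today as far as I know; the function names and the regular expressions are my own illustrative assumptions, and the patterns are deliberately loose. The check-digit tests (ISSN mod 11, ISBN-13 mod 10) are the standard ones and help weed out false hits such as year ranges.

```python
import re

# Illustrative patterns only (assumptions, not a spec-complete grammar).
DOI_RE = re.compile(r'\b10\.\d{4,9}/[^\s"<>]+')
ISSN_RE = re.compile(r'\b\d{4}-\d{3}[\dXx]\b')
ISBN13_RE = re.compile(r'\b97[89][- ]?(?:\d[- ]?){9}\d\b')


def issn_ok(issn):
    """Validate the ISSN check digit (weights 8..2, modulo 11)."""
    chars = issn.replace('-', '').upper()
    total = sum(int(c) * w for c, w in zip(chars[:7], range(8, 1, -1)))
    check = (11 - total % 11) % 11
    return chars[7] == ('X' if check == 10 else str(check))


def isbn13_ok(isbn):
    """Validate the ISBN-13 check digit (alternating weights 1 and 3, modulo 10)."""
    digits = [int(c) for c in isbn if c.isdigit()]
    if len(digits) != 13:
        return False
    total = sum(d * (3 if i % 2 else 1) for i, d in enumerate(digits))
    return total % 10 == 0


def extract_identifiers(text):
    """Return candidate DOIs, ISSNs and ISBN-13s found in a plain text string."""
    # Trailing sentence punctuation is often glued onto a DOI in running text.
    dois = [m.rstrip('.,;)') for m in DOI_RE.findall(text)]
    issns = [m for m in ISSN_RE.findall(text) if issn_ok(m)]
    isbns = [m for m in ISBN13_RE.findall(text) if isbn13_ok(m)]
    return {'doi': dois, 'issn': issns, 'isbn13': isbns}
```

Running extract_identifiers() over each plain text file and writing the results out alongside the other indexes would be one way to use it; ISBN-10s and other identifier schemes are left out of this sketch.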


...let us be heard from red core to black sky

On Thu, 13 Jun 2019 at 08:51, Eric Lease Morgan <[log in to unmask]> wrote:
> Through the use of my tool called the Distant Reader, I have refined a process for indexing things like Code4Lib Journal. [1]
> The Distant Reader harvests an arbitrary number of user-supplied files or links to files, transforms them into plain text files, and performs numerous natural language processes against them. The result is a large set of indexes that can be used to "read" the given corpus. I have made available the about pages of a number of such indexes:
>   * Code4Lib Journal -
>      o 1,234,348 words; 303 documents
>      o all articles from a journal named Code4Lib Journal
>   * Cultural Analytics -
>      o 318,287 words; 33 documents
>      o all articles from a journal named Cultural Analytics
>   * Plato -
>      o 929,704 words; 24 documents
>      o the complete works of Plato
>   * aesthetics -
>      o 2,296,890 words; 37 documents
>      o books classified as the philosophy of art
> At an upcoming high performance computing conference, I -- with a number of colleagues from Indiana University -- will be presenting a poster about the Distant Reader, and we will be taking part in a hack-a-thon. [2, 3] If you too would like to hack against the output of the Distant Reader, then drop me a line.
> [1] Distant Reader -
> [2] high performance computing conference -
> [3] hack-a-thon invitation -
> --
> Eric Lease Morgan
> Digital Initiatives Librarian, Navari Family Center for Digital Scholarship
> Hesburgh Libraries
> University of Notre Dame
> 250E Hesburgh Library
> Notre Dame, IN 46556
> o: 574-631-8604
> e: [log in to unmask]
> w: