>> The Distant Reader [0] harvests an arbitrary number of user-supplied files or links to files, transforms them into plain text files, and performs numerous natural language processes against them. The result is a large set of indexes that can be used to "read" the given corpus. I have made available the about pages of a number of such indexes:
>>  * Code4Lib Journal -
>>     o 1,234,348 words; 303 documents
>>     o all articles from a journal named Code4Lib Journal
> Taking a look at distant reader (which I don't believe I've looked at before):
> (a) It would be great to sanity-check the corpus by running language
> identification on each of the files

Stuart, thank you for the feedback. As of right now, the Distant Reader is designed to process only English-language materials. Since it relies (I rely) on a Python module called spaCy to do the part-of-speech and named-entity extraction, I ought to be able to handle other languages, such as the Romance languages, without too much difficulty. [1]
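
A minimal sketch of the kind of sanity check suggested in (a), using only the Python standard library. It guesses a language by counting common stop words, which is a crude heuristic and not something the Distant Reader is known to do; the word lists and the function name guess_language are assumptions made up for illustration. A trained language-identification model would be more reliable on short or mixed-language files.

```python
import re
from collections import Counter

# Tiny, illustrative stop-word lists (assumptions, not exhaustive).
STOPWORDS = {
    "en": {"the", "and", "of", "to", "in", "is", "that", "it", "for", "with"},
    "fr": {"le", "la", "les", "et", "des", "est", "que", "dans", "pour", "une"},
    "es": {"el", "los", "las", "y", "es", "que", "en", "por", "una", "con"},
    "de": {"der", "die", "das", "und", "ist", "nicht", "mit", "ein", "zu", "den"},
}


def guess_language(text):
    """Return the code of the language whose stop words occur most often.

    Returns None when no stop word from any list is found.
    """
    # Count runs of letters (Unicode-aware), ignoring digits and underscores.
    tokens = Counter(re.findall(r"[^\W\d_]+", text.lower()))
    scores = {
        lang: sum(tokens[word] for word in words)
        for lang, words in STOPWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None
```

Running guess_language over each plain text file and listing the ones that do not come back as "en" would flag documents worth a closer look before the English-only processing is applied.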

> (b) There are a whole flotilla of technical identifiers that could
> usefully be extracted from the text files (DOIs, ISBNs, ISSNs, etc)

This is a fun idea, and I will investigate it further.
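
A rough sketch of what such an extraction could look like, using regular expressions from the Python standard library. The patterns are simplified assumptions: the DOI pattern follows the common "10.NNNN/suffix" shape, and ISSN and ISBN-13 candidates are kept only if their check digits validate, which filters out look-alikes such as year ranges. ISBN-10 is left out, and the function names are hypothetical.

```python
import re

# Simplified patterns (assumptions); real-world identifiers vary more widely.
DOI_RE = re.compile(r"\b10\.\d{4,9}/[-._;()/:A-Za-z0-9]+")
ISSN_RE = re.compile(r"\b(\d{4})-(\d{3}[\dXx])\b")
ISBN13_RE = re.compile(r"\b97[89][- ]?(?:\d[- ]?){9}\d\b")


def valid_issn(issn):
    """Check the ISSN check digit (weights 8 down to 2, modulo 11)."""
    chars = issn.replace("-", "").upper()
    total = sum(int(c) * w for c, w in zip(chars[:7], range(8, 1, -1)))
    check = (11 - total % 11) % 11
    return chars[7] == ("X" if check == 10 else str(check))


def valid_isbn13(isbn):
    """Check the ISBN-13 check digit (alternating weights 1 and 3, modulo 10)."""
    digits = [int(c) for c in isbn if c.isdigit()]
    total = sum(d * (3 if i % 2 else 1) for i, d in enumerate(digits[:12]))
    return len(digits) == 13 and (10 - total % 10) % 10 == digits[12]


def extract_identifiers(text):
    """Return sorted, de-duplicated DOIs, ISSNs, and ISBN-13s found in text."""
    return {
        # Strip sentence punctuation that the DOI pattern may swallow.
        "doi": sorted({m.rstrip(".,;") for m in DOI_RE.findall(text)}),
        "issn": sorted(
            {m.group(0) for m in ISSN_RE.finditer(text) if valid_issn(m.group(0))}
        ),
        "isbn13": sorted(
            {m.group(0) for m in ISBN13_RE.finditer(text) if valid_isbn13(m.group(0))}
        ),
    }
```

Applied to each plain text file, the results could be written out as one more tab-delimited index alongside the others.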

> (c) A little webification of the texts would go a long way

Hmmm... The plain text versions of the documents are necessary for the natural language processing, but instead of returning links to the plain text I could return links to the cached versions of the texts which are usually formatted in HTML or as PDF. Thus, a part of the reading process would be made easier.

> (d) Has thought been put into making them data archive-friendly?

I don't understand. In this case, what does "archive-friendly" mean?

For a good time, I created a new data set -- 460 love stories (238 million words; 460 documents; 5.94 GB uncompressed):

  * about page -
  * data set ("study carrel") -

Again, thank you for the feedback.

[0] Distant Reader -
[1] spaCy -

