Taking a look at Distant Reader (which I don't believe I've looked at before):

(a) It would be great to sanity-check the corpus by running language identification on each of the files

(b) There are a whole flotilla of technical identifiers that could usefully be extracted from the text files (DOIs, ISBNs, ISSNs, etc.)

(c) A little webification of the texts would go a long way

(d) Has thought been put into making them data archive-friendly?

cheers
stuart

--
...let us be heard from red core to black sky

On Thu, 13 Jun 2019 at 08:51, Eric Lease Morgan <[log in to unmask]> wrote:
>
> Through the use of my tool called the Distant Reader, I have refined a process for indexing things like Code4Lib Journal. [1]
>
> The Distant Reader harvests an arbitrary number of user-supplied files or links to files, transforms them into plain text files, and performs numerous natural language processes against them. The result is a large set of indexes that can be used to "read" the given corpus. I have made available the about pages of a number of such indexes:
>
>   * Code4Lib Journal - http://dh.crc.nd.edu/tmp/code4lib-journal/about.html
>     o 1,234,348 words; 303 documents
>     o all articles from a journal named Code4Lib Journal
>
>   * Cultural Analytics - http://dh.crc.nd.edu/tmp/cultural-analytics/about.html
>     o 318,287 words; 33 documents
>     o all articles from a journal named Cultural Analytics
>
>   * Plato - http://dh.crc.nd.edu/tmp/plato/about.html
>     o 929,704 words; 24 documents
>     o the complete works of Plato
>
>   * aesthetics - http://dh.crc.nd.edu/tmp/aesthetics/about.html
>     o 2,296,890 words; 37 documents
>     o books classified as the philosophy of art
>
> At an upcoming high performance computing conference, I -- with a number of colleagues from Indiana University -- will be presenting a poster about the Distant Reader, and we will be taking part in a hack-a-thon. [2, 3] If you too would like to hack against the output of the Distant Reader, then drop me a line.
>
> [1] Distant Reader - https://distantreader.org
> [2] high performance computing conference - https://www.pearc19.pearc.org
> [3] hack-a-thon invitation - https://sites.nd.edu/emorgan/2019/06/hackathon/
>
> --
> Eric Lease Morgan
> Digital Initiatives Librarian, Navari Family Center for Digital Scholarship
> Hesburgh Libraries
>
> University of Notre Dame
> 250E Hesburgh Library
> Notre Dame, IN 46556
> o: 574-631-8604
> e: [log in to unmask]
> w: cds.library.nd.edu
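
As a rough illustration of suggestion (b) at the top of this message, identifier extraction from the plain text files could start with a few regular expressions. This is only a sketch and not part of the Distant Reader: the function name `extract_identifiers`, the patterns, and the sample text are all assumptions made for illustration. The patterns are deliberately loose. ISSN and ISBN-13 candidates are filtered by their check digits, while DOIs are matched by shape only, so some false positives and misses should be expected on real corpora.

```python
import re

# Assumed, deliberately loose patterns (not taken from the Distant Reader).
DOI_RE = re.compile(r'\b10\.\d{4,9}/[^\s"<>]+')        # "10.<registrant>/<suffix>"
ISSN_RE = re.compile(r'\b\d{4}-\d{3}[\dXx]\b')          # NNNN-NNNC
ISBN13_RE = re.compile(r'\b97[89](?:[- ]?\d){10}\b')    # 978/979 + 10 more digits


def issn_is_valid(issn):
    """Check the ISSN mod-11 check digit (weights 8 down to 2)."""
    chars = issn.replace("-", "").upper()
    total = sum((8 - i) * int(c) for i, c in enumerate(chars[:7]))
    check = (11 - total % 11) % 11
    expected = "X" if check == 10 else str(check)
    return chars[7] == expected


def isbn13_is_valid(isbn):
    """Check the ISBN-13 checksum (alternating weights 1 and 3, sum mod 10 == 0)."""
    digits = [int(c) for c in isbn if c.isdigit()]
    if len(digits) != 13:
        return False
    total = sum(d * (1 if i % 2 == 0 else 3) for i, d in enumerate(digits))
    return total % 10 == 0


def extract_identifiers(text):
    """Return sorted, de-duplicated DOI, ISSN, and ISBN-13 candidates found in text."""
    # Trailing sentence punctuation is stripped from DOI matches.
    dois = {m.rstrip(".,;:)") for m in DOI_RE.findall(text)}
    issns = {m for m in ISSN_RE.findall(text) if issn_is_valid(m)}
    isbns = {m for m in ISBN13_RE.findall(text) if isbn13_is_valid(m)}
    return {"doi": sorted(dois), "issn": sorted(issns), "isbn": sorted(isbns)}


# Made-up sample text; "1234-5678" has a bad check digit and is dropped.
sample = (
    "See doi:10.1000/182, ISSN 0378-5955 and ISBN 978-3-16-148410-0. "
    "Not an ISSN: 1234-5678."
)
result = extract_identifiers(sample)
print(result)
```

Running something like this over each text file in a corpus would give a per-document list of candidate identifiers. Checking the check digits is a cheap way to cut down on page ranges and phone numbers being mistaken for ISSNs.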