> > (d) Has thought been put into making them data archive-friendly?
>
> I don't understand. In this case, what does "archive-friendly" mean?

Well, there are two options here:

(a) pre-harvest archiving: you could push URLs into archive.org (or similar) as you harvest them, giving you reproducibility; and

(b) post-harvest archiving, which probably implies changing the format of the resulting file to a standard. Possible standards include ePub, WARC or METS, depending on your vision for the project. Alternatively, work with a research data archive to include some basic metadata in the current zip in a format they can understand and unpack on ingest.

Oh, and the 'About your study carrel' page needs a colophon with links to the software, version, etc.

cheers
stuart
--
...let us be heard from red core to black sky

On Fri, 14 Jun 2019 at 04:03, Eric Lease Morgan wrote:
>
> On Jun 12, 2019, at 8:40 PM, Stuart A. Yeates wrote:
>
> >> The Distant Reader [0] harvests an arbitrary number of user-supplied files or links to files, transforms them into plain text files, and performs numerous natural language processes against them. The result is a large set of indexes that can be used to "read" the given corpus. I have made available the about pages of a number of such indexes:
> >>
> >> * Code4Lib Journal - http://dh.crc.nd.edu/tmp/code4lib-journal/about.html
> >>   o 1,234,348 words; 303 documents
> >>   o all articles from a journal named Code4Lib Journal
> >
> > Taking a look at distant reader (which I don't believe I've looked at before):
> >
> > (a) It would be great to sanity-check the corpus by running language
> > identification on each of the files
>
> Stuart, thank you for the feedback. As of right now, the Distant Reader is only designed to process English language materials. Since I rely on a Python module called spaCy to do the part-of-speech and named-entity extraction, I ought to be able to handle Romance languages as well without too much difficulty. [1]
>
> > (b) There are a whole flotilla of technical identifiers that could
> > usefully be extracted from the text files (DOIs, ISBNs, ISSNs, etc)
>
> This is a fun idea, and I will investigate it further.
>
> > (c) A little webification of the texts would go a long way
>
> Hmmm... The plain text versions of the documents are necessary for the natural language processing, but instead of returning links to the plain text I could return links to the cached versions of the texts, which are usually formatted in HTML or as PDF. Thus, a part of the reading process would be made easier.
>
> > (d) Has thought been put into making them data archive-friendly?
>
> I don't understand. In this case, what does "archive-friendly" mean?
>
> For a good time, I created a new data set -- 460 love stories (238 million words; 460 documents; 5.94 GB uncompressed)
>
> * about page - http://dh.crc.nd.edu/sandbox/reader/hackaton/love-stories/about.html
> * data set ("study carrel") - http://dh.crc.nd.edu/sandbox/reader/hackaton/love-stories.zip
>
> Again, thank you for the feedback.
>
> [0] Distant Reader - https://distantreader.org
> [1] spaCy - https://spacy.io/models
>
> --
> Eric Lease Morgan
> Digital Initiatives Librarian, Navari Family Center for Digital Scholarship
> Hesburgh Libraries
>
> University of Notre Dame
> 250E Hesburgh Library
> Notre Dame, IN 46556
> o: 574-631-8604
> w: cds.library.nd.edu
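
A rough sketch of suggestion (b) in the thread above, for anyone who wants to try it: pulling DOIs, ISSNs and ISBN-13s out of a plain text file with nothing but Python's standard re module. The patterns and the function names (extract_identifiers, issn_is_valid, isbn13_is_valid) are illustrative assumptions, not part of the Distant Reader, and real-world identifiers are messier than these patterns allow for. The check-digit tests are there to cut down on false positives such as page ranges that happen to look like ISSNs.

```python
import re

# Illustrative patterns only (assumptions, not Distant Reader code).
DOI_RE = re.compile(r'\b10\.\d{4,9}/[^\s"<>]+')
ISSN_RE = re.compile(r'\b\d{4}-\d{3}[\dXx]\b')
ISBN13_RE = re.compile(r'\b97[89](?:[- ]?\d){10}\b')


def issn_is_valid(issn):
    """Verify the ISSN check digit (weights 8 down to 2, modulo 11)."""
    chars = issn.replace("-", "").upper()
    total = sum(int(c) * w for c, w in zip(chars[:7], range(8, 1, -1)))
    check = (11 - total % 11) % 11
    return chars[7] == ("X" if check == 10 else str(check))


def isbn13_is_valid(isbn):
    """Verify the ISBN-13 check digit (alternating weights 1 and 3, modulo 10)."""
    digits = [int(c) for c in isbn if c.isdigit()]
    if len(digits) != 13:
        return False
    total = sum(d * (3 if i % 2 else 1) for i, d in enumerate(digits))
    return total % 10 == 0


def extract_identifiers(text):
    """Return candidate identifiers found in text, grouped by type."""
    # Trailing sentence punctuation is stripped from DOIs; DOIs have no
    # check digit, so they are returned unvalidated.
    dois = [m.group(0).rstrip(".,;)") for m in DOI_RE.finditer(text)]
    issns = [m.group(0) for m in ISSN_RE.finditer(text)
             if issn_is_valid(m.group(0))]
    isbns = [m.group(0) for m in ISBN13_RE.finditer(text)
             if isbn13_is_valid(m.group(0))]
    return {"doi": dois, "issn": issns, "isbn13": isbns}
```

Running extract_identifiers over each plain text file in a study carrel and writing the results next to the existing indexes would be one way to use this; ISBN-10s and DOIs wrapped across line breaks are not handled.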