> > (d) Has thought been put into making them data archive-friendly?
>
> I don't understand. In this case, what does "archive-friendly" mean?
Well, there are two options here:
(a) pre-harvest archiving: you might push URLs into archive.org (or
similar) as you harvest them, giving you reproducibility (see the
first sketch below), and
(b) post-harvest archiving, which probably means converting the
resulting file to a standard format. Possible standards include ePub,
WARC or METS, depending on your vision for the project. Alternatively,
work with a research data archive to include some basic metadata in
the current zip, in a format they can understand and unpack on ingest
(see the second sketch below).
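To make (a) concrete, here is a minimal sketch of what pre-harvest
archiving could look like. It assumes the Wayback Machine's public
'Save Page Now' endpoint and the Python requests library; the function
name is illustrative and none of this is part of the current Distant
Reader:

    import requests

    def archive_then_fetch(url):
        """Ask the Wayback Machine to capture a URL before harvesting it.

        Sketch only: a real harvester would want rate limiting and
        proper error handling around both requests.
        """
        # Request a capture so the harvested page is preserved somewhere
        # other than the study carrel itself.
        requests.get("https://web.archive.org/save/" + url, timeout=60)
        # Fetch the live page for the Distant Reader's own processing.
        response = requests.get(url, timeout=60)
        response.raise_for_status()
        return response.text

Even firing the save request in the background and ignoring failures
would give readers a durable copy to cite later.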
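For the metadata route in (b), something as small as the following
would already help an archive on ingest. The file name and fields are
placeholders for whatever schema (Dublin Core, DataCite, ...) you
agree on with the archive:

    import json
    import zipfile

    def add_carrel_metadata(zip_path, record):
        """Append a machine-readable metadata file to an existing carrel.

        Sketch only: 'metadata.json' and the record's fields are
        illustrative, not an existing part of the Distant Reader output.
        """
        with zipfile.ZipFile(zip_path, "a") as carrel:
            carrel.writestr("metadata.json", json.dumps(record, indent=2))

    # For example:
    # add_carrel_metadata("love-stories.zip",
    #                     {"title": "460 love stories",
    #                      "creator": "Distant Reader",
    #                      "type": "Dataset",
    #                      "format": "application/zip"})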
Oh, and the 'About your study carrel' page needs a colophon with links
to the software, its version, etc.
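Something along these lines is what I mean by a colophon; the wording
and links are only an example, and it assumes spaCy is importable at
carrel-build time:

    import sys
    import spacy

    def colophon_html():
        """Build a short colophon for the 'About your study carrel' page.

        Sketch only: the point is simply to record which software, at
        which version, produced the carrel.
        """
        return (
            "<h2>Colophon</h2>\n"
            "<p>This study carrel was created by the "
            '<a href="https://distantreader.org">Distant Reader</a>, '
            f"using Python {sys.version.split()[0]} and "
            f'<a href="https://spacy.io">spaCy</a> {spacy.__version__}.</p>'
        )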
cheers
stuart
--
...let us be heard from red core to black sky
On Fri, 14 Jun 2019 at 04:03, Eric Lease Morgan <[log in to unmask]> wrote:
>
> On Jun 12, 2019, at 8:40 PM, Stuart A. Yeates <[log in to unmask]> wrote:
>
> >> The Distant Reader [0] harvests an arbitrary number of user-supplied files or links to files, transforms them into plain text files, and performs numerous natural language processes against them. The result is a large set of indexes that can be used to "read" the given corpus. I have made available the about pages of a number of such indexes:
> >>
> >> * Code4Lib Journal - http://dh.crc.nd.edu/tmp/code4lib-journal/about.html
> >> o 1,234,348 words; 303 documents
> >> o all articles from a journal named Code4Lib Journal
> >
> > Taking a look at distant reader (which I don't believe I've looked at before):
> >
> > (a) It would be great to sanity-check the corpus by running language
> > identification on each of the files
>
> Stuart, thank you for the feedback. As of right now, the Distant Reader is only designed to process English language materials. Since it (that is, I) relies on a Python module called spaCy to do the part-of-speech and named-entity extraction, I ought to be able to handle other Romance languages without too much difficulty. [1]
>
>
> > (b) There are a whole flotilla of technical identifiers that could
> > usefully be extracted from the text files (DOIs, ISBNs, ISSNs, etc)
>
> This is a fun idea, and I will investigate it further.
>
>
> > (c) A little webification of the texts would go a long way
>
> Hmmm... The plain text versions of the documents are necessary for the natural language processing, but instead of returning links to the plain text I could return links to the cached versions of the texts which are usually formatted in HTML or as PDF. Thus, a part of the reading process would be made easier.
>
>
> > (d) Has thought been put into making them data archive-friendly?
>
> I don't understand. In this case, what does "archive-friendly" mean?
>
>
> For a good time, I created a new data set -- 460 love stories (238 million words; 460 documents; 5.94 GB uncompressed)
>
> * about page - http://dh.crc.nd.edu/sandbox/reader/hackaton/love-stories/about.html
> * data set ("study carrel") - http://dh.crc.nd.edu/sandbox/reader/hackaton/love-stories.zip
>
> Again, thank you for the feedback.
>
>
> [0] Distant Reader - https://distantreader.org
> [1] spaCy - https://spacy.io/models
>
> --
> Eric Lease Morgan
> Digital Initiatives Librarian, Navari Family Center for Digital Scholarship
> Hesburgh Libraries
>
> University of Notre Dame
> 250E Hesburgh Library
> Notre Dame, IN 46556
> o: 574-631-8604
> e: [log in to unmask]
> w: cds.library.nd.edu