Great stuff, Eric.
I’ve just seen another interesting take based (mainly) on data in the EEBO-TCP release: https://scalablereading.northwestern.edu/2015/06/07/shakespeare-his-contemporaries-shc-released/
It mentions MorphAdorner, which does some clever stuff around tagging parts of speech, spelling variations, lemmata, etc., and another tool which I hadn’t come across before, AnnoLex, “for the correction and annotation of lexical data in Early Modern texts”.
This paper from Alistair Baron and Andrew Hardie at the University of Lancaster in the UK, about preparing EEBO-TCP texts for corpus-based analysis, may also be of interest. The team at Lancaster have also developed a tool called VARD, which supports pre-processing texts.
Owen Stephens Consulting
Telephone: 0121 288 6936
> On 7 Jun 2015, at 18:48, Eric Lease Morgan wrote:
> Here are some developments from my playing with the EEBO data.
> I used the repository on Box to get my content, and I mirrored it locally. [1, 2] I then looped through the content using XPath to extract rudimentary metadata, thus creating a “catalog” (index). [3, 4] Along the way I calculated the number of words in each document and saved that as a field of each “record”. Because the catalog is a tab-delimited file, it is trivial to import it into my favorite spreadsheet, database, editor, or statistics program. This allowed me to browse the collection.
> I then used grep to search my catalog, and saved the results to a file. [5] I searched for Richard Baxter. [6, 7, 8] I then used an R script to graph the numeric data of my search results. Currently, there are only two types: 1) dates, and 2) number of words. [9, 10, 11, 12] From these graphs I can tell that Baxter wrote a lot of relatively short things, and I can easily see when he published many of his works. (He published a lot around 1680 but little in 1665.)
> I then transformed the search results into a browsable HTML table. [13] The table has hidden features. (Can you say, “Usability?”) For example, you can click on table headers to sort. This is cool because I want to sort things by number of words. (Number of pages doesn’t really tell me anything about length.) There is also a hidden link to the left of each record. Upon clicking on the blank space you can see subjects, publisher, language, and a link to the raw XML.
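The cataloguing and searching steps described above could be sketched roughly as follows. This is not Eric's xml2tab.pl, just a minimal Python illustration of the same idea. It assumes the files are TEI P5 in the standard TEI namespace, that the original publication date sits under sourceDesc/biblFull/publicationStmt, and that a plain substring match is a fair stand-in for grep. The function names and the sample record are invented for the example.

```python
import xml.etree.ElementTree as ET

# Assumption: the files use the standard TEI P5 namespace.
NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Invented sample record, only here to make the sketch runnable.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><fileDesc>
    <titleStmt><title>An example treatise</title><author>Baxter, Richard</author></titleStmt>
    <sourceDesc><biblFull><publicationStmt><date>1680</date></publicationStmt></biblFull></sourceDesc>
  </fileDesc></teiHeader>
  <text><body><p>one two three four five</p></body></text>
</TEI>"""


def first_text(root, path):
    """Return the whitespace-normalised text of the first match, or ''."""
    node = root.find(path, NS)
    if node is None:
        return ""
    return " ".join("".join(node.itertext()).split())


def catalog_row(xml_text):
    """Build one tab-delimited catalog record: author, title, date, words."""
    root = ET.fromstring(xml_text)
    author = first_text(root, ".//tei:titleStmt/tei:author")
    title = first_text(root, ".//tei:titleStmt/tei:title")
    # Assumed location of the original publication date.
    date = first_text(
        root, ".//tei:sourceDesc/tei:biblFull/tei:publicationStmt/tei:date"
    )
    # Word count: whitespace-separated tokens inside the <text> element.
    text = root.find(".//tei:text", NS)
    words = len("".join(text.itertext()).split()) if text is not None else 0
    return "\t".join([author, title, date, str(words)])


def search(rows, term):
    """Case-insensitive substring match, a rough stand-in for grep -i."""
    return [row for row in rows if term.lower() in row.lower()]


if __name__ == "__main__":
    print(catalog_row(SAMPLE))
```

Writing one row per file to a text file gives something that opens directly in a spreadsheet or in R, which is what makes the later graphing and sorting steps cheap.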
> For a good time, I then repeated the process for things Shakespeare and things astronomy. [14, 15] Baxter took me about twelve hours’ worth of work, not counting the caching of the data. Combined, Shakespeare and astronomy took me less than five minutes. I then got tired.
> My next steps are multi-faceted and presented in the following incomplete unordered list:
> * create browsable lists - the TEI metadata is clean and
> consistent. The authors and subjects lend themselves very well to
> the creation of browsable lists.
> * CGI interface - The ability to search via Web interface is
> imperative, and indexing is a prerequisite.
> * transform into HTML - TEI/XML is cool, but…
> * create sets - The collection as a whole is very interesting,
> but many scholars will want sub-sets of the collection. I will do
> this sort of work, akin to my work with the HathiTrust. [16]
> * do text analysis - This is really the whole point. Given the
> full text combined with the inherent functionality of a computer,
> additional analysis and interpretation can be done against the
> corpus or its subsets. This analysis can be based on the counting of
> words, the association of themes, parts-of-speech, etc. For
> example, I plan to give each item in the collection a colors,
> “big” names, and “great” ideas coefficient. These are scores
> denoting the use of researcher-defined “themes”. [17, 18, 19] You
> can see how these themes play out against the complete writings
> of “Dead White Men With Three Names”. [20, 21, 22]
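One possible reading of the theme "coefficient" idea above is the share of a text's words that appear in a researcher-defined word list. The message does not say how the scores are actually calculated, so the formula, the tokenisation, and the function name below are all assumptions made for illustration.

```python
import re


def theme_coefficient(text, theme_words):
    """Fraction of tokens in text that appear in the theme word list.

    Assumption: a token is a run of ASCII letters, compared case-insensitively.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    theme = {word.lower() for word in theme_words}
    hits = sum(1 for token in tokens if token in theme)
    return hits / len(tokens)


if __name__ == "__main__":
    colors = ["red", "blue", "green"]  # stand-in for a theme list such as theme-colors.txt
    print(theme_coefficient("The red rose and the blue sky", colors))
```

Dividing by the total number of tokens keeps long and short documents comparable, which matters in a collection where the word counts vary as much as they do here.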
> Fun with TEI/XML, text mining, and the definition of librarianship.
> [1] Box - http://bit.ly/1QcvxLP
> [2] mirror - http://dh.crc.nd.edu/sandbox/eebo-tcp/xml/
> [3] xpath script - http://dh.crc.nd.edu/sandbox/eebo-tcp/bin/xml2tab.pl
> [4] catalog (index) - http://dh.crc.nd.edu/sandbox/eebo-tcp/catalog.txt
> [5] search results - http://dh.crc.nd.edu/sandbox/eebo-tcp/baxter/baxter.txt
> [6] Baxter at VIAF - http://viaf.org/viaf/54178741
> [7] Baxter at WorldCat - http://www.worldcat.org/wcidentities/lccn-n50-5510
> [8] Baxter at Wikipedia - http://en.wikipedia.org/wiki/Richard_Baxter
> [9] box plot of dates - http://dh.crc.nd.edu/sandbox/eebo-tcp/baxter/boxplot-dates.png
> [10] box plot of words - http://dh.crc.nd.edu/sandbox/eebo-tcp/baxter/boxplot-words.png
> [11] histogram of dates - http://dh.crc.nd.edu/sandbox/eebo-tcp/baxter/histogram-dates.png
> [12] histogram of words - http://dh.crc.nd.edu/sandbox/eebo-tcp/baxter/histogram-words.png
> [13] HTML - http://dh.crc.nd.edu/sandbox/eebo-tcp/baxter/baxter.html
> [14] Shakespeare - http://dh.crc.nd.edu/sandbox/eebo-tcp/shakespeare/
> [15] astronomy - http://dh.crc.nd.edu/sandbox/eebo-tcp/astronomy/
> [16] HathiTrust work - http://blogs.nd.edu/emorgan/2015/06/browser-on-github/
> [17] colors - http://dh.crc.nd.edu/sandbox/htrc-workset-browser/etc/theme-colors.txt
> [18] “big” names - http://dh.crc.nd.edu/sandbox/htrc-workset-browser/etc/theme-names.txt
> [19] “great” ideas - http://dh.crc.nd.edu/sandbox/htrc-workset-browser/etc/theme-ideas.txt
> [20] Thoreau - http://dh.crc.nd.edu/sandbox/htrc-workset-browser/thoreau/about.html
> [21] Emerson - http://dh.crc.nd.edu/sandbox/htrc-workset-browser/emerson/about.html
> [22] Channing - http://dh.crc.nd.edu/sandbox/htrc-workset-browser/channing/about.html
> Eric Lease Morgan, Librarian
> University of Notre Dame