Great stuff Eric.
I’ve just seen another interesting take based (mainly) on data in the EEBO-TCP release: https://scalablereading.northwestern.edu/2015/06/07/shakespeare-his-contemporaries-shc-released/
It mentions MorphAdorner[1], which does some clever stuff around tagging parts of speech, spelling variations, lemmata etc., and another tool which I hadn’t come across before, AnnoLex[2], “for the correction and annotation of lexical data in Early Modern texts”.
This paper[3] from Alistair Baron and Andrew Hardie at the University of Lancaster in the UK, about preparing EEBO-TCP texts for corpus-based analysis, may also be of interest. The team at Lancaster have also developed a tool called VARD, which supports pre-processing texts[4].
Owen
[1] http://morphadorner.northwestern.edu
[2] http://annolex.at.northwestern.edu
[3] http://eprints.lancs.ac.uk/60272/1/Baron_Hardie.pdf
[4] http://ucrel.lancs.ac.uk/vard/about/
Owen Stephens
Owen Stephens Consulting
Web: http://www.ostephens.com
Email: [log in to unmask]
Telephone: 0121 288 6936
> On 7 Jun 2015, at 18:48, Eric Lease Morgan <[log in to unmask]> wrote:
>
> Here are some developments from my playing with the EEBO data.
>
> I used the repository on Box to get my content, and I mirrored it locally. [1, 2] I then looped through the content using XPath to extract rudimentary metadata, thus creating a “catalog” (index). [3, 4] Along the way I calculated the number of words in each document and saved that as a field of each “record”. Being a tab-delimited file, it is trivial to import the catalog into my favorite spreadsheet, database, editor, or statistics program. This allowed me to browse the collection. I then used grep to search my catalog, and saved the results to a file. [5] I searched for Richard Baxter. [6, 7, 8] I then used an R script to graph the numeric data of my search results. Currently, there are only two types: 1) dates, and 2) number of words. [9, 10, 11, 12] From these graphs I can tell that Baxter wrote a lot of relatively short things, and I can easily see when he published many of his works. (He published a lot around 1680 but little in 1665.) I then transformed the search results into a browsable HTML table. [13] The table has hidden features. (Can you say, “Usability?”) For example, you can click on table headers to sort. This is cool because I want to sort things by number of words. (Number of pages doesn’t really tell me anything about length.) There is also a hidden link to the left of each record. Upon clicking on the blank space you can see subjects, publisher, language, and a link to the raw XML.
>
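
For anyone wanting to try the catalog step described above, here is a minimal sketch in Python. It is not the Perl script at [3]. It assumes the files use the TEI P5 namespace, and the element paths, the sample record, and the function names (catalog_record, search) are all made up for illustration, so check them against the real XML first.

```python
import re
import xml.etree.ElementTree as ET

# Assumption: the files declare the TEI P5 namespace.
TEI = {"tei": "http://www.tei-c.org/ns/1.0"}

# Invented sample record, only here so the sketch runs on its own.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><fileDesc>
    <titleStmt><title>A sample treatise</title><author>Baxter, Richard</author></titleStmt>
    <sourceDesc><biblFull><publicationStmt><date>1680</date></publicationStmt></biblFull></sourceDesc>
  </fileDesc></teiHeader>
  <text><body><p>one two three four five</p></body></text>
</TEI>"""

def first_text(root, path):
    """Whitespace-normalized text of the first element matching path, or ''."""
    node = root.find(path, TEI)
    return " ".join("".join(node.itertext()).split()) if node is not None else ""

def catalog_record(identifier, xml_string):
    """Build one catalog row: id, author, title, date, number of words."""
    root = ET.fromstring(xml_string)
    body = root.find(".//tei:text", TEI)
    words = re.findall(r"\w+", "".join(body.itertext())) if body is not None else []
    return [
        identifier,
        first_text(root, ".//tei:titleStmt/tei:author"),
        first_text(root, ".//tei:titleStmt/tei:title"),
        first_text(root, ".//tei:sourceDesc//tei:date"),  # assumed location of the date
        str(len(words)),
    ]

def search(rows, pattern):
    """Case-insensitive, grep-like filter over tab-delimited catalog lines."""
    regex = re.compile(pattern, re.IGNORECASE)
    return [row for row in rows if regex.search(row)]

record = catalog_record("A00001", SAMPLE)
catalog = ["\t".join(record)]      # one tab-delimited line per document
hits = search(catalog, "baxter")   # stands in for grepping the catalog file
```

Writing the lines in catalog to a file gives something that can be opened in a spreadsheet or searched with grep itself.
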
> For a good time, I then repeated the process for things Shakespeare and things astronomy. [14, 15] Baxter took me about twelve hours’ worth of work, not counting the caching of the data. Combined, Shakespeare and astronomy took me less than five minutes. I then got tired.
>
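
Since the same steps get repeated for each search, the numeric summaries (dates and number of words) could be wrapped in one small function. This is a rough Python sketch, not the R script mentioned above. The rows are invented, and summarize is a hypothetical name.

```python
from collections import Counter
from statistics import median

def summarize(rows):
    """rows: iterable of (year, word_count) integer pairs from one search result."""
    rows = list(rows)
    years = [year for year, _ in rows]
    words = [count for _, count in rows]
    # Counting works per decade is a crude text-only stand-in for a histogram of dates.
    by_decade = Counter((year // 10) * 10 for year in years)
    return {
        "median_year": median(years),
        "median_words": median(words),
        "by_decade": dict(sorted(by_decade.items())),
    }

# Invented (year, number of words) pairs, for illustration only.
rows = [(1658, 12000), (1665, 3000), (1680, 8000), (1681, 5000), (1684, 40000)]
summary = summarize(rows)
```

The same function can then be pointed at any other search result without changes.
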
> My next steps are multi-faceted and presented in the following incomplete unordered list:
>
> * create browsable lists - the TEI metadata is clean and
> consistent. The authors and subjects lend themselves very well to
> the creation of browsable lists.
>
> * CGI interface - The ability to search via Web interface is
> imperative, and indexing is a prerequisite.
>
> * transform into HTML - TEI/XML is cool, but…
>
> * create sets - The collection as a whole is very interesting,
> but many scholars will want sub-sets of the collection. I will do
> this sort of work, akin to my work with the HathiTrust. [16]
>
> * do text analysis - This is really the whole point. Given the
> full text combined with the inherent functionality of a computer,
> additional analysis and interpretation can be done against the
> corpus or its subsets. This analysis can be based on the counting of
> words, the association of themes, parts-of-speech, etc. For
> example, I plan to give each item in the collection a colors,
> “big” names, and “great” ideas coefficient. These are scores
> denoting the use of researcher-defined “themes”. [17, 18, 19] You
> can see how these themes play out against the complete writings
> of “Dead White Men With Three Names”. [20, 21, 22]
>
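
One possible reading of a theme “coefficient”, sketched in Python: the share of a document's words that appear in a researcher-defined word list. The message above does not give the formula, so this ratio is an assumption. The word list and sample text below are invented rather than taken from [17, 18, 19], and theme_coefficient is a hypothetical name.

```python
import re

def theme_coefficient(text, theme_words):
    """Fraction of tokens in text that appear in the theme word list (assumed formula)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    lexicon = {word.lower() for word in theme_words}
    hits = sum(1 for token in tokens if token in lexicon)
    return hits / len(tokens)

colors = ["red", "green", "blue"]        # hypothetical theme list
text = "The red rose and the blue sky"   # invented sample text
score = theme_coefficient(text, colors)
```

Dividing by the total number of words keeps long and short documents comparable, which matters for a collection where lengths vary so much.
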
> Fun with TEI/XML, text mining, and the definition of librarianship.
>
>
> [1] Box - http://bit.ly/1QcvxLP
> [2] mirror - http://dh.crc.nd.edu/sandbox/eebo-tcp/xml/
> [3] xpath script - http://dh.crc.nd.edu/sandbox/eebo-tcp/bin/xml2tab.pl
> [4] catalog (index) - http://dh.crc.nd.edu/sandbox/eebo-tcp/catalog.txt
> [5] search results - http://dh.crc.nd.edu/sandbox/eebo-tcp/baxter/baxter.txt
> [6] Baxter at VIAF - http://viaf.org/viaf/54178741
> [7] Baxter at WorldCat - http://www.worldcat.org/wcidentities/lccn-n50-5510
> [8] Baxter at Wikipedia - http://en.wikipedia.org/wiki/Richard_Baxter
> [9] box plot of dates - http://dh.crc.nd.edu/sandbox/eebo-tcp/baxter/boxplot-dates.png
> [10] box plot of words - http://dh.crc.nd.edu/sandbox/eebo-tcp/baxter/boxplot-words.png
> [11] histogram of dates - http://dh.crc.nd.edu/sandbox/eebo-tcp/baxter/histogram-dates.png
> [12] histogram of words - http://dh.crc.nd.edu/sandbox/eebo-tcp/baxter/histogram-words.png
> [13] HTML - http://dh.crc.nd.edu/sandbox/eebo-tcp/baxter/baxter.html
> [14] Shakespeare - http://dh.crc.nd.edu/sandbox/eebo-tcp/shakespeare/
> [15] astronomy - http://dh.crc.nd.edu/sandbox/eebo-tcp/astronomy/
> [16] HathiTrust work - http://blogs.nd.edu/emorgan/2015/06/browser-on-github/
> [17] colors - http://dh.crc.nd.edu/sandbox/htrc-workset-browser/etc/theme-colors.txt
> [18] “big” names - http://dh.crc.nd.edu/sandbox/htrc-workset-browser/etc/theme-names.txt
> [19] “great” ideas - http://dh.crc.nd.edu/sandbox/htrc-workset-browser/etc/theme-ideas.txt
> [20] Thoreau - http://dh.crc.nd.edu/sandbox/htrc-workset-browser/thoreau/about.html
> [21] Emerson - http://dh.crc.nd.edu/sandbox/htrc-workset-browser/emerson/about.html
> [22] Channing - http://dh.crc.nd.edu/sandbox/htrc-workset-browser/channing/about.html
>
>
> —
> Eric Lease Morgan, Librarian
> University of Notre Dame