Here are some developments from my playing with the EEBO data.

I used the repository on Box to get my content, and I mirrored it locally. [1, 2] I then looped through the content with an XPath script to extract rudimentary metadata, thus creating a “catalog” (index). [3, 4] Along the way I calculated the number of words in each document and saved the result as a field of each “record”. Because the catalog is a tab-delimited file, it is trivial to import into my favorite spreadsheet, database, editor, or statistics program, and this allowed me to browse the collection.

I then used grep to search my catalog and save the results to a file. [5] I searched for Richard Baxter. [6, 7, 8] Next I used an R script to graph the numeric data of my search results. Currently, there are only two types: 1) dates, and 2) numbers of words. [9, 10, 11, 12] From these graphs I can tell that Baxter wrote a lot of relatively short things, and I can easily see when he published many of his works. (He published a lot around 1680 but little in 1665.)

I then transformed the search results into a browsable HTML table. [13] The table has hidden features. (Can you say, “Usability?”) For example, you can click on the table headers to sort. This is cool because I want to sort things by number of words. (Number of pages doesn’t really tell me anything about length.) There is also a hidden link to the left of each record. Upon clicking on the blank space you can see subjects, publisher, language, and a link to the raw XML.
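For the curious, the cataloging step might be sketched like this in Python. The file layout, element paths, and TEI namespace below are my assumptions about the EEBO files, not the actual script:

```python
import csv
import glob
import os
import xml.etree.ElementTree as ET

TEI = {"tei": "http://www.tei-c.org/ns/1.0"}  # assumed TEI P5 namespace

def record_from_tei(xml_text):
    """Extract (title, author, date, words) from one TEI document."""
    root = ET.fromstring(xml_text)

    def first(elem, path):
        node = elem.find(path, TEI) if elem is not None else None
        return node.text.strip() if node is not None and node.text else ""

    title = first(root, ".//tei:titleStmt/tei:title")
    author = first(root, ".//tei:titleStmt/tei:author")
    date = first(root.find(".//tei:sourceDesc", TEI), ".//tei:date")

    # crude word count: join all the text in the body, split on whitespace
    body = root.find(".//tei:body", TEI)
    words = len(" ".join(body.itertext()).split()) if body is not None else 0
    return title, author, date, words

def build_catalog(mirror_dir, catalog_file):
    """Loop through the local mirror; write one tab-delimited record per file."""
    with open(catalog_file, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out, delimiter="\t")
        writer.writerow(["file", "title", "author", "date", "words"])
        for path in sorted(glob.glob(os.path.join(mirror_dir, "*.xml"))):
            with open(path, encoding="utf-8") as handle:
                record = record_from_tei(handle.read())
            writer.writerow([os.path.basename(path), *record])
```

The resulting tab-delimited file imports directly into a spreadsheet or statistics program, and a simple grep against it reproduces the searching step.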

For a good time, I then repeated the process for things Shakespeare and things astronomy. [14, 15] Baxter took me about twelve hours’ worth of work, not counting the caching of the data. Combined, Shakespeare and astronomy took me less than five minutes. I then got tired.

My next steps are multi-faceted and presented in the following incomplete unordered list:

  * create browsable lists - The TEI metadata is clean and
    consistent. The authors and subjects lend themselves very well to
    the creation of browsable lists.

  * CGI interface - The ability to search via Web interface is
    imperative, and indexing is a prerequisite.

  * transform into HTML - TEI/XML is cool, but…

  * create sets - The collection as a whole is very interesting,
    but many scholars will want sub-sets of the collection. I will do
    this sort of work, akin to my work with the HathiTrust. [16]

  * do text analysis - This is really the whole point. Given the
    full text combined with the inherent functionality of a computer,
    additional analysis and interpretation can be done against the
    corpus or its subsets. This analysis can be based on the counting
    of words, the association of themes, parts-of-speech, etc. For
    example, I plan to give each item in the collection colors, “big”
    names, and “great” ideas coefficients. These are scores
    denoting the use of researcher-defined “themes”. [17, 18, 19] You
    can see how these themes play out against the complete writings
    of “Dead White Men With Three Names”. [20, 21, 22]
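As a sketch of how such theme coefficients might be computed (the word lists and the per-1,000-words normalization below are my assumptions, not the actual method [17, 18, 19]):

```python
import re

# Hypothetical, researcher-defined "themes": each is just a bag of words.
THEMES = {
    "colors": {"red", "green", "blue", "white", "black", "yellow"},
    "big-names": {"aristotle", "plato", "augustine", "calvin"},
    "great-ideas": {"truth", "beauty", "justice", "liberty", "god"},
}

def theme_coefficients(text, themes=THEMES):
    """Score a document: theme-word occurrences per 1,000 words."""
    words = re.findall(r"[a-z]+", text.lower())
    total = len(words) or 1  # avoid division by zero on empty documents
    return {
        name: round(1000 * sum(1 for w in words if w in bag) / total, 2)
        for name, bag in themes.items()
    }
```

Run against each record’s full text, the scores become three more numeric columns in the catalog, ready for sorting and graphing like the dates and word counts.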

Fun with TEI/XML, text mining, and the definition of librarianship.

 [1] Box -
 [2] mirror -
 [3] xpath script -
 [4] catalog (index) -
 [5] search results -
 [6] Baxter at VIAF -
 [7] Baxter at WorldCat -
 [8] Baxter at Wikipedia -
 [9] box plot of dates -
[10] box plot of words -
[11] histogram of dates -
[12] histogram of words -
[13] HTML -
[14] Shakespeare -
[15] astronomy -
[16] HathiTrust work -
[17] colors -
[18] “big” names -
[19] “great” ideas -
[20] Thoreau -
[21] Emerson -
[22] Channing -

Eric Lease Morgan, Librarian
University of Notre Dame