On Jun 8, 2015, at 7:32 AM, Owen Stephens <[log in to unmask]> wrote:

> I’ve just seen another interesting take based (mainly) on data in the EEBO-TCP release:
> It includes mention of MorphAdorner[1], which does some clever stuff around tagging parts of speech, spelling variations, lemmata, etc., and another tool which I hadn’t come across before, AnnoLex[2], “for the correction and annotation of lexical data in Early Modern texts”.
> This paper[3] from Alistair Baron and Andrew Hardie at the University of Lancaster in the UK, about preparing EEBO-TCP texts for corpus-based analysis, may also be of interest, and the team at Lancaster have developed a tool called VARD which supports pre-processing texts[4].
> [1]
> [2]
> [3]
> [4]

All of this is really very interesting. Really. At the same time, there seems to be a WHOLE lot of effort spent on cleaning and normalizing data, and very little done to actually analyze it beyond “close reading”. The final goal of all these interfaces seems to be refined search. Frankly, I don’t need search, and the only community that will want this level of search is the scholarly scholar. What about the undergraduate student? What about the just-more-than-casual reader? What about the engineer? Most people don’t know how or why parts of speech are important, let alone what a lemma is. Nor do they care.

I can find plenty of things. I need (want) analysis. Let’s assume the data is clean (or rather, accept the fact that there is dirty data, akin to the dirty data created through OCR, and that there is nothing a person can do about it) and then let’s see some automated comparisons between texts. Examples might include:

  * this one is longer
  * this one is shorter
  * this one includes more action
  * this one discusses such & such theme more than this one
  * so & so theme came and went during a particular time period
  * the meaning of this phrase changed over time
  * the author’s message of this text is…
  * this given play asserts the following facts
  * here is a map illustrating where the protagonist went when
  * a summary of this text includes…
  * this work is fiction
  * this work is non-fiction
  * this work was probably influenced by…

We don’t need perfect texts before analysis can be done. Sure, perfect texts help, but they are not necessary. Observations and generalizations can be made even without perfectly transcribed texts.
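
To make the first few items in the list above a bit more concrete, here is a minimal Python sketch of what I have in mind. It is only an illustration, not a finished tool: the function names (tokenize, pick, compare) and the idea of describing a "theme" as a hand-picked list of words are my own assumptions, and the tokenizer is deliberately crude. It does no cleaning or normalization at all, so variant spellings only count if they are listed explicitly.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its runs of ASCII letters as tokens."""
    return re.findall(r"[a-z]+", text.lower())


def pick(value_a, value_b):
    """Return 'a', 'b', or 'tie' depending on which value is larger."""
    if value_a > value_b:
        return "a"
    if value_b > value_a:
        return "b"
    return "tie"


def compare(text_a, text_b, theme_words):
    """Compare two texts by length and by how often a theme's words occur.

    theme_words is a hypothetical, hand-picked word list standing in for
    a "theme"; nothing here tries to discover themes automatically.
    """
    tokens_a, tokens_b = tokenize(text_a), tokenize(text_b)
    theme = {word.lower() for word in theme_words}

    def theme_rate(tokens):
        # Share of all tokens that appear in the theme word list.
        if not tokens:
            return 0.0
        counts = Counter(tokens)
        return sum(counts[word] for word in theme) / len(tokens)

    rate_a, rate_b = theme_rate(tokens_a), theme_rate(tokens_b)
    return {
        "lengths": (len(tokens_a), len(tokens_b)),
        "longer": pick(len(tokens_a), len(tokens_b)),
        "theme_rates": (rate_a, rate_b),
        "more_theme": pick(rate_a, rate_b),
    }
```

Calling compare(text_a, text_b, ["war", "warre"]) on two transcriptions returns which one is longer and which one spends a larger share of its words on the theme. Dirty data will skew the numbers somewhat, but the comparison can still be made.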