Another thing that could usefully be done is significantly better authority
control: authors, works, geographical places, subjects, and so on.

Good core librarianship stuff that is essentially orthogonal to all the
other work that appears to be happening.
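
For instance, a first pass at reconciling author names might query a
shared authority file such as VIAF. Here is a minimal Python sketch (the
AutoSuggest endpoint and its JSON shape are assumptions on my part; check
the current VIAF documentation before relying on them):

    # Toy authority-control reconciliation: look up a personal name
    # against VIAF and print candidate headings with their identifiers.
    import json
    import urllib.parse
    import urllib.request

    def viaf_candidates(name):
        url = ("https://viaf.org/viaf/AutoSuggest?query="
               + urllib.parse.quote(name))
        with urllib.request.urlopen(url) as response:
            data = json.load(response)
        # "result" may be null when there are no hits.
        return [(hit.get("term"), hit.get("viafid"))
                for hit in (data.get("result") or [])]

    for heading, viaf_id in viaf_candidates("Shakespeare, William"):
        print(viaf_id, heading)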

cheers
stuart

--
...let us be heard from red core to black sky

On Tue, Jun 9, 2015 at 12:42 AM, Eric Lease Morgan <[log in to unmask]> wrote:

> On Jun 8, 2015, at 7:32 AM, Owen Stephens <[log in to unmask]> wrote:
>
> > I’ve just seen another interesting take based (mainly) on data in the
> > EEBO-TCP release:
> >
> > https://scalablereading.northwestern.edu/2015/06/07/shakespeare-his-contemporaries-shc-released/
> >
> > It includes mention of MorphAdorner[1], which does some clever stuff
> > around tagging parts of speech, spelling variations, lemmata, etc., and
> > another tool which I hadn’t come across before, AnnoLex[2], “for the
> > correction and annotation of lexical data in Early Modern texts”.
> >
> > This paper[3] from Alistair Baron and Andrew Hardie at Lancaster
> > University in the UK, about preparing EEBO-TCP texts for corpus-based
> > analysis, may also be of interest. The team at Lancaster have developed
> > a tool called VARD, which supports pre-processing of such texts[4].
> >
> > [1] http://morphadorner.northwestern.edu
> > [2] http://annolex.at.northwestern.edu
> > [3] http://eprints.lancs.ac.uk/60272/1/Baron_Hardie.pdf
> > [4] http://ucrel.lancs.ac.uk/vard/about/
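> >
> > To give a flavour of the adornment these tools perform, here is a rough
> > Python sketch using NLTK as a stand-in (my choice purely for
> > illustration; MorphAdorner and VARD use their own models tuned to Early
> > Modern English, and NLTK resource names vary between versions):
> >
> >     # POS-tag and lemmatise a short Early Modern sentence with NLTK.
> >     # NLTK's default models target modern English, so expect rough
> >     # results on old spellings; that is the gap VARD addresses.
> >     import nltk
> >     from nltk.stem import WordNetLemmatizer
> >
> >     for resource in ("punkt", "averaged_perceptron_tagger", "wordnet"):
> >         nltk.download(resource, quiet=True)
> >
> >     text = "Loue looketh not with the eyes, but with the minde."
> >     lemmatizer = WordNetLemmatizer()
> >     for word, tag in nltk.pos_tag(nltk.word_tokenize(text)):
> >         # WordNet expects a coarse POS; map Penn Treebank tags crudely.
> >         wn_pos = {"J": "a", "V": "v", "R": "r"}.get(tag[0], "n")
> >         print(word, tag, lemmatizer.lemmatize(word.lower(), wn_pos))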
>
>
> All of this is really very interesting. Really. At the same time, there
> seems to be a WHOLE lot of effort spent on cleaning and normalizing data,
> and very little done to actually analyze it beyond “close reading”. The
> final goal of all these interfaces seems to be refined search. Frankly, I
> don’t need search. And the only community that will want this level of
> search is the scholarly scholar. “What about the undergraduate student?
> What about the just-more-than-casual reader? What about the engineer?”
> Most people don’t know how or why parts of speech are important, let
> alone what a lemma is. Nor do they care. I can find plenty of things. I
> need (want) analysis. Let’s assume the data is clean, or rather accept
> the fact that there is dirty data akin to the dirty data created through
> OCR and that there is nothing a person can do about it, and let’s see
> some automated comparisons between texts. Examples might include (see
> the sketch after this list):
>
>   * this one is longer
>   * this one is shorter
>   * this one includes more action
>   * this one discusses such & such theme more than this one
>   * so & so theme came and went during a particular time period
>   * the meaning of this phrase changed over time
>   * the author’s message of this text is…
>   * this given play asserts the following facts
>   * here is a map illustrating where the protagonist went when
>   * a summary of this text includes…
>   * this work is fiction
>   * this work is non-fiction
>   * this work was probably influenced by…
>
> We don’t need perfect texts before analysis can be done. Sure, perfect
> texts help, but they are not necessary. Observations and generalizations
> can be made even without perfectly transcribed texts.
>
> —
> ELM
>