And some of the researchers definitely care about this (authority control, high-quality descriptive metadata). I went to a hack day focussing on the EEBO-TCP Phase 1 release (these texts). I mentioned to one of the researchers (not a librarian) that I had access to some MARC records which described the works. Their immediate response was “Ah - but which MARC records, because they aren’t all of the same quality!”

There are good cataloguing records for the works, but they have not been made available under an open licence alongside the transcribed texts. Probably the highest-quality records are those in the English Short Title Catalogue (ESTC).

There have been some great steps forward in the last few years, but I still feel libraries need to increase the amount they are doing to publish metadata under explicitly open licences.


Owen Stephens
Owen Stephens Consulting
Telephone: 0121 288 6936

> On 8 Jun 2015, at 23:23, Stuart A. Yeates wrote:
> Another thing that could usefully be done is significantly better authority
> control. Authors, works, geographical places, subjects, etc, etc.
> Good core librarianship stuff that is essentially orthogonal to all the
> other work that appears to be happening.
> cheers
> stuart
> --
> ...let us be heard from red core to black sky
> On Tue, Jun 9, 2015 at 12:42 AM, Eric Lease Morgan wrote:
>> On Jun 8, 2015, at 7:32 AM, Owen Stephens wrote:
>>> I’ve just seen another interesting take based (mainly) on data in the
>>> EEBO-TCP release:
>>> It includes mention of MorphAdorner[1] which does some clever stuff
>>> around tagging parts of speech, spelling variations, lemmata etc. and
>>> another tool which I hadn’t come across before, AnnoLex[2], “for the
>>> correction and annotation of lexical data in Early Modern texts”.
>>> This paper[3] from Alistair Baron and Andrew Hardie at the University of
>>> Lancaster in the UK about preparing EEBO-TCP texts for corpus-based
>>> analysis may also be of interest, and the team at Lancaster have developed
>>> a tool called VARD which supports pre-processing texts[4].
>>> [1]
>>> [2]
>>> [3]
>>> [4]
>> All of this is really very interesting. Really. At the same time, there
>> seems to be a WHOLE lot of effort spent on cleaning and normalizing data,
>> and very little done to actually analyze it beyond “close reading”. The
>> final goal of all these interfaces seems to be refined search. Frankly, I
>> don’t need search. And the only community who will want this level of
>> search will be the scholarly scholar. “What about the undergraduate
>> student? What about the just more than casual reader? What about the
>> engineer?” Most people don’t know how or why parts-of-speech are important
>> let alone what a lemma is. Nor do they care. I can find plenty of things. I
>> need (want) analysis. Let’s assume the data is clean (or rather, accept
>> the fact that there is dirty data akin to the dirty data created through
>> OCR and there is nothing a person can do about it). Let’s see some
>> automated comparisons between texts. Examples might include:
>>  * this one is longer
>>  * this one is shorter
>>  * this one includes more action
>>  * this one discusses such & such theme more than this one
>>  * so & so theme came and went during a particular time period
>>  * the meaning of this phrase changed over time
>>  * the author’s message of this text is…
>>  * this given play asserts the following facts
>>  * here is a map illustrating where the protagonist went when
>>  * a summary of this text includes…
>>  * this work is fiction
>>  * this work is non-fiction
>>  * this work was probably influenced by…
>> We don’t need perfect texts before analysis can be done. Sure, perfect
>> texts help, but they are not necessary. Observations and generalizations
>> can be made even without perfectly transcribed texts.
>> —
>> ELM