Given a CONTENTdm item that has been OCRed, is it possible to download the OCRed text, and if so, then what shape does the URL take?
Using OAI-PMH, I can list all the records in a CONTENTdm set. Here is an abbreviated, redacted example of one record:
<record>
  <header>
    <identifier>oai:cdm1224.contentdm.oclc.org:p1224coll8/12</identifier>
    <datestamp>2015-08-06</datestamp>
    <setSpec>p1224coll8</setSpec>
  </header>
  <metadata>
    <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/..." >
      <dc:title>'A' Company underground</dc:title>
      <dc:publisher>Co. 'A' Underground</dc:publisher>
      <dc:date>1972</dc:date>
      <dc:language>English</dc:language>
      <dc:coverage>United States</dc:coverage>
      <dc:format>XML</dc:format>
      <dc:rights>Copyright in most of the documents...</dc:rights>
      <dc:source>foo ba</dc:source>
      <dc:type>Text; Image</dc:type>
      <dc:identifier>foobarNewsletter001000</dc:identifier>
      <dc:identifier>http://cdm1224.contentdm.oclc.org/cdm/ref/collection/p1224coll8/id/12</dc:identifier>
    </oai_dc:dc>
  </metadata>
</record>
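For illustration, here is a minimal Python sketch of the OAI-PMH step: it builds a ListRecords request URL and pulls every identifier out of one record. The endpoint path in the usage note (/oai/oai.php) is an assumption about the server, the function names are hypothetical, and SAMPLE is a trimmed copy of the record above with the standard OAI-PMH, oai_dc, and Dublin Core namespaces declared so that it parses on its own.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Standard namespace URIs for OAI-PMH 2.0 and Dublin Core elements.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Trimmed copy of the record shown above, with namespaces declared.
SAMPLE = """<record xmlns="http://www.openarchives.org/OAI/2.0/">
  <header>
    <identifier>oai:cdm1224.contentdm.oclc.org:p1224coll8/12</identifier>
    <setSpec>p1224coll8</setSpec>
  </header>
  <metadata>
    <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
               xmlns:dc="http://purl.org/dc/elements/1.1/">
      <dc:identifier>foobarNewsletter001000</dc:identifier>
      <dc:identifier>http://cdm1224.contentdm.oclc.org/cdm/ref/collection/p1224coll8/id/12</dc:identifier>
    </oai_dc:dc>
  </metadata>
</record>"""


def list_records_url(base_url, set_spec, prefix="oai_dc"):
    """Build an OAI-PMH ListRecords request URL for one set."""
    query = urlencode(
        {"verb": "ListRecords", "metadataPrefix": prefix, "set": set_spec}
    )
    return f"{base_url}?{query}"


def identifiers(record_xml):
    """Return the header identifier followed by every dc:identifier."""
    record = ET.fromstring(record_xml)
    found = [el.text for el in record.findall("oai:header/oai:identifier", NS)]
    found += [el.text for el in record.findall(".//dc:identifier", NS)]
    return found
```

Calling list_records_url("http://cdm1224.contentdm.oclc.org/oai/oai.php", "p1224coll8") would give the URL to fetch, if the server exposes OAI-PMH at that (assumed) path, and identifiers(SAMPLE) returns the three identifiers as a list.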
There are three identifiers in the record:
oai:cdm1224.contentdm.oclc.org:p1224coll8/12
foobarNewsletter001000
http://cdm1224.contentdm.oclc.org/cdm/ref/collection/p1224coll8/id/12
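The first and third identifiers appear to carry the same three pieces of information: a host name, a collection alias, and an item number. Here is a minimal Python sketch, assuming the two patterns seen in this one record hold for the rest of the set, that splits either identifier into those parts. The function names are hypothetical, and the sketch does not know where the OCRed text lives; it only recovers the parts that any such URL would presumably need.

```python
import re
from urllib.parse import urlparse


def parse_oai_identifier(identifier):
    """Split 'oai:<host>:<alias>/<number>' into (host, alias, number)."""
    scheme, host, local = identifier.split(":", 2)
    if scheme != "oai":
        raise ValueError(f"not an OAI identifier: {identifier!r}")
    alias, number = local.split("/", 1)
    return host, alias, int(number)


def parse_reference_url(url):
    """Split '.../collection/<alias>/id/<number>' into (host, alias, number)."""
    parsed = urlparse(url)
    match = re.search(r"/collection/([^/]+)/id/(\d+)", parsed.path)
    if match is None:
        raise ValueError(f"unrecognized reference URL: {url!r}")
    return parsed.netloc, match.group(1), int(match.group(2))
```

For the record above, both functions yield the host cdm1224.contentdm.oclc.org, the alias p1224coll8, and the item number 12.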
When I visit the third (and redacted) identifier, I am presented with a viewer page that offers a search box. When I search, my query terms are highlighted on the scanned image. Thus, I know the item has been OCRed.
Is it possible to reverse-engineer any one of the identifiers above so that it points to the OCRed text, and if so, then how?
In the end, I want to download the OCRed text of a given set of digitized content. I will also download the texts' bibliographics. Finally, I will use text mining and natural language processing to evaluate the content, look for patterns, and address a faculty member's research questions.
Using OAI-PMH I can get the bibliographics, but how can I get the OCRed text?
--
Eric Morgan
Center for Digital Scholarship
Hesburgh Libraries
University of Notre Dame