NONCONFIDENTIAL // EXTERNAL
Eric,
I don’t think it’s possible through an OAI-PMH request alone. However, given a collection alias and an item’s pointer value, you can make a separate request to the CONTENTdm dmGetItemInfo API endpoint, and you may be able to get the full text from the ‘transc’ (transcription) field in the response.
Example call for a response in JSON, based on your identifiers below:
https://cdm1224.contentdm.oclc.org/digital/bl/dmwebservices/index.php?q=dmGetItemInfo/p1224coll8/12/json
More info on the dmGetItemInfo endpoint here: https://help.oclc.org/Metadata_Services/CONTENTdm/Advanced_website_customization/API_Reference/CONTENTdm_API/CONTENTdm_Server_API_Functions_dmwebservices#dmGetItemInfo
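As a minimal sketch of the call above, the following Python builds the dmGetItemInfo URL from a host, collection alias, and pointer, then reads the 'transc' field from the JSON response. The function names are hypothetical, and 'transc' is only the field nickname suggested above; a given collection may use a different nickname or have no transcript field at all.

```python
import json
from urllib.request import urlopen


def item_info_url(host, alias, pointer):
    """Build a dmGetItemInfo URL in the same shape as the example call."""
    return (
        f"https://{host}/digital/bl/dmwebservices/index.php"
        f"?q=dmGetItemInfo/{alias}/{pointer}/json"
    )


def extract_transcript(info, field="transc"):
    """Return the text of the transcript field, or None if it is absent.

    Anything that is not a non-empty string is treated as missing.
    """
    value = info.get(field)
    if isinstance(value, str) and value.strip():
        return value.strip()
    return None


def fetch_transcript(host, alias, pointer, field="transc"):
    """Request the item's metadata as JSON and pull out the transcript field."""
    with urlopen(item_info_url(host, alias, pointer), timeout=30) as response:
        info = json.load(response)
    return extract_transcript(info, field)
```

For the identifiers in this thread, `fetch_transcript("cdm1224.contentdm.oclc.org", "p1224coll8", 12)` would request the example URL above. Whether it returns any text depends on how the collection is configured.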
Hope this helps.
-Matt
From: Code for Libraries <[log in to unmask]> on behalf of Eric Lease Morgan <[log in to unmask]>
Date: Thursday, October 17, 2024 at 12:04 PM
To: [log in to unmask] <[log in to unmask]>
Subject: [External] [CODE4LIB] contentdm and ocred text
Given a CONTENTdm item that has been OCRed, is it possible to download the OCRed text, and if so, then what shape does the URL take?
Using OAI-PMH I can list all the records in a CONTENTdm set. Here is an abbreviated, redacted example of a specific record:
<record>
<header>
<identifier>oai:cdm1224.contentdm.oclc.org:p1224coll8/12</identifier>
<datestamp>2015-08-06</datestamp>
<setSpec>p1224coll8</setSpec>
</header>
<metadata>
<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/..." >
<dc:title>'A' Company underground</dc:title>
<dc:publisher>Co. 'A' Underground</dc:publisher>
<dc:date>1972</dc:date>
<dc:language>English</dc:language>
<dc:coverage>United States</dc:coverage>
<dc:format>XML</dc:format>
<dc:rights>Copyright in most of the documents...</dc:rights>
<dc:source>foo ba</dc:source>
<dc:type>Text; Image</dc:type>
<dc:identifier>foobarNewsletter001000</dc:identifier>
<dc:identifier>http://cdm1224.contentdm.oclc.org/cdm/ref/collection/p1224coll8/id/12</dc:identifier>
</oai_dc:dc>
</metadata>
</record>
There are three identifiers in the record:
oai:cdm1224.contentdm.oclc.org:p1224coll8/12
foobarNewsletter001000
http://cdm1224.contentdm.oclc.org/cdm/ref/collection/p1224coll8/id/12
When I visit the third (and redacted) identifier, I am presented with a viewer page. The viewer page offers the opportunity to search. When I search, my query terms are highlighted on the scanned image. Thus, I know the item has been OCRed.
Is it possible to reverse-engineer any one of the identifiers, above, to point to the OCR'ed text, and if so, then how?
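One possible approach, sketched in Python: the first and third identifiers both embed the host, the collection alias (p1224coll8), and the item pointer (12), so a regular expression can recover them. The patterns below are assumptions based only on the two examples in this message, and the function name is hypothetical.

```python
import re

# Shape of the OAI identifier: oai:<host>:<alias>/<pointer>
OAI_PATTERN = re.compile(
    r"^oai:(?P<host>[^:]+):(?P<alias>[^/]+)/(?P<pointer>\d+)$"
)

# Shape of the reference URL: http(s)://<host>/cdm/ref/collection/<alias>/id/<pointer>
REF_PATTERN = re.compile(
    r"^https?://(?P<host>[^/]+)/cdm/ref/collection/"
    r"(?P<alias>[^/]+)/id/(?P<pointer>\d+)/?$"
)


def parse_identifier(identifier):
    """Return (host, alias, pointer) or None if the identifier matches neither shape."""
    for pattern in (OAI_PATTERN, REF_PATTERN):
        match = pattern.match(identifier.strip())
        if match:
            return match["host"], match["alias"], int(match["pointer"])
    return None
```

The second identifier (foobarNewsletter001000) matches neither pattern, so `parse_identifier` returns `None` for it. The alias and pointer are the two values an item-level CONTENTdm API request needs.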
In the end, I want to download the OCRed text of a given set of digitized content. I will also download the texts' bibliographics. Finally, I will use text mining and natural language processing to evaluate the content, look for patterns, and address a faculty member's research questions.
Using OAI-PMH I can get the bibliographics, but how can I get the OCRed text?
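For the harvesting side, here is a small Python sketch that pulls the header identifiers (and any resumption token) out of an OAI-PMH ListRecords or ListIdentifiers response. It assumes the response uses the standard OAI-PMH 2.0 namespace; the function name is hypothetical, and fetching the XML from the repository is left out.

```python
import xml.etree.ElementTree as ET

# Standard OAI-PMH 2.0 namespace used by the response envelope.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def header_identifiers(xml_text):
    """Return (identifiers, resumption_token) from an OAI-PMH response.

    resumption_token is None when the response has no token or an empty one,
    which signals that the list is complete.
    """
    root = ET.fromstring(xml_text)
    identifiers = [
        element.text.strip()
        for element in root.iterfind(".//oai:header/oai:identifier", OAI_NS)
        if element.text
    ]
    token_element = root.find(".//oai:resumptionToken", OAI_NS)
    token = None
    if token_element is not None and token_element.text:
        token = token_element.text.strip()
    return identifiers, token
```

Each identifier returned this way has the same shape as the first identifier in the record above, so it carries the collection alias and the item pointer.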
--
Eric Morgan
Center for Digital Scholarship
Hesburgh Libraries
University of Notre Dame