On Mon, 11 Feb 2019 at 11:51, Eric Lease Morgan <[log in to unmask]> wrote:
> I've finally figured out how to get raw OCR text out of the HathiTrust
> API, but it is really slow. Any hints out there?
> Am I missing something when it comes to the API?
You may have tried this already, but it seems that HathiTrust also offers
PDF- and EBM-formatted data at the volume level. Do those formats include
the OCR text? I have seen this done in PDF before (and I've done it
myself): the files contain bitmap page images, but the OCR text is also
there, in a layer beneath the images.
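
If the PDF route seems worth a try, a quick check along these lines might
show whether a downloaded volume actually carries a text layer. This is
only a sketch under assumptions: it relies on the third-party pypdf
package, the function names and the 20-character threshold are made up
for illustration, and it is untested against HathiTrust's PDFs.

```python
from typing import List


def extract_page_texts(pdf_path: str) -> List[str]:
    """Return the embedded text of each page ('' where a page has none)."""
    # pypdf is a third-party package (pip install pypdf). It is imported
    # here so that the helper below can be used without it installed.
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    return [(page.extract_text() or "") for page in reader.pages]


def text_layer_coverage(page_texts: List[str], min_chars: int = 20) -> float:
    """Fraction of pages with at least min_chars non-whitespace characters.

    The threshold is an arbitrary assumption, meant to skip pages that
    yield only a stray page number or a few noise characters.
    """
    if not page_texts:
        return 0.0
    with_text = sum(
        1 for text in page_texts if len("".join(text.split())) >= min_chars
    )
    return with_text / len(page_texts)
```

With a local file (the name "volume.pdf" is hypothetical), calling
text_layer_coverage(extract_page_texts("volume.pdf")) gives a number
between 0 and 1. A value near 1 suggests the OCR text is embedded and can
be pulled straight from the PDF; a value near 0 suggests the file holds
page images only.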