LISTSERV 16.5 - CODE4LIB Archives

On Mon, 11 Feb 2019 at 11:51, Eric Lease Morgan wrote:

>
> I've finally figured out how to get raw OCR text out of the HathiTrust
> API, but it is really slow. Any hints out there?

...


> Am I missing something when it comes to the API?
>

You may have tried this already, but it seems that Hathi also offer PDF-
and EBM-formatted data at the volume level. Do those formats include the
OCR text? I have seen this done in PDF before (and I've done it myself):
the files contain bitmap page images but the OCR text is also there, in a
layer beneath the images.
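
One quick way to check would be to download a single volume as PDF and see whether any text comes out of it. Below is a minimal sketch, not tested against Hathi's own files. It assumes the third-party pypdf library is installed, and the function names (extract_page_texts, pages_with_text) and the 20-character threshold are just illustrative choices of mine.

```python
from typing import List


def extract_page_texts(pdf_path: str) -> List[str]:
    """Return the embedded text of each page, one string per page."""
    # Assumption: the third-party pypdf package is available (pip install pypdf).
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    # extract_text() only reads text already stored in the PDF;
    # it performs no OCR on the page images.
    return [page.extract_text() or "" for page in reader.pages]


def pages_with_text(page_texts: List[str], min_chars: int = 20) -> int:
    """Count pages whose text holds at least min_chars non-whitespace characters."""
    return sum(
        1 for text in page_texts
        if len("".join(text.split())) >= min_chars
    )
```

Calling pages_with_text(extract_page_texts("volume.pdf")) on a downloaded file (the filename here is hypothetical) and getting a count close to the number of pages would suggest the OCR layer is present; a count of zero would suggest the PDF holds images only.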

-- 
Conal Tuohy
http://conaltuohy.com/
@conal_tuohy