On Mon, 11 Feb 2019 at 11:51, Eric Lease Morgan wrote:
>
> I've finally figured out how to get raw OCR text out of the HathiTrust
> API, but it is really slow. Any hints out there?
> ...
> Am I missing something when it comes to the API?
>

You may have tried this already, but it seems that Hathi also offer PDF- and EBM-formatted data at the volume level. Do those formats include the OCR text? I have seen this done in PDF before (and I've done it myself): the files contain bitmap page images but the OCR text is also there, in a layer beneath the images.

--
Conal Tuohy
http://conaltuohy.com/
@conal_tuohy
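
P.S. In case it helps, here is a minimal sketch of how one might check whether a downloaded PDF actually carries a hidden OCR text layer. It assumes the third-party pypdf library (my choice, not something Hathi provides), and the function names and the 20-character threshold are purely illustrative. I have not confirmed that the HathiTrust volume PDFs include such a layer; this only shows how you could find out for a file you already have.

```python
from typing import Iterable, List


def pages_with_text(page_texts: Iterable[str], min_chars: int = 20) -> List[int]:
    """Return the 0-based indexes of pages whose extracted text contains
    at least `min_chars` non-whitespace characters.

    The threshold is an arbitrary assumption: it is meant to ignore pages
    where extraction yields nothing or only stray marks.
    """
    hits = []
    for index, text in enumerate(page_texts):
        visible = "".join((text or "").split())  # drop all whitespace
        if len(visible) >= min_chars:
            hits.append(index)
    return hits


def extract_page_texts(pdf_path: str) -> List[str]:
    """Extract the text of each page with pypdf (pip install pypdf).

    For a scanned PDF with no text layer, every entry is normally empty.
    """
    # Imported here so pages_with_text() stays usable without pypdf installed.
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    return [page.extract_text() or "" for page in reader.pages]


def report(pdf_path: str) -> str:
    """Summarise how many pages of the PDF appear to carry a text layer."""
    texts = extract_page_texts(pdf_path)
    hits = pages_with_text(texts)
    return f"{len(hits)} of {len(texts)} pages appear to carry a text layer"
```

Calling `report("volume.pdf")` on a saved file (the file name is hypothetical) gives a quick answer. If most pages report text, pulling the OCR out of one PDF per volume may be quicker than requesting it page by page.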