On Feb 10, 2019, at 8:50 PM, Eric Lease Morgan wrote:
> I've finally figured out how to get raw OCR text out of the HathiTrust API, but it is really slow. Any hints out there?...
In a fit of creativity, I hacked together some Bash/Python scripts to programmatically download plain (OCR) text as well as PDF files from the HathiTrust. Here is a synopsis of how to use them:
Given an access key, secret token, and a HathiTrust identifier,
output plain text as well as PDF versions of a book.
$ ./bin/htid2txt.sh <token> <key> <identifier>
$ ./bin/htid2pdf.sh <token> <key> <identifier> <length>
$ ./bin/htid2books.sh <token> <key> <identifier>
$ ./bin/collection2books.sh <token> <key> <tsv>
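For the curious, the batch step can be sketched in Python along the following lines. This is only an illustration, not the code in the repository. It assumes the TSV file has a header row with a column named htid (an assumption about the file layout), and the names read_identifiers, build_command, and harvest are hypothetical helpers invented for this sketch. By default it only prints the commands it would run.

```python
import csv
import subprocess


def read_identifiers(tsv_file, column="htid"):
    """Yield non-empty identifiers from a tab-delimited file with a header row.

    The column name "htid" is an assumption about the TSV layout.
    """
    reader = csv.DictReader(tsv_file, delimiter="\t")
    for row in reader:
        identifier = (row.get(column) or "").strip()
        if identifier:
            yield identifier


def build_command(token, key, identifier, script="./bin/htid2books.sh"):
    """Return the argument list for one book, in the order shown in the synopsis."""
    return [script, token, key, identifier]


def harvest(tsv_path, token, key, dry_run=True):
    """Process each identifier in turn; with dry_run=True, only print the commands."""
    with open(tsv_path, newline="", encoding="utf-8") as handle:
        for identifier in read_identifiers(handle):
            command = build_command(token, key, identifier)
            if dry_run:
                print(" ".join(command))
            else:
                # check=False so one failed download does not stop the batch
                subprocess.run(command, check=False)
```

Calling harvest("collection.tsv", token, key) lists the commands without contacting the HathiTrust; passing dry_run=False actually runs the script once per identifier, one book at a time.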
The process is not fast but very functional. For more detail, see the GitHub repository --> https://github.com/ericleasemorgan/htid2books
We now return you to the regularly scheduled programming.
--
Eric Morgan
University of Notre Dame