LISTSERV 16.5 - CODE4LIB Archives

On Feb 10, 2019, at 8:50 PM, Eric Lease Morgan wrote:

> I've finally figured out how to get raw OCR text out of the HathiTrust API, but it is really slow. Any hints out there?...


In a fit of creativity, I hacked together some Bash/Python scripts to programmatically download plain (OCR) text as well as PDF files from the HathiTrust. Here is a synopsis of how to use them:

  Given an access key, secret token, and a HathiTrust identifier,
  output plain text as well as PDF versions of a book.

  $ ./bin/htid2txt.sh <token> <key> <identifier>
  $ ./bin/htid2pdf.sh <token> <key> <identifier> <length>
  $ ./bin/htid2books.sh <token> <key> <identifier>
  $ ./bin/collection2books.sh <token> <key> <tsv>
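
For readers who want to drive the per-identifier script from their own code, here is a minimal sketch in Python. It is not taken from the repository. It assumes the TSV has no header row and holds the HathiTrust identifier in its first column, and it passes arguments in the order printed in the synopsis above (token, key, identifier). The function names read_identifiers, build_commands, and run_all are hypothetical.

```python
import csv
import io
import subprocess


def read_identifiers(tsv_text, column=0):
    """Return the non-empty values found in one column of a TSV.

    Assumption: no header row, identifiers in the first column.
    """
    rows = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    return [
        row[column].strip()
        for row in rows
        if len(row) > column and row[column].strip()
    ]


def build_commands(token, key, identifiers, script="./bin/htid2books.sh"):
    """Build one argument list per identifier, in the order shown in the synopsis."""
    return [[script, token, key, htid] for htid in identifiers]


def run_all(commands, dry_run=True):
    """Print each command, or run it when dry_run is False."""
    for command in commands:
        if dry_run:
            print(" ".join(command))
        else:
            # check=False: one failed download should not stop the whole batch
            subprocess.run(command, check=False)


if __name__ == "__main__":
    # Placeholder identifiers and credentials, for illustration only.
    sample = "example.0001\tFirst title\nexample.0002\tSecond title\n"
    run_all(build_commands("TOKEN", "KEY", read_identifiers(sample)))
```

Running the file as-is only prints the commands it would execute. Passing dry_run=False to run_all invokes the shell script once per identifier, one after another, so a batch takes as long as the sum of its downloads.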

The process is not fast, but it is very functional. For more detail, see the GitHub repository --> https://github.com/ericleasemorgan/htid2books We now return you to the regularly scheduled programming.

--
Eric Morgan
University of Notre Dame