I've finally figured out how to get raw OCR text out of the HathiTrust Data API, but it is really slow. Does anybody out there have any hints?

To use the HathiTrust Data API, a person first needs to obtain a couple of access tokens. Applications then use the tokens to authenticate their requests. Once that is done, a simple URL can be sent and cool stuff will be returned. For example, the following URL returns the OCR for the first page of a volume:

  https://babel.hathitrust.org/cgi/htd/volume/pageocr/uva.x000274833/1?v=2

By incrementing the page number at the end of the URL, the other pages can be retrieved:

  https://babel.hathitrust.org/cgi/htd/volume/pageocr/uva.x000274833/2?v=2
  https://babel.hathitrust.org/cgi/htd/volume/pageocr/uva.x000274833/3?v=2
  https://babel.hathitrust.org/cgi/htd/volume/pageocr/uva.x000274833/4?v=2

By incrementing the URL until an error is returned, one can get the whole of the document; I don't think there is a way to get the whole document in one go.
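The increment-until-error loop can be sketched in Python. This is a minimal illustration, not the attached script: authentication is omitted (the real API requires requests signed with the access tokens), so the `fetch` callable is a hypothetical stand-in for whatever signed-request wrapper is actually used. The volume ID is the one from the URLs above.

```python
from urllib.parse import quote

API_ROOT = "https://babel.hathitrust.org/cgi/htd/volume"

def page_ocr_url(volume_id, page, version=2):
    """Build the Data API URL for one page of OCR text."""
    return f"{API_ROOT}/pageocr/{quote(volume_id, safe='')}/{page}?v={version}"

def harvest_ocr(volume_id, fetch):
    """Request pages 1, 2, 3, ... until `fetch` signals an error.

    `fetch` is any callable that takes a URL and returns the page's
    text, or raises on an HTTP error -- e.g. a wrapper around an
    authenticated session; that part is assumed, not shown.
    """
    pages = []
    page = 1
    while True:
        try:
            pages.append(fetch(page_ocr_url(volume_id, page)))
        except Exception:
            break  # the first error marks the end of the volume
        page += 1
    return pages
```

Because each request waits for the previous one to finish, the total time grows linearly with the number of pages, which is where the slowness comes from.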

Similarly, a person can get the page images:

  https://babel.hathitrust.org/cgi/htd/volume/pageimage/uva.x000274833/1?v=2
  https://babel.hathitrust.org/cgi/htd/volume/pageimage/uva.x000274833/2?v=2
  https://babel.hathitrust.org/cgi/htd/volume/pageimage/uva.x000274833/3?v=2
  https://babel.hathitrust.org/cgi/htd/volume/pageimage/uva.x000274833/4?v=2

Again, by incrementing the URL until an error is returned, all of the images can be downloaded, and from them a PDF file could be created.

By combining the traditional reading of a book (the PDF) with the text mining of the OCR, very interesting things can take place; a more thorough understanding of the work could be obtained.

Unfortunately, continually requesting individual pages seems laborious, not to mention  s l o w .  It takes tens of minutes to do the good work.

Attached is the code I use to do the work. Can you suggest ways it could be sped up? Am I missing something when it comes to the API? Maybe things would be faster if I did the work in a HathiTrust Research Center "capsule"?

--
Eric Morgan