Haven't used the Hathi API before. Is multithreading possible, or do technical/policy constraints make that approach a nonstarter or otherwise not worth pursuing? Peeking at the documentation, I noticed an htd:numpages element. If that is usable, it would remove the need to rely on errors to detect the end of the document.

kyle

On Sun, Feb 10, 2019 at 5:51 PM Eric Lease Morgan <[log in to unmask]> wrote:
>
> I've finally figured out how to get raw OCR text out of the HathiTrust
> API, but it is really slow. Any hints out there?
>
> To use the HathiTrust Data API, a person first needs to get a couple of
> access tokens. Applications then use the tokens to authenticate.
> Once this is done, a simple URL can be sent and cool stuff will be
> returned. For example, the following URL will return the first page of OCR:
>
> https://babel.hathitrust.org/cgi/htd/volume/pageocr/uva.x000274833/1?v=2
>
> By continually incrementing the URL, other pages can be gotten:
>
> https://babel.hathitrust.org/cgi/htd/volume/pageocr/uva.x000274833/2?v=2
> https://babel.hathitrust.org/cgi/htd/volume/pageocr/uva.x000274833/3?v=2
> https://babel.hathitrust.org/cgi/htd/volume/pageocr/uva.x000274833/4?v=2
>
> By incrementing the URL until an error is returned, one can get the whole
> of the document. I don't think there is a way to get the whole of the
> document in one go.
>
> Similarly, a person can get page images:
>
> https://babel.hathitrust.org/cgi/htd/volume/pageimage/uva.x000274833/1?v=2
> https://babel.hathitrust.org/cgi/htd/volume/pageimage/uva.x000274833/2?v=2
> https://babel.hathitrust.org/cgi/htd/volume/pageimage/uva.x000274833/3?v=2
> https://babel.hathitrust.org/cgi/htd/volume/pageimage/uva.x000274833/4?v=2
>
> Again, by incrementing the URL until an error is returned, all the images
> can be downloaded, and a PDF file could be created.
>
> By combining the traditional reading of a book (PDF) with the text mining
> of the OCR, very interesting things can take place.
> Thorough understanding could be obtained.
>
> Unfortunately, continually requesting individual pages seems laborious,
> not to mention s l o w. It takes tens of minutes to do the good work.
>
> Attached is the code I use to do the work. Can you suggest ways things
> could be sped up? Am I missing something when it comes to the API? Maybe
> if I did the work in a HathiTrust Research Center "capsule", things would
> be faster?
>
> --
> Eric Morgan
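To make the multithreading idea concrete, here is a rough Python sketch. Caveats: it is not tested against the live API; the real Data API requires OAuth-signed requests, which are omitted here; and the page count would presumably come from the htd:numpages element in a separate metadata call, so it is passed in as a plain argument. Function names and the worker count are my own invention, not part of the API.

```python
# Sketch: fetch all pages of a HathiTrust volume concurrently instead of
# one at a time. OAuth request signing (required by the real API) is
# deliberately omitted; numpages is assumed to come from volume metadata.
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

API = "https://babel.hathitrust.org/cgi/htd/volume/pageocr/{id}/{seq}?v=2"

def page_url(volume_id, seq):
    """Build the URL for one page of OCR text."""
    return API.format(id=volume_id, seq=seq)

def fetch_page(volume_id, seq):
    """Fetch one page of OCR; a real client must sign this request."""
    with urlopen(page_url(volume_id, seq)) as response:
        return response.read().decode("utf-8")

def fetch_volume(volume_id, numpages, workers=8):
    """Fetch pages 1..numpages with a thread pool, preserving order."""
    seqs = range(1, numpages + 1)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda s: fetch_page(volume_id, s), seqs))
```

Because each request spends most of its time waiting on the network, even a small thread pool should cut the wall-clock time substantially, assuming HathiTrust's terms permit a handful of concurrent connections.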