Haven't used the Hathi API before. Is multithreading possible, or do
technical or policy constraints make that approach a nonstarter?
Peeking at the documentation, I noticed an htd:numpages element. If that
is usable, it would remove the need to rely on errors to detect the end
of the document.
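Something along these lines might cut the wall-clock time considerably,
assuming the service and its terms of use tolerate a few concurrent
connections (a sketch only; NUM_PAGES stands in for whatever
htd:numpages reports, and the OAuth setup mirrors the one in your
script):

  from concurrent.futures import ThreadPoolExecutor
  import requests
  from requests_oauthlib import OAuth1

  AUTH      = OAuth1('my-access-key', 'my-secret-key', signature_type='query')
  VOLUME_ID = 'uva.x000274833'
  NUM_PAGES = 252  # placeholder; read from htd:numpages if that element is usable
  URL       = 'https://babel.hathitrust.org/cgi/htd/volume/pageocr/%s/%d?v=2'

  # fetch the OCR of a single page
  def fetch(page):
      return requests.get(URL % (VOLUME_ID, page), auth=AUTH).text

  # fetch every page with a small pool of threads; map preserves page order
  with ThreadPoolExecutor(max_workers=4) as pool:
      pages = list(pool.map(fetch, range(1, NUM_PAGES + 1)))
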
kyle
On Sun, Feb 10, 2019 at 5:51 PM Eric Lease Morgan <[log in to unmask]> wrote:
>
> I've finally figured out how to get raw OCR text out of the HathiTrust
> API, but it is really slow. Any hints out there?
>
> To use the HathiTrust Data API a person needs to first get a couple of
> access tokens. Applications then need to use the tokens to authenticate.
> Once this is done, a simple URL can be sent and cool stuff will be
> returned. For example, the following URL will return the first page of OCR:
>
> https://babel.hathitrust.org/cgi/htd/volume/pageocr/uva.x000274833/1?v=2
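>
> A minimal sketch of such a request in Python might look like this,
> assuming the two tokens work as an OAuth 1.0a key and secret (requests
> plus requests_oauthlib; the key and secret below are placeholders):
>
>   import requests
>   from requests_oauthlib import OAuth1
>
>   # sign each request with the HathiTrust access key and secret; the
>   # API may want the signature in the query string rather than a header
>   auth = OAuth1('my-access-key', 'my-secret-key', signature_type='query')
>
>   # request the OCR of the first page of a volume
>   url = 'https://babel.hathitrust.org/cgi/htd/volume/pageocr/uva.x000274833/1?v=2'
>   response = requests.get(url, auth=auth)
>   print(response.text)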
>
> By incrementing the page number in the URL, other pages can be retrieved:
>
> https://babel.hathitrust.org/cgi/htd/volume/pageocr/uva.x000274833/2?v=2
> https://babel.hathitrust.org/cgi/htd/volume/pageocr/uva.x000274833/3?v=2
> https://babel.hathitrust.org/cgi/htd/volume/pageocr/uva.x000274833/4?v=2
>
> By incrementing the URL until an error is returned, one can get the
> whole of the document; I don't think there is a way to get it all in
> one go.
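>
> In Python, the loop might look something like this (a sketch reusing
> the auth object from above; it stops at the first non-200 response):
>
>   # request pages one by one until the API signals the end of the volume
>   url   = 'https://babel.hathitrust.org/cgi/htd/volume/pageocr/uva.x000274833/%d?v=2'
>   pages = []
>   page  = 1
>   while True:
>       response = requests.get(url % page, auth=auth)
>       if response.status_code != 200: break  # past the last page
>       pages.append(response.text)
>       page += 1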
>
> Similarly, a person can get page images:
>
> https://babel.hathitrust.org/cgi/htd/volume/pageimage/uva.x000274833/1?v=2
> https://babel.hathitrust.org/cgi/htd/volume/pageimage/uva.x000274833/2?v=2
> https://babel.hathitrust.org/cgi/htd/volume/pageimage/uva.x000274833/3?v=2
> https://babel.hathitrust.org/cgi/htd/volume/pageimage/uva.x000274833/4?v=2
>
> Again, by incrementing the URL until an error is returned, all the images
> can be downloaded, and a PDF file could be created.
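>
> A sketch of that last step, assuming the pageimage call returns a
> format Pillow can read, and letting Pillow's PDF writer do the
> stitching:
>
>   import io
>   from PIL import Image
>
>   # download each page image until an error is returned, as above
>   url    = 'https://babel.hathitrust.org/cgi/htd/volume/pageimage/uva.x000274833/%d?v=2'
>   images = []
>   page   = 1
>   while True:
>       response = requests.get(url % page, auth=auth)
>       if response.status_code != 200: break
>       images.append(Image.open(io.BytesIO(response.content)).convert('RGB'))
>       page += 1
>
>   # write the whole set of images to a single PDF file
>   images[0].save('book.pdf', save_all=True, append_images=images[1:])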
>
> By combining the traditional reading of a book (the PDF) with the text
> mining of the OCR, very interesting things can take place; a more
> thorough understanding of the work could be obtained.
>
> Unfortunately, continually requesting individual pages seems laborious,
> not to mention s l o w . It takes tens of minutes to do the good work.
>
> Attached is the code I use to do the work. Can you suggest ways things
> could be sped up? Am I missing something when it comes to the API? Would
> things be faster if I did the work in a HathiTrust Research Center
> "capsule"?
>
> --
> Eric Morgan
>