On Feb 10, 2019, at 9:38 PM, Conal Tuohy <[log in to unmask]> wrote:

> You may have tried this already, but it seems that Hathi also offer PDF-
> and EBM-formatted data at the volume level. Do those formats include the
> OCR text? I have seen this done in PDF before (and I've done it myself):
> the files contain bitmap page images but the OCR text is also there, in a
> layer beneath the images.

Alas, programmatically downloading a PDF file with embedded OCR is not an option. The documentation [1] says a format called ebm is possible, while pdf & epub are slated for the future. After programmatically authenticating and submitting the following RESTful URLs, I get either error 400 ("invalid or missing format parameter value format=pdf") or error 403 ("insufficient privilege"):

  https://babel.hathitrust.org/cgi/htd/volume/uva.x000274833?format=pdf
  https://babel.hathitrust.org/cgi/htd/volume/uva.x000274833?format=ebm

Apparently, PDF is not supported (yet), and ebm is restricted to the Espressnet Project.
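
For what it's worth, below is a minimal sketch of the sort of request I'm making, assuming the two-legged OAuth signing and the v=2 version parameter as I read them in the documentation [1]; the key & secret are placeholders for credentials issued by the 'Trust.

  # a minimal sketch of the request above; HT_KEY and HT_SECRET are
  # placeholders for the access key & secret issued by HathiTrust
  import requests
  from requests_oauthlib import OAuth1

  HT_KEY    = 'my-access-key'
  HT_SECRET = 'my-secret-key'

  # two-legged OAuth; the signature is carried in the query string
  auth = OAuth1(HT_KEY, client_secret=HT_SECRET, signature_type='QUERY')
  url  = 'https://babel.hathitrust.org/cgi/htd/volume/uva.x000274833'

  # format=pdf returns 400, and format=ebm returns 403
  response = requests.get(url, params={'v': '2', 'format': 'ebm'}, auth=auth)
  print(response.status_code, response.reason)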


On Feb 11, 2019, at 9:52 AM, Angelina Zaytsev <[log in to unmask]> wrote:

> As a University of Notre Dame librarian, you should be able to log into HathiTrust by clicking the login button, selecting University of Notre Dame from the dropdown menu, and then logging in with your Notre Dame username and password. Once you do so, you'll be able to download the full pdf through the user interface (the full pdf download option is not available to non-logged in, non-member users) at https://hdl.handle.net/2027/uva.x000274833 . That option should be easier for retrieving the full pdf than using the Data API. 
> 
> If you just need the plain text OCR for a few books, you may want to download the pdfs from the user interface and use the "Export to" feature in Adobe to save the OCR that's embedded within the pdf as a txt file. 
> 
> If you need the OCR for a larger number of volumes, then you may want to consider requesting a dataset (see https://www.hathitrust.org/datasets ) or using the HathiTrust Research Center services (see https://analytics.hathitrust.org/ ). The datasets are more appropriate for thousands of volumes.

The process outlined above is feasible for the analysis of a few documents, no more than a dozen and probably fewer. The process -- while functional -- is not feasible for the person who wants to study the complete works of Author X, everything written in English during a particular decade, or any number of other subsets of the 'Trust. Nor is the English major, historian, or even the typical librarian going to VPN into a virtual machine, work from the command line, invoke the secure environment, and then write Python scripts to do their good work. The learning curve is too high.


On Feb 11, 2019, at 1:04 PM, Kyle Banerjee <[log in to unmask]> wrote:

> Haven't used the Hathi API before. Is multithreading possible or do
> tech/policy constraints make that approach a nonstarter or otherwise not
> worth pursuing?
> 
> Peeking the documentation, I noticed a htd:numpages element. If that is
> usable, it would prevent the need to rely on errors to detect the document
> end.

Multithreading? Yes, I've given that some thought. Nowadays our computers have multiple cores. My computer at work is nothing very special, and it has 8 cores. My computer at home has 4. When I bought them I had no idea they had more than one.  8-)  In the recent past I learned more about parallel processing, and yes, multithreading is an option: use each core to get a page, and once all of the pages are retrieved, assemble the whole. Both cool & "kewl".
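
A rough sketch of that approach follows, assuming the Data API's per-page OCR resource (htd/volume/pageocr/{id}/{seq}) and the same sort of OAuth signing as before; the endpoint name, the key & secret, and the hard-coded 100 pages are all assumptions on my part.

  # a rough sketch: one worker per core gets a page of OCR, and once all
  # of the pages are retrieved, they are assembled into the whole
  from concurrent.futures import ThreadPoolExecutor
  import requests
  from requests_oauthlib import OAuth1

  auth = OAuth1('my-access-key', client_secret='my-secret-key', signature_type='QUERY')

  def get_page(seq):
      url = 'https://babel.hathitrust.org/cgi/htd/volume/pageocr/uva.x000274833/%d' % seq
      return requests.get(url, params={'v': '2'}, auth=auth).text

  # eight workers, pages 1 through 100; map() preserves page order
  with ThreadPoolExecutor(max_workers=8) as pool:
      pages = pool.map(get_page, range(1, 101))

  text = '\n'.join(pages)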

htd:numpages? I'd like to know more about this. I didn't see that element in the documentation, nor in the various metadata files I've downloaded.


The HathiTrust is such a rich resource, but it is not easy to use at the medium scale. Reading & analyzing a few documents is easy. It is entirely possible to generate PDF files, download them, print them (gasp!), extract their underlying plain (OCR) text, and use both traditional as well as non-traditional methods (text mining) to read their content. At the other end of the scale, I might be able to count & tabulate all of the adjectives used in the 19th Century or see when the phrase "ice cream" first appeared in the lexicon.

On the other hand, I believe more realistic use cases exist: analyzing the complete works of Author X, comparing & contrasting Author X with Author Y, learning how the expression or perception of gender may have changed over time, determining whether or not there are themes associated with specific places, etc.

I imagine the following workflow:

  1. create HathiTrust collection
  2. download collection as CSV file
  3. use something like Excel, a database program, or OpenRefine
     to create subsets of the collection
  4. programmatically download items' content & metadata
  5. update CSV file with downloaded & gleaned information
  6. do analysis against the result
  7. share results & analysis

Creating the collection (#1) is easy. Search the 'Trust, mark items of interest, repeat until done (or tired). 

Downloading (#2) is trivial. Mash the button.

Creating subsets (#3) is easier than one might expect. Yes, there are MANY duplicates in a collection, but OpenRefine is GREAT at normalizing ("clustering") data, and once it is normalized, duplicates can be removed confidently. In the end, a "refined" set of HathiTrust identifiers can be output. 
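
For the programmatically inclined, the same sort of refinement can be sketched in a few lines of Python (pandas), assuming the collection export is tab-delimited and that the identifier & title live in columns named htitem_id and title; check the header of your own download, because those column names are my guesses.

  # a sketch of step #3: normalize titles, drop duplicates, output identifiers
  import pandas as pd

  collection = pd.read_csv('collection.tsv', sep='\t', dtype=str)

  # crude normalization: lower-case the titles and strip punctuation
  collection['key'] = (collection['title']
                       .str.lower()
                       .str.replace(r'[^a-z0-9 ]', '', regex=True)
                       .str.strip())

  # keep the first item for each normalized title, then save the identifiers
  refined = collection.drop_duplicates(subset='key')
  refined['htitem_id'].to_csv('identifiers.txt', index=False, header=False)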

Given a set of identifiers, it ought to be easy to programmatically download (#4) the many flavors of 'Trust items: PDF, OCRed plain text, bibliographic metadata, and the cool JSON files with embedded part-of-speech analysis. This is the part that is giving me the most difficulty: slow download speeds on the order of 1,000 bytes/minute [2]; access control & authentication, which I sincerely understand & appreciate; and multiple data structures. For example, the bibliographic metadata is presented as a stream of JSON, and embedded in it is an escaped XML file, which, in turn, is the manifestation of a MARC bibliographic record. Yikes!
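
To illustrate the layering, here is a sketch of untangling the bibliographic metadata, assuming the Bib API's "full" flavor (https://catalog.hathitrust.org/api/volumes/full/htid/{id}.json) and a "marc-xml" key holding the escaped MARC record; both are taken from my reading of the documentation and may need adjusting.

  # a sketch: read the JSON stream, pull out the embedded MARC-XML,
  # and print each record's title (MARC field 245)
  import requests
  import xml.etree.ElementTree as ET

  url  = 'https://catalog.hathitrust.org/api/volumes/full/htid/uva.x000274833.json'
  data = requests.get(url).json()

  for record in data['records'].values():
      # once the JSON is parsed, the escaped XML is just a string; parse it too
      marc = ET.fromstring(record['marc-xml'])
      for field in marc.iter():
          if field.tag.endswith('datafield') and field.get('tag') == '245':
              print(' '.join((subfield.text or '').strip() for subfield in field))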

After the many flavors are downloaded, more interesting information can be gleaned: sentences, parts-of-speech, named entities, readability scores, sentiment measures, log-likelihood ratios, "topics" & other types of clusters, definitive characteristics of similarly classified documents, etc. In the end the researcher would have created a rich & thorough dataset (#5). This is the sort of work I do on a day-to-day basis. 
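
As a sketch of the gleaning, something like the following works against each downloaded plain-text file; spaCy and its small English model (en_core_web_sm) are my choices here, not the 'Trust's, and any NLP toolkit would do.

  # a sketch: read a downloaded OCR file, then glean sentences,
  # parts-of-speech, and named entities from it
  import spacy

  nlp = spacy.load('en_core_web_sm')

  # very long books may need to be processed in chunks; spaCy's default
  # max_length is 1,000,000 characters
  with open('uva.x000274833.txt') as handle:
      doc = nlp(handle.read())

  sentences = [sentence.text for sentence in doc.sents]
  pos       = [(token.text, token.pos_) for token in doc]
  entities  = [(entity.text, entity.label_) for entity in doc.ents]

  print(len(sentences), 'sentences &', len(entities), 'named entities')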

Through traditional reading as well as through statistics, the researcher can then do #6 against the printed PDF files and dataset. This is where I provide assistance, but I don't do the "real" work; this is primarily the work of discipline-specific researchers. 

Again, the HathiTrust is really cool, but getting content out of it is not easy. Then again, maybe my use case is secondary to the 'Trust's primary purpose. After all, isn't the 'Trust primarily about preservation? "An elephant never forgets."


[1] documentation - https://www.hathitrust.org/documents/hathitrust-data-api-v2_20150526.pdf
[2] At a rate of 1,000 bytes/minute, it would take a few minutes to download this email message.

-- 
Eric Lease Morgan
Digital Initiatives Librarian, Navari Family Center for Digital Scholarship
Hesburgh Libraries

University of Notre Dame
250E Hesburgh Library
Notre Dame, IN 46556
o: 574-631-8604
e: [log in to unmask]
w: cds.library.nd.edu