LISTSERV 16.5 - CODE4LIB Archives

I’m very new (<1 month) to Oregon State University, library technology, and Code4Lib, so please bear with me. I should also say up front that I may be missing some of the picture here. I’m willing to lend a hand digging into more details if needed, so please feel free to ask.

I’m also going to split this part of the discussion into a separate thread so we can address the question about the OregonDigital BookReader integration. I did some digging this morning and spoke to a colleague who took part in some of the text extraction for PDF assets in OregonDigital. I’m hopeful that these details are enough to help connect the dots regarding our integration.

————————————
When ingesting a PDF asset [1], we have a shell-based processor [2] which executes “pdftotext” [3] to extract and store the text from the PDF, with a bounding box around each word in the file.

The command executed on the server:
pdftotext -enc UTF-8 '#{file_path}' '#{output_file}' -bbox
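
For anyone who has not looked at the -bbox output before: pdftotext writes an XHTML file in which each page is a <page> element and each word is a <word> element with xMin, yMin, xMax and yMax attributes. Below is a minimal Python sketch of reading that file back into word records. This is not the OregonDigital code (that is the Ruby processor in [2]); the function names, the sample markup and its coordinates, and the 1-based page count are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hand-written sample in the shape pdftotext -bbox produces (made-up values).
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>sample</title></head>
<body>
<doc>
  <page width="612.000000" height="792.000000">
    <word xMin="72.000000" yMin="90.500000" xMax="130.250000" yMax="102.500000">McKenzie</word>
    <word xMin="134.000000" yMin="90.500000" xMax="182.750000" yMax="102.500000">Highway</word>
  </page>
  <page width="612.000000" height="792.000000">
    <word xMin="72.000000" yMin="60.000000" xMax="118.000000" yMax="72.000000">Historic</word>
  </page>
</doc>
</body>
</html>"""


def local_name(tag):
    """Drop the '{namespace}' prefix ElementTree puts on element names."""
    return tag.rsplit("}", 1)[-1]


def parse_bbox_words(xhtml):
    """Return one dict per word: page number (1-based), text, and box edges."""
    words = []
    page_number = 0
    # iter() walks in document order, so each <page> is seen before its words.
    for element in ET.fromstring(xhtml).iter():
        name = local_name(element.tag)
        if name == "page":
            page_number += 1
        elif name == "word":
            words.append({
                "page": page_number,
                "text": (element.text or "").strip(),
                "l": float(element.get("xMin")),
                "t": float(element.get("yMin")),
                "r": float(element.get("xMax")),
                "b": float(element.get("yMax")),
            })
    return words


words = parse_bbox_words(SAMPLE)
print(words[2])
```

For a real file you would read the output file from disk instead of using the embedded sample string.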

The web UI for viewing a PDF and highlighting results is tied to BookReader [4], which has a great amount of functionality and is well documented online! [5]

BookReader calls a “full_text” action on the document_controller to find the locations of the search terms. [6] This JSONP call to our web server uses OregonDigital::OCR::BookreaderSearchGenerator [7] to supply properly formatted page and bounding box results, which BookReader uses to update its UI with the appropriate highlights and place marker icons. If you open something like Chrome DevTools while searching for a term in the BookReader UI, you can see the data payload that is returned from the server. For instance, here’s a snippet of one search I did:


(apologies if the tabs don’t remain in the email)
matches: [
	{
		par: [
			{
				page: 2, 
				boxes: [
					{r: 128.62286274509802, l: 101.30935784313726, b: 27.52538962121212, t: 19.953774090909093, page: 2},
					{r: 59.883534313725484, l: 29.41176470588235, b: 242.4078138636364, t: 234.83619833333336, page: 2},
					{r: 106.32754411764705, l: 80.37296078431372, b: 546.3512438560606, t: 538.7796283257576, page: 2}
				],
				text: "McKenzie Highway {{{Historic}}} District…
			}
		]
	}
]
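
To make the shape of that payload a bit more concrete, here is a rough Python sketch of turning per-word records (text, page, and box edges, the kind of data the -bbox extraction yields) into a "matches" structure laid out like the snippet above, and then wrapping it for JSONP. This is only an illustration and not how BookreaderSearchGenerator [7] is actually written: the function names, the sample words and coordinates, the three-word context window, and the callback name are all made up, and the coordinates are passed through without any scaling the real code may apply.

```python
import json

# Made-up word records; l/t/r/b are the left, top, right and bottom box edges.
WORDS = [
    {"page": 2, "text": "McKenzie", "l": 29.4, "t": 19.9, "r": 59.9, "b": 27.5},
    {"page": 2, "text": "Highway", "l": 62.0, "t": 19.9, "r": 98.0, "b": 27.5},
    {"page": 2, "text": "Historic", "l": 101.3, "t": 19.9, "r": 128.6, "b": 27.5},
    {"page": 2, "text": "District", "l": 131.0, "t": 19.9, "r": 160.0, "b": 27.5},
]


def build_matches(words, term):
    """Build a payload with one match per word equal to `term` (case-insensitive)."""
    matches = []
    for index, word in enumerate(words):
        if word["text"].strip(".,;:!?\"'()").lower() != term.lower():
            continue
        # Context: up to three words either side, restricted to the same page.
        window = [w for w in words[max(0, index - 3):index + 4]
                  if w["page"] == word["page"]]
        # Mark the hit with triple braces, as in the payload shown above.
        text = " ".join(
            "{{{" + w["text"] + "}}}" if w is word else w["text"]
            for w in window
        )
        box = {key: word[key] for key in ("r", "l", "b", "t", "page")}
        matches.append({
            "par": [{"page": word["page"], "boxes": [box], "text": text}],
        })
    return {"matches": matches}


def to_jsonp(payload, callback):
    """Wrap the JSON body in a call to the client-supplied callback name."""
    return callback + "(" + json.dumps(payload) + ")"


payload = build_matches(WORDS, "historic")
print(to_jsonp(payload, "handleSearch"))  # "handleSearch" is a hypothetical name
```

The JSONP wrapper just turns the response into a script the browser can load from another origin, with the payload passed as the argument to the callback.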


[1] https://github.com/OregonDigital/oregondigital/blob/master/app/models/document.rb

[2] https://github.com/OregonDigital/oregondigital/blob/d82d944d55dd087d2670b3f065725ef0e5ddc4ce/lib/hydra/derivatives/pdf_text_processor.rb
[3] http://www.manpagez.com/man/1/pdftotext/
[4] http://github.com/openlibrary/bookreader/
[5] https://openlibrary.org/dev/docs/bookreader
[6] https://github.com/OregonDigital/oregondigital/blob/master/app/controllers/document_controller.rb
[7] https://github.com/OregonDigital/oregondigital/blob/master/lib/oregon_digital/ocr/bookreader_search_generator.rb
———————————

Josh Gum
Oregon State University Libraries and Press






On 2/26/16, 7:07 AM, "Code for Libraries on behalf of Shaun D. Ellis" wrote:

> … //SNIPPED
>I have to admit that I was disappointed that the recent question about full-text searching basics (behind OregonDigital’s in-page highlighting of keywords in the IA Bookreader) went basically unanswered.  This was a well-articulated legitimate question, and at least a few people on this list should be able to answer it. It’s actually on my list to try to do it so that I can report back, but maybe someone could save me the trouble and quench our curiosity?
>
>Cheers,
>Shaun