Shaun,
Thanks, I’m psyched to be at OSU!
I think you’ve nailed down the process here, and there are a couple of concepts that I wanted to follow up on:
1. “Download document from search results list”: This would be a simple enhancement to the rendering of each search result, exposing the download link. The software has access to all of the necessary values (document ID, and how to generate a “downloads” link for it) at render time, so adding a new link should be trivial. It seems like a good enhancement to me.
2. “Direct-to-page link”: Generating a link to guide a PDF reader to a specific page [1] seems easy, although I’m not sure that every reader would handle it the same way. The missing piece is being able to associate a Solr hit with the page of the PDF it was found on. So I think you’re right about needing to index each page individually in order to render a link to the specific page for each search result hit.
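
To make both link ideas concrete, here is a minimal Python sketch (not the OregonDigital code, which is a Rails app). The URL shape is copied from the example link [1]; the function name is hypothetical, and whether a given PDF reader honors the "page" parameter is exactly the open question noted above.

```python
from urllib.parse import quote, urlencode

# Base URL taken from the example link [1] in this message.
BASE_URL = "http://oregondigital.org"


def download_link(document_id, page=None):
    """Build a "downloads" URL for a document (hypothetical helper).

    When a page number is given it is appended as ?page=N, mirroring
    the example link; reader support for that parameter is an assumption.
    """
    # Keep the colon in IDs such as "oregondigital:df66z508t" unescaped.
    url = f"{BASE_URL}/downloads/{quote(document_id, safe=':')}"
    if page is not None:
        url += "?" + urlencode({"page": page})
    return url


if __name__ == "__main__":
    print(download_link("oregondigital:df66z508t"))
    print(download_link("oregondigital:df66z508t", page=3))
```

The first call gives the plain download link for a search result; the second gives the direct-to-page form.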
I can’t speak to the history behind implementing the search the way it is right now. But both of these concepts do seem like great additions to the next installment of OregonDigital!
[1] http://oregondigital.org/downloads/oregondigital:df66z508t?page=3
———————
Josh Gum
Oregon State University Libraries and Press
On 2/29/16, 4:13 PM, "Code for Libraries on behalf of Shaun D. Ellis" <[log in to unmask] on behalf of [log in to unmask]> wrote:
>Josh,
>Congrats on the new gig, and thank you for this explanation of OregonDigital’s BookReader integration. I’m sorry I wasn’t more specific about this, but I think the original question had less to do with the BookReader integration, and more to do with a non-frameworky explanation of configuring Solr to return direct links to pages where the keywords appear in a “compound” object, such as a book.
>
>As the original poster (Laura Buchholz) mentioned, it seems like OregonDigital does not provide direct links until after the BookReader is loaded. It’s only then that pins are placed on the “slider nav” to indicate where the keyword appears. So, to answer the original question, it seems like all the full text may be dumped into a single Solr field that returns the object in the initial search result, and then, upon loading, the BookReader makes a subsequent query (limited to that one object) to retrieve the “data payload” in your example and locate the exact pages where the terms appear? Is that what’s going on there?
>
>I suppose if you wanted to return all the page numbers in the original search query, you may have to send each page individually to Solr to be indexed, and if you have a viewer with conventions for "deep linking" (like the BookReader has) you could generate the link for each page and index it to provide this functionality.
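
A rough Python sketch of that per-page indexing idea, for illustration only. The field names (book_id, page_number, text, link) and the link template are hypothetical, not OregonDigital's schema. The resulting list could be serialized as JSON and sent to Solr's update handler.

```python
import json


def page_documents(book_id, page_texts, link_template):
    """Build one Solr-style document per page of a compound object.

    All field names here are hypothetical; a real schema would need
    matching field definitions.
    """
    docs = []
    for number, text in enumerate(page_texts, start=1):
        docs.append({
            "id": f"{book_id}_page_{number}",  # unique key per page
            "book_id": book_id,                # lets hits be grouped by book
            "page_number": number,
            "text": text,
            # Deep link generated at index time, as suggested above.
            "link": link_template.format(id=book_id, page=number),
        })
    return docs


if __name__ == "__main__":
    template = "http://oregondigital.org/downloads/{id}?page={page}"
    docs = page_documents(
        "oregondigital:df66z508t",
        ["text of page one", "text of page two"],
        template,
    )
    # JSON body that could be posted to a Solr core's /update handler.
    print(json.dumps(docs, indent=2))
```

The trade-off is many more Solr documents per book, in exchange for each hit carrying its own page number and link.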
>
>I was curious as folks were posting all the inspiring digital collections sites earlier today, so I looked for this pattern but didn’t see it. Most of the apps use the same pattern as OregonDigital (although my testing was not particularly thorough, so let me know if I’m wrong, folks!). On the other hand, you do see the "direct-to-page link" interface with both Amazon and Google Books search, which takes you directly to the page from the initial search results.
>
>So, I’m not sure if this was a conscious design decision on the part of library digital collections creators, if the pattern is followed because it’s considered a “best practice” or a “convention” in our field, or if it was just simpler to implement.
>
>Thanks again for the follow up,
>Shaun
>
>> On Feb 26, 2016, at 2:51 PM, Gum, Josh <[log in to unmask]> wrote:
>>
>> I’m very new (<1 month) to Oregon State University, library technology, and Code4Lib, so please bear with me. Also, I’m going to put a disclaimer out that I may be missing some of the picture here. I’m willing to lend a hand digging into more details if needed, so please feel free to ask.
>>
>> Also, I’m going to split this part of the discussion into a separate thread, so we can address the question regarding the OregonDigital BookReader integration. I’ve done some digging this morning, and spoke with a colleague who took part in some of the text extraction for PDF assets in OregonDigital. I’m hopeful that these details are enough to help connect the dots regarding our integration.
>>
>> ————————————
>> When ingesting a PDF asset [1], we have a shell-based processor [2] which executes “pdftotext” [3] to extract and store the text from a PDF, with bounding boxes around each word in the file.
>>
>> The command executed on the server:
>> pdftotext -enc UTF-8 '#{file_path}' '#{output_file}' -bbox
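
For readers who have not seen the -bbox output: pdftotext writes an XHTML file containing page elements that hold word elements with xMin/yMin/xMax/yMax attributes. Below is a minimal Python sketch of reading that file into words per page. The sample is hand-written and trimmed, not real OregonDigital output, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace used by the XHTML that pdftotext -bbox produces.
XHTML = "{http://www.w3.org/1999/xhtml}"

# Hand-written, trimmed sample in the shape of -bbox output (an assumption
# for illustration; real files also carry a head section and more pages).
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
<body>
<doc>
  <page width="612.000000" height="792.000000">
    <word xMin="72.0" yMin="80.1" xMax="130.5" yMax="92.3">McKenzie</word>
    <word xMin="134.0" yMin="80.1" xMax="180.2" yMax="92.3">Highway</word>
  </page>
  <page width="612.000000" height="792.000000">
    <word xMin="72.0" yMin="100.0" xMax="120.0" yMax="112.0">Historic</word>
  </page>
</doc>
</body>
</html>
"""


def words_by_page(bbox_xml):
    """Return {page_number: [word dicts]} with pages numbered from 1."""
    root = ET.fromstring(bbox_xml)
    pages = {}
    for number, page in enumerate(root.iter(XHTML + "page"), start=1):
        pages[number] = [
            {
                "text": word.text or "",
                "x_min": float(word.get("xMin")),
                "y_min": float(word.get("yMin")),
                "x_max": float(word.get("xMax")),
                "y_max": float(word.get("yMax")),
            }
            for word in page.iter(XHTML + "word")
        ]
    return pages


if __name__ == "__main__":
    for number, words in words_by_page(SAMPLE).items():
        print(number, [w["text"] for w in words])
```

Because every word sits inside a page element, the page number for any hit comes for free from this file.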
>>
>> The web UI for viewing a PDF and highlighting results is tied to BookReader [4], which has a great amount of functionality and is well documented online! [5]
>>
>> The BookReader is making calls to a “full_text” action on the document_controller to find the location of the search terms. [6] This JSONP call to our web server uses OregonDigital::OCR::BookreaderSearchGenerator [7] to supply the properly formatted page and bounding box results to BookReader to use in updating its UI with the appropriate highlights and place marker icons. If you use something like the Chrome DevTools while searching for a term on the BookReader UI, you can see the data payload that is returned from the server. For instance, here’s a snippet of one search I did:
>>
>>
>> (apologies if the tabs don’t remain in the email)
>> matches: [
>>   {
>>     par: [
>>       {
>>         page: 2,
>>         boxes: [
>>           {r: 128.62286274509802, l: 101.30935784313726, b: 27.52538962121212, t: 19.953774090909093, page: 2}
>>           {r: 59.883534313725484, l: 29.41176470588235, b: 242.4078138636364, t: 234.83619833333336, page: 2}
>>           {r: 106.32754411764705, l: 80.37296078431372, b: 546.3512438560606, t: 538.7796283257576, page: 2}
>>         text: "McKenzie Highway {{{Historic}}} District…
>>       }
>>     ]
>>   }
>> ]
>>
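
As a rough illustration of how a payload like this could feed direct-to-page links, here is a Python sketch that collects the page numbers from a matches structure. The sample payload is a hand-written, well-formed approximation of the snippet above (the real response is JSONP and may be shaped differently), and the function name is hypothetical.

```python
def hit_pages(payload):
    """Collect the distinct page numbers mentioned in a search payload.

    Assumes the matches -> par -> page/boxes nesting seen in the snippet
    above; missing keys are tolerated.
    """
    pages = set()
    for match in payload.get("matches", []):
        for par in match.get("par", []):
            if "page" in par:
                pages.add(par["page"])
            # Each box also carries its own page number in the snippet.
            for box in par.get("boxes", []):
                if "page" in box:
                    pages.add(box["page"])
    return sorted(pages)


if __name__ == "__main__":
    # Hand-written sample for illustration, not real server output.
    sample = {
        "matches": [
            {"par": [{"page": 2, "boxes": [
                {"r": 128.6, "l": 101.3, "b": 27.5, "t": 19.9, "page": 2},
            ]}]},
            {"par": [{"page": 5, "boxes": [
                {"r": 59.8, "l": 29.4, "b": 242.4, "t": 234.8, "page": 5},
            ]}]},
        ]
    }
    for page in hit_pages(sample):
        print(f"http://oregondigital.org/downloads/oregondigital:df66z508t?page={page}")
```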
>>
>> [1] https://github.com/OregonDigital/oregondigital/blob/master/app/models/document.rb
>>
>> [2] https://github.com/OregonDigital/oregondigital/blob/d82d944d55dd087d2670b3f065725ef0e5ddc4ce/lib/hydra/derivatives/pdf_text_processor.rb
>> [3] http://www.manpagez.com/man/1/pdftotext/
>> [4] http://github.com/openlibrary/bookreader/
>> [5] https://openlibrary.org/dev/docs/bookreader
>> [6] https://github.com/OregonDigital/oregondigital/blob/master/app/controllers/document_controller.rb
>> [7] https://github.com/OregonDigital/oregondigital/blob/master/lib/oregon_digital/ocr/bookreader_search_generator.rb
>> ———————————
>>
>> Josh Gum
>> Oregon State University Libraries and Press
>>
>> On 2/26/16, 7:07 AM, "Code for Libraries on behalf of Shaun D. Ellis" <[log in to unmask] on behalf of [log in to unmask]> wrote:
>>
>>> … //SNIPPED
>>> I have to admit that I was disappointed that the recent question about full-text searching basics (behind OregonDigital’s in-page highlighting of keywords in the IA Bookreader) went basically unanswered. This was a well-articulated legitimate question, and at least a few people on this list should be able to answer it. It’s actually on my list to try to do it so that I can report back, but maybe someone could save me the trouble and quench our curiosity?
>>>
>>> Cheers,
>>> Shaun