Hi all,

In response to Laura's comment, I thought I'd share that at UBC we've
included the 'direct-to-page-link' functionality in our Open Collections
search interface
<https://open.library.ubc.ca/search?q=trout&p=0&sort=0&view=1&circle=n&dBegin=&dEnd=&c=2&collection=bcnewspapers>.
It is not loaded by default (you must select the 'detailed view' option, or
click to expand a particular result) because, as Chad mentioned, it has quite
a bit of overhead and in our testing only some (very vocal) users
consistently clicked the links.

We use ElasticSearch, but it works much the same way Josh described: firing
additional queries for each 'compound object' to search the page-level full
text metadata.

-Schuyler

On Wed, Mar 2, 2016 at 10:00 AM, Laura Buchholz <[log in to unmask]> wrote:

> Thanks guys, and thank you Shaun, for following up. This is exactly what I
> was hoping to learn.
>
> I have to admit I'm surprised that the "direct-to-page-link" functionality
> isn't more common in the newer/inspiring digital collections. It exists in
> contentDM (not saying that is a reason it should continue to exist), and
> seems intuitively useful. We're planning on doing some usability testing
> soon, and I'm going to try to get feedback on this feature.
>
> On Tue, Mar 1, 2016 at 7:51 AM, Gum, Josh <[log in to unmask]>
> wrote:
>
> > Shaun,
> >
> > Thanks, I’m psyched to be at OSU!
> >
> > I think you’ve nailed down the process here, and there are a couple
> > concepts that I wanted to follow up on:
> >
> > 1. “Download document from search results list” : This would be a simple
> > enhancement to the rendering of each search result and exposing the
> > download link.. The software has access to all of the necessary values
> > (document ID, and how to generate a “downloads” link for it) at render
> > time, so adding a new link should be trivial.. It seems like it would be
> > a good enhancement to me.
> >
> > 2. “Direct-to-page link” : Generating a link to guide a PDF reader to a
> > specific page [1] seems easy, although I’m not sure that every reader
> > would work the same. So the missing piece is being able to associate a
> > SOLR hit with the page it was found in the PDF.. So, I think you’re right
> > about needing to index each page individually in order to facilitate
> > rendering a link to a specific page related to the search result hit
> > being rendered on the page.
> >
> > I can’t speak to the history behind implementing the search the way it is
> > right now.. But it does seem like both of these concepts would be great
> > additions to the next installment of OregonDigital!
> >
> > [1] http://oregondigital.org/downloads/oregondigital:df66z508t?page=3
> >
> > ———————
> > Josh Gum
> > Oregon State University Libraries and Press
> >
> > On 2/29/16, 4:13 PM, "Code for Libraries on behalf of Shaun D. Ellis" <
> > [log in to unmask] on behalf of [log in to unmask]> wrote:
> >
> > >Josh,
> > >Congrats on the new gig, and thank you for this explanation of
> > >OregonDigital’s BookReader integration. I’m sorry I wasn’t more
> > >specific about this, but I think the original question had less to do
> > >with the BookReader integration, and more to do with a non-frameworky
> > >explanation of configuring Solr to return direct links to pages where
> > >the keywords appear in a “compound” object, such as a book.
> > >
> > >As the original poster (Laura Buchholz) mentioned, it seems like
> > >OregonDigital does not provide direct links until after the BookReader
> > >is loaded. It’s only then that pins are placed on the “slider nav” to
> > >indicate where the keyword appears.
> > >So, to answer the original question, it seems like all the full-text
> > >may be dumped into a single Solr field that returns the object in the
> > >initial search result, and then, upon loading, the BookReader makes a
> > >subsequent query (limited to that one object) to retrieve the “data
> > >payload” in your example to then locate the exact pages where the
> > >terms appear? Is that what’s going on there?
> > >
> > >I suppose if you wanted to return all the page numbers in the original
> > >search query, you may have to send each page individually to Solr to be
> > >indexed, and if you have a viewer with conventions for "deep linking"
> > >(like the BookReader has) you could generate the link for each page and
> > >index it to provide this functionality.
> > >
> > >I was curious as folks were posting all the inspiring digital
> > >collections sites earlier today, so I looked for this pattern but
> > >didn’t see it. Most of the apps use the same pattern as OregonDigital
> > >(although my testing was not particularly thorough, so let me know if
> > >I’m wrong, folks!). On the other hand, you do see the "direct-to-page
> > >link" interface with both Amazon and Google Books search, which takes
> > >you directly to the page from the initial search results.
> > >
> > >So, I’m not sure if this was a conscious design decision on the part of
> > >library digital collections creators, if the pattern is followed
> > >because it’s considered a “best practice” or a “convention” in our
> > >field, or if it was just simpler to implement.
> > >
> > >Thanks again for the follow up,
> > >Shaun
> > >
> > >> On Feb 26, 2016, at 2:51 PM, Gum, Josh <[log in to unmask]> wrote:
> > >>
> > >> I’m very new (<1 month) to Oregon State University, library
> > >> technology, and Code4Lib. So please bear with me. Also, I’m going to
> > >> put a disclaimer out that I may be missing some of the picture here..
> > >> I’m willing to lend a hand digging into more details if needed, so
> > >> please feel free to ask.
> > >>
> > >> Also.. I’m going to split this part of the discussion into a separate
> > >> thread, so we can address the question regarding the OregonDigital
> > >> BookReader integration. I’ve done some digging this morning, and
> > >> spoke to a colleague who took part in some of the text extraction for
> > >> PDF assets in OregonDigital.. I’m hopeful that these details are
> > >> enough to help connect the dots regarding our integration.
> > >>
> > >> ————————————
> > >> When ingesting a PDF asset [1], we have a shell based processor [2]
> > >> which executes “pdftotext” [3] to extract and store the text from a
> > >> pdf with bounding boxes around each word in the file.
> > >>
> > >> The command executed on the server:
> > >> pdftotext -enc UTF-8 '#{file_path}' '#{output_file}' -bbox
> > >>
> > >> The web UI for viewing a PDF and highlighting results is tied to
> > >> BookReader [4], which has a great amount of functionality and is well
> > >> documented online! [5]
> > >>
> > >> The BookReader is making calls to a “full_text” action on the
> > >> document_controller to find the location of the search terms. [6]
> > >> This JSONP call to our web server uses
> > >> OregonDigital::OCR::BookreaderSearchGenerator [7] to supply the
> > >> properly formatted page and bounding box results to BookReader to use
> > >> in updating its UI with the appropriate highlights and place marker
> > >> icons. If you use something like the Chrome DevTools while searching
> > >> for a term on the BookReader UI, you can see the data payload that is
> > >> returned from the server.
> > >> For instance, here’s a snippet of one search I did:
> > >>
> > >> (apologies if the tabs don’t remain in the email)
> > >> matches: [
> > >>   {
> > >>     par: [
> > >>       {
> > >>         page: 2,
> > >>         boxes: [
> > >>           {r: 128.62286274509802, l: 101.30935784313726, b: 27.52538962121212, t: 19.953774090909093, page: 2}
> > >>           {r: 59.883534313725484, l: 29.41176470588235, b: 242.4078138636364, t: 234.83619833333336, page: 2}
> > >>           {r: 106.32754411764705, l: 80.37296078431372, b: 546.3512438560606, t: 538.7796283257576, page: 2}
> > >>         text: "McKenzie Highway {{{Historic}}} District…
> > >>       }
> > >>     ]
> > >>   }
> > >> ]
> > >>
> > >> [1] https://github.com/OregonDigital/oregondigital/blob/master/app/models/document.rb
> > >> [2] https://github.com/OregonDigital/oregondigital/blob/d82d944d55dd087d2670b3f065725ef0e5ddc4ce/lib/hydra/derivatives/pdf_text_processor.rb
> > >> [3] http://www.manpagez.com/man/1/pdftotext/
> > >> [4] http://github.com/openlibrary/bookreader/
> > >> [5] https://openlibrary.org/dev/docs/bookreader
> > >> [6] https://github.com/OregonDigital/oregondigital/blob/master/app/controllers/document_controller.rb
> > >> [7] https://github.com/OregonDigital/oregondigital/blob/master/lib/oregon_digital/ocr/bookreader_search_generator.rb
> > >> ———————————
> > >>
> > >> Josh Gum
> > >> Oregon State University Libraries and Press
> > >>
> > >> On 2/26/16, 7:07 AM, "Code for Libraries on behalf of Shaun D. Ellis" <
> > >> [log in to unmask] on behalf of [log in to unmask]> wrote:
> > >>
> > >>> … //SNIPPED
> > >>> I have to admit that I was disappointed that the recent question
> > >>> about full-text searching basics (behind OregonDigital’s in-page
> > >>> highlighting of keywords in the IA Bookreader) went basically
> > >>> unanswered. This was a well-articulated legitimate question, and at
> > >>> least a few people on this list should be able to answer it.
> > >>> It’s actually on my list to try to do it so that I can report back,
> > >>> but maybe someone could save me the trouble and quench our
> > >>> curiosity?
> > >>>
> > >>> Cheers,
> > >>> Shaun
> >
>
> --
> Laura Buchholz
> Digital Projects Librarian
> Reed College Library
> 503-517-7629
> [log in to unmask]
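
For anyone skimming the thread, here is a minimal sketch of the "index each page individually" idea discussed above. It is not OregonDigital's or UBC's code. The PageDoc class, the build_page_docs and pages_matching helpers, and the sample page text are all hypothetical, the URL template is only modeled on the ?page=3 link Josh cited, and a plain substring match stands in for the real Solr or Elasticsearch query.

```python
from dataclasses import dataclass
from typing import List

# Assumption: deep links follow the "?page=N" pattern from the link Josh cited.
DEFAULT_TEMPLATE = "http://oregondigital.org/downloads/{id}?page={page}"


@dataclass
class PageDoc:
    """One index document per page of a compound object (hypothetical schema)."""
    id: str
    parent_id: str
    page: int
    full_text: str
    page_url: str


def build_page_docs(parent_id: str, pages: List[str],
                    url_template: str = DEFAULT_TEMPLATE) -> List[PageDoc]:
    """Turn a list of per-page text into one document per page, each with a deep link."""
    docs = []
    for number, text in enumerate(pages, start=1):  # pages are numbered from 1
        docs.append(PageDoc(
            id=f"{parent_id}/page/{number}",
            parent_id=parent_id,
            page=number,
            full_text=text,
            page_url=url_template.format(id=parent_id, page=number),
        ))
    return docs


def pages_matching(docs: List[PageDoc], term: str) -> List[str]:
    """Stand-in for the search engine: return deep links for pages containing the term."""
    needle = term.lower()
    return [d.page_url for d in docs if needle in d.full_text.lower()]


if __name__ == "__main__":
    # Invented sample text, for illustration only.
    sample = ["Trout fishing report", "Weather notes", "More trout news"]
    for url in pages_matching(build_page_docs("oregondigital:df66z508t", sample), "trout"):
        print(url)
```

In a real setup each PageDoc would be sent to the index as its own document, and the search result for the parent object would list the page_url of every matching page.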