LISTSERV 16.5 - CODE4LIB Archives

Hi all,

In response to Laura's comment, I thought I'd share that at UBC we've
included the 'direct-to-page-link' functionality in our Open Collections
search interface
<https://open.library.ubc.ca/search?q=trout&p=0&sort=0&view=1&circle=n&dBegin=&dEnd=&c=2&collection=bcnewspapers>.
It is not loaded by default (you must select the 'detailed view' option, or
click to expand a particular result) because, as Chad mentioned, it has
quite a bit of overhead and in our testing only some (very vocal) users
consistently clicked the links. We use Elasticsearch, but it works much the
same way Josh described: firing an additional query for each 'compound
object' to search the page-level full-text metadata.
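
A minimal sketch of that kind of per-object follow-up query, in Python. It
is not the Open Collections code: the field names (parent_id, page_number,
full_text), the function names, and the example.org link template are all
assumptions made for illustration. It only builds the Elasticsearch request
body and turns a response into links; it does not talk to a server.

```python
def build_page_query(parent_id, terms, size=10):
    """Build a search body for page-level hits inside one compound object.

    Assumes each page is indexed as its own document carrying the
    hypothetical fields parent_id, page_number and full_text.
    """
    return {
        "size": size,
        "_source": ["page_number"],
        "query": {
            "bool": {
                # Score pages by how well their text matches the terms.
                "must": [{"match": {"full_text": terms}}],
                # Restrict the search to pages of this one object.
                "filter": [{"term": {"parent_id": parent_id}}],
            }
        },
        "sort": [{"page_number": "asc"}],
    }


def page_links(parent_id, response,
               template="https://example.org/items/{id}?page={page}"):
    """Turn a search response into one direct-to-page link per hit.

    The URL template is a placeholder, not a real viewer convention.
    """
    hits = response.get("hits", {}).get("hits", [])
    return [
        template.format(id=parent_id, page=hit["_source"]["page_number"])
        for hit in hits
    ]
```

The body from build_page_query could be sent to an index's _search endpoint
only when a user expands a result, which keeps the extra queries off the
default results page.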

-Schuyler



On Wed, Mar 2, 2016 at 10:00 AM, Laura Buchholz <[log in to unmask]>
wrote:

> Thanks guys, and thank you Shaun, for following up. This is exactly what I
> was hoping to learn.
>
> I have to admit I'm surprised that the "direct-to-page-link" functionality
> isn't more common in the newer/inspiring digital collections. It exists in
> CONTENTdm (not saying that is a reason it should continue to exist), and
> seems intuitively useful. We're planning on doing some usability testing
> soon, and I'm going to try to get feedback on this feature.
>
> On Tue, Mar 1, 2016 at 7:51 AM, Gum, Josh <[log in to unmask]>
> wrote:
>
> > Shaun,
> >
> > Thanks, I’m psyched to be at OSU!
> >
> > I think you’ve nailed down the process here, and there are a couple of
> > concepts that I wanted to follow up on:
> >
> > 1. “Download document from search results list”: This would be a simple
> > enhancement to the rendering of each search result, exposing the
> > download link. The software has access to all of the necessary values
> > (document ID, and how to generate a “downloads” link for it) at render
> > time, so adding a new link should be trivial. It seems like it would be a
> > good enhancement to me.
> >
> > 2. “Direct-to-page link”: Generating a link to guide a PDF reader to a
> > specific page [1] seems easy, although I’m not sure that every reader would
> > work the same. So the missing piece is being able to associate a Solr hit
> > with the page it was found on in the PDF. So, I think you’re right about
> > needing to index each page individually in order to facilitate rendering a
> > link to the specific page related to the search result hit being rendered on
> > the page.
> >
> > I can’t speak to the history behind implementing the search the way it is
> > right now. But it does seem like both of these concepts would be great
> > additions to the next installment of OregonDigital!
> >
> > [1] http://oregondigital.org/downloads/oregondigital:df66z508t?page=3
> >
> > ———————
> > Josh Gum
> > Oregon State University Libraries and Press
> >
> >
> >
> >
> >
> > On 2/29/16, 4:13 PM, "Code for Libraries on behalf of Shaun D. Ellis" <
> > [log in to unmask] on behalf of [log in to unmask]> wrote:
> >
> > >Josh,
> > >Congrats on the new gig, and thank you for this explanation of
> > >OregonDigital’s BookReader integration. I’m sorry I wasn’t more specific
> > >about this, but I think the original question had less to do with the
> > >BookReader integration, and more to do with a non-frameworky explanation of
> > >configuring Solr to return direct links to pages where the keywords appear
> > >in a “compound” object, such as a book.
> > >
> > >As the original poster (Laura Buchholz) mentioned, it seems like
> > >OregonDigital does not provide direct links until after the BookReader is
> > >loaded. It’s only then that pins are placed on the “slider nav” to
> > >indicate where the keyword appears. So, to answer the original question,
> > >it seems like all the full text may be dumped into a single Solr field that
> > >returns the object in the initial search result, and then, upon loading, the
> > >BookReader makes a subsequent query (limited to that one object) to retrieve
> > >the “data payload” in your example and locate the exact pages where the
> > >terms appear? Is that what’s going on there?
> > >
> > >I suppose if you wanted to return all the page numbers in the original
> > >search query, you may have to send each page individually to Solr to be
> > >indexed, and if you have a viewer with conventions for "deep linking" (like
> > >the BookReader has) you could generate the link for each page and index it
> > >to provide this functionality.
> > >
> > >I was curious as folks were posting all the inspiring digital collections
> > >sites earlier today, so I looked for this pattern but didn’t see it. Most
> > >of the apps use the same pattern as OregonDigital (although my testing was
> > >not particularly thorough, so let me know if I’m wrong, folks!). On the
> > >other hand, you do see the "direct-to-page link" interface with both Amazon
> > >and Google Books search, which takes you directly to the page from the
> > >initial search results.
> > >
> > >So, I’m not sure if this was a conscious design decision on the part of
> > >library digital collections creators, if the pattern is followed because
> > >it’s considered a “best practice” or a “convention” in our field, or if it
> > >was just simpler to implement.
> > >
> > >Thanks again for the follow up,
> > >Shaun
> > >
> > >> On Feb 26, 2016, at 2:51 PM, Gum, Josh <[log in to unmask]> wrote:
> > >>
> > >> I’m very new (<1 month) to Oregon State University, library technology,
> > >> and Code4Lib. So please bear with me. Also, I’m going to put a disclaimer
> > >> out that I may be missing some of the picture here. I’m willing to lend a
> > >> hand digging into more details if needed, so please feel free to ask.
> > >>
> > >> Also, I’m going to split this part of the discussion into a separate
> > >> thread, so we can address the question regarding the OregonDigital
> > >> BookReader integration. I’ve done some digging this morning, and spoke to a
> > >> colleague who took part in some of the text extraction for PDF assets in
> > >> OregonDigital. I’m hopeful that these details are enough to help connect
> > >> the dots regarding our integration.
> > >>
> > >> ————————————
> > >> When ingesting a PDF asset [1], we have a shell-based processor [2]
> > >> which executes “pdftotext” [3] to extract and store the text from a PDF
> > >> with bounding boxes around each word in the file.
> > >>
> > >> The command executed on the server:
> > >> pdftotext -enc UTF-8 '#{file_path}' '#{output_file}' -bbox
> > >>
> > >> The web UI for viewing a PDF and highlighting results is tied to
> > >> BookReader [4], which has a great amount of functionality and is well
> > >> documented online! [5]
> > >>
> > >> The BookReader is making calls to a “full_text” action on the
> > >> document_controller to find the location of the search terms. [6] This
> > >> JSONP call to our web server uses
> > >> OregonDigital::OCR::BookreaderSearchGenerator [7] to supply the properly
> > >> formatted page and bounding box results to BookReader to use in updating
> > >> its UI with the appropriate highlights and place marker icons. If you use
> > >> something like the Chrome DevTools while searching for a term on the
> > >> BookReader UI, you can see the data payload that is returned from the
> > >> server. For instance, here’s a snippet of one search I did:
> > >>
> > >>
> > >> (apologies if the tabs don’t remain in the email)
> > >> matches: [
> > >>      {
> > >>              par: [
> > >>                      {
> > >>                              page: 2,
> > >>                              boxes: [
> > >>                                      {r: 128.62286274509802, l: 101.30935784313726, b: 27.52538962121212, t: 19.953774090909093, page: 2},
> > >>                                      {r: 59.883534313725484, l: 29.41176470588235, b: 242.4078138636364, t: 234.83619833333336, page: 2},
> > >>                                      {r: 106.32754411764705, l: 80.37296078431372, b: 546.3512438560606, t: 538.7796283257576, page: 2}
> > >>                              ],
> > >>                              text: "McKenzie Highway {{{Historic}}} District…
> > >>                      }
> > >>              ]
> > >>      }
> > >> ]
> > >>
> > >>
> > >> [1] https://github.com/OregonDigital/oregondigital/blob/master/app/models/document.rb
> > >> [2] https://github.com/OregonDigital/oregondigital/blob/d82d944d55dd087d2670b3f065725ef0e5ddc4ce/lib/hydra/derivatives/pdf_text_processor.rb
> > >> [3] http://www.manpagez.com/man/1/pdftotext/
> > >> [4] http://github.com/openlibrary/bookreader/
> > >> [5] https://openlibrary.org/dev/docs/bookreader
> > >> [6] https://github.com/OregonDigital/oregondigital/blob/master/app/controllers/document_controller.rb
> > >> [7] https://github.com/OregonDigital/oregondigital/blob/master/lib/oregon_digital/ocr/bookreader_search_generator.rb
> > >> ———————————
> > >>
> > >> Josh Gum
> > >> Oregon State University Libraries and Press
> > >>
> > >>
> > >>
> > >>
> > >>
> > >>
> > >> On 2/26/16, 7:07 AM, "Code for Libraries on behalf of Shaun D. Ellis" <
> > >> [log in to unmask] on behalf of [log in to unmask]> wrote:
> > >>
> > >>> … //SNIPPED
> > >>> I have to admit that I was disappointed that the recent question about
> > >>> full-text searching basics (behind OregonDigital’s in-page highlighting of
> > >>> keywords in the IA BookReader) went basically unanswered. This was a
> > >>> well-articulated, legitimate question, and at least a few people on this
> > >>> list should be able to answer it. It’s actually on my list to try to do it
> > >>> so that I can report back, but maybe someone could save me the trouble and
> > >>> quench our curiosity?
> > >>>
> > >>> Cheers,
> > >>> Shaun
> >
>
>
>
> --
> Laura Buchholz
> Digital Projects Librarian
> Reed College Library
> 503-517-7629
> [log in to unmask]
>