Eric,

>  5. Use pdftotext to extract the OCRed text
>    from the PDF and index it along with
>    the MyLibrary metadata using Solr. [3, 4]
>

Have you considered using Solr's ExtractingRequestHandler [1] for the
PDFs? We're using it at NYPL with pretty great success.
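
Roughly, handing a PDF to the handler could look like the Python sketch
below. It is only an illustration under some assumptions: the handler is
mounted at /update/extract (as in Solr's example configuration), the Solr
base URL and the "title" field are placeholders, and the helper names
build_extract_url and index_pdf are made up for this example. The
literal.* parameters are how the handler lets you attach your own
metadata (here, what would come from MyLibrary) to the extracted text.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_extract_url(solr_base, doc_id, metadata=None, commit=True):
    """Build the URL for a request to Solr's ExtractingRequestHandler.

    solr_base is an assumed base URL such as http://localhost:8983/solr.
    Each metadata entry is sent as a literal.<field> parameter so it is
    indexed alongside the text Solr extracts from the file.
    """
    params = [("literal.id", doc_id)]
    for field, value in sorted((metadata or {}).items()):
        params.append(("literal." + field, value))
    if commit:
        params.append(("commit", "true"))
    return solr_base.rstrip("/") + "/update/extract?" + urlencode(params)


def index_pdf(solr_base, pdf_path, doc_id, metadata=None):
    """POST the raw PDF bytes to Solr and return the HTTP status code."""
    with open(pdf_path, "rb") as fh:
        body = fh.read()
    request = Request(
        build_extract_url(solr_base, doc_id, metadata),
        data=body,
        headers={"Content-Type": "application/pdf"},
        method="POST",
    )
    with urlopen(request) as response:
        return response.status
```

Something like index_pdf("http://localhost:8983/solr", "book.pdf",
"book-1", {"title": "A Book"}) would then send the file in one request,
with Solr doing the text extraction on its side. The field names have to
exist in your schema, so adjust them to match.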

[1] http://wiki.apache.org/solr/ExtractingRequestHandler

Mark A. Matienzo
Applications Developer, Digital Experience Group
The New York Public Library