Eric,
> 5. Use pdftotext to extract the OCRed text
> from the PDF and index it along with
> the MyLibrary metadata using Solr. [3, 4]
>
Have you considered using Solr's ExtractingRequestHandler [1] for the
PDFs? We're using it at NYPL with pretty great success.

[1] http://wiki.apache.org/solr/ExtractingRequestHandler
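
To make the suggestion concrete, here is a minimal sketch of posting a PDF straight to the handler, so that Solr does the text extraction and the separate pdftotext step is not needed. It assumes the handler is mapped at /update/extract (the conventional path) and that the metadata fields already exist in the schema. The function names, the base URL, and the field names in the usage note are hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_extract_url(solr_base, doc_id, metadata=None, commit=True):
    """Build the request URL for Solr's ExtractingRequestHandler.

    Each metadata field is passed as a literal.<field> parameter, which
    stores the value alongside the text Solr extracts from the PDF.
    """
    params = [("literal.id", doc_id)]
    # Sort the fields so the generated URL is deterministic.
    for field, value in sorted((metadata or {}).items()):
        params.append(("literal." + field, value))
    if commit:
        params.append(("commit", "true"))
    # Assumption: the handler is registered at /update/extract.
    return solr_base.rstrip("/") + "/update/extract?" + urlencode(params)


def index_pdf(solr_base, doc_id, pdf_path, metadata=None):
    """POST the raw PDF bytes to Solr and return the HTTP status code."""
    with open(pdf_path, "rb") as handle:
        body = handle.read()
    request = Request(
        build_extract_url(solr_base, doc_id, metadata),
        data=body,
        headers={"Content-Type": "application/pdf"},
    )
    with urlopen(request) as response:
        return response.status
```

A call might look like index_pdf("http://localhost:8983/solr", "doc1", "scan.pdf", {"title": "A Book"}), with the metadata dictionary filled from the MyLibrary record.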

Mark A. Matienzo
Applications Developer, Digital Experience Group
The New York Public Library