Here's a post on how easy it is to send PDF documents to Solr from Java:
<http://www.lucidimagination.com/blog/2009/09/14/posting-rich-documents-to-apache-solr-using-solrj-and-solr-cell-apache-tika/>
Not only can you post PDF (and other rich content) files to Solr for
indexing; as that blog entry shows, you can also have Solr extract the
text from such files and return it to the client. This capability
makes the tool chain a bit simpler.
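As a rough sketch of both modes (index, or extract and return), here is
what the request can look like. Note that this is not the SolrJ code
from the blog entry: it swaps in plain HTTP via the JDK's
java.net.http client (Java 11+) so that it needs no extra jars. The
class name, the method names, and the Solr base URL are my own
placeholders; the /update/extract path and the extractOnly,
extractFormat, literal.id, and commit parameters are those of the
ExtractingRequestHandler, assuming it is enabled in your Solr config.

```java
import java.io.IOException;
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;

// Hypothetical helper, not part of Solr or SolrJ.
public class SolrCellSketch {

    // Builds the ExtractingRequestHandler URL.
    // solrBase is an assumption, e.g. http://localhost:8983/solr
    static URI extractUri(String solrBase, String docId, boolean extractOnly) {
        String base = solrBase.endsWith("/")
                ? solrBase.substring(0, solrBase.length() - 1)
                : solrBase;
        StringBuilder url = new StringBuilder(base).append("/update/extract?");
        if (extractOnly) {
            // Return the extracted text to the client instead of indexing it.
            url.append("extractOnly=true&extractFormat=text");
        } else {
            // Index the document under the given unique key and commit.
            url.append("literal.id=")
               .append(URLEncoder.encode(docId, StandardCharsets.UTF_8))
               .append("&commit=true");
        }
        return URI.create(url.toString());
    }

    // Posts the raw PDF bytes as the request body and returns Solr's response.
    static String postPdf(URI uri, Path pdf) throws IOException, InterruptedException {
        HttpRequest request = HttpRequest.newBuilder(uri)
                .header("Content-Type", "application/pdf")
                .POST(HttpRequest.BodyPublishers.ofFile(pdf))
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        return response.body();
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 3) {
            System.err.println("usage: SolrCellSketch <solrBase> <file.pdf> <docId> [extractOnly]");
            return;
        }
        boolean extractOnly = args.length > 3 && args[3].equals("extractOnly");
        URI uri = extractUri(args[0], args[2], extractOnly);
        System.out.println(postPdf(uri, Path.of(args[1])));
    }
}
```

Run it with the extra "extractOnly" argument to get the text back
without touching the index; leave it off to index the file under the
given id.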
Erik
On Sep 15, 2009, at 10:31 AM, Peter Kiraly wrote:
> Hi all,
>
> I would like to suggest an API for extracting text (including
> highlighted or annotated text) from PDF: iText
> (http://www.lowagie.com/iText/). It is a Java API (with a C# port),
> and it helped me a lot when we worked with extraordinary PDF files.
>
> Solr uses Tika (http://lucene.apache.org/tika) for extracting text
> from documents, and Tika uses PDFBox
> (http://incubator.apache.org/pdfbox/) to extract text from PDF files.
> PDFBox is a great tool for ordinary PDF files, but it has (or at
> least had) some shortcomings that I wasn't satisfied with:
>
> - it consumed more memory than iText, and couldn't read files
> above a given size (the limit was large, about 1 GB, but we had
> even larger files)
>
> - it couldn't correctly handle the conditional hyphens at the end
> of a line
>
> - it had poorer documentation than iText, and its API was also
> poorer (around that time Manning published the iText in Action
> book).
>
> Our PDF files were double layered (original hi-res image + OCR-ed
> text) documents several thousand pages long (Hungarian scientific
> journals, the diary of the Houses of Parliament from the 19th
> century, etc.). We indexed the content with Lucene, and in the UI we
> showed one page per screen, so the user didn't need to download the
> full PDF. We also extracted the table of contents from the PDF and
> implemented it in the web UI, so the user can browse pages according
> to the full file's TOC.
>
> This project happened two years ago, so it is possible that a lot
> has changed since then.
>
> Király Péter
> http://eXtensibleCatalog.org
>
> ----- Original Message ----- From: "Mark A. Matienzo" <[log in to unmask]>
> To: <[log in to unmask]>
> Sent: Tuesday, September 15, 2009 3:56 PM
> Subject: Re: [CODE4LIB] indexing pdf files
>
>
>> Eric,
>>
>>> 5. Use pdftotext to extract the OCRed text
>>> from the PDF and index it along with
>>> the MyLibrary metadata using Solr. [3, 4]
>>>
>>
>> Have you considered using Solr's ExtractingRequestHandler [1] for the
>> PDFs? We're using it at NYPL with pretty great success.
>>
>> [1] http://wiki.apache.org/solr/ExtractingRequestHandler
>>
>> Mark A. Matienzo
>> Applications Developer, Digital Experience Group
>> The New York Public Library