Hi all,

I would like to suggest an API for extracting text (including highlighted or
annotated text) from PDF files: iText (http://www.lowagie.com/iText/).
It is a Java API (with a C# port), and it helped me a lot when we worked
with unusual PDF files.

Solr uses Tika (http://lucene.apache.org/tika) to extract text from
documents, and Tika in turn uses PDFBox (http://incubator.apache.org/pdfbox/)
for PDF files. PDFBox is a great tool for ordinary PDF files, but it has
(or at least had) some shortcomings that I was not satisfied with:

- it consumed more memory than iText, and couldn't
read files above a certain size (the limit was large, about 1 GB, but we
had even larger files)
- it couldn't correctly handle conditional hyphens at the end of
a line
- it had poorer documentation than iText, and its API was also
poorer (around that time Manning published the iText in Action book).
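
To illustrate the conditional hyphen problem, here is a minimal sketch of the
kind of post-processing one can apply to extracted lines. It is not iText or
PDFBox code. It assumes the extractor hands back one string per line and emits
a conditional hyphen as the Unicode soft hyphen (U+00AD); some PDFs use a plain
"-" instead, which is ambiguous and is deliberately left alone here. The class
and method names are made up for the example.

```java
import java.util.List;

// Hypothetical helper, not part of iText or PDFBox.
final class SoftHyphenJoiner {
    // Assumption: conditional hyphens arrive as U+00AD (SOFT HYPHEN).
    private static final char SOFT_HYPHEN = '\u00AD';

    /** Joins extracted lines, merging words split by a line-end soft hyphen. */
    static String joinLines(List<String> lines) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < lines.size(); i++) {
            String line = lines.get(i).trim(); // trim() does not remove U+00AD
            boolean last = (i == lines.size() - 1);
            boolean split = !line.isEmpty()
                    && line.charAt(line.length() - 1) == SOFT_HYPHEN;
            if (!last && split) {
                // Drop the soft hyphen and glue the next line on without a space.
                out.append(line, 0, line.length() - 1);
            } else {
                out.append(line);
                if (!last) {
                    out.append(' ');
                }
            }
        }
        // Soft hyphens inside a line are invisible hints, so remove them too.
        return out.toString().replace(String.valueOf(SOFT_HYPHEN), "");
    }
}
```

Doing this before indexing keeps a word such as "parliament" searchable even
when it was broken across two lines in the OCR layer.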

Our PDF files were double-layered (original hi-res image + OCRed text)
documents, several thousand pages long (Hungarian scientific journals,
the diary of the Houses of Parliament from the 19th century, etc.). We
indexed the content with Lucene, and in the UI we showed one page per screen,
so the user didn't need to download the full PDF. We also extracted the
table of contents from the PDF and implemented it in the web UI,
so the user could browse pages according to the full file's TOC.
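
As a rough sketch of the page-to-TOC navigation (not our original code, and
independent of Lucene and iText), the lookup can be modelled as a sorted map
from each TOC entry's start page to its title. The sketch assumes a flat TOC
with no nesting; the class name, method names, and any titles you feed it are
hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: maps a page number to the TOC entry that contains it.
final class TocIndex {
    // Key: first page of a TOC entry; value: the entry's title.
    private final TreeMap<Integer, String> startPageToTitle = new TreeMap<>();

    void addEntry(int startPage, String title) {
        startPageToTitle.put(startPage, title);
    }

    /** Title of the entry containing the page, or null before the first entry. */
    String titleForPage(int page) {
        Map.Entry<Integer, String> entry = startPageToTitle.floorEntry(page);
        return entry == null ? null : entry.getValue();
    }

    /** First page of the next entry, or null if the page is in the last entry. */
    Integer nextSectionStart(int page) {
        return startPageToTitle.higherKey(page);
    }
}
```

With one index document per page, a lookup like this is enough to show which
section the current page belongs to and to offer a jump to the next section.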

This project took place two years ago, so it is possible that many things
have changed since then.

Király Péter
http://eXtensibleCatalog.org

----- Original Message ----- 
From: "Mark A. Matienzo" <[log in to unmask]>
To: <[log in to unmask]>
Sent: Tuesday, September 15, 2009 3:56 PM
Subject: Re: [CODE4LIB] indexing pdf files


> Eric,
>
>>  5. Use pdftotext to extract the OCRed text
>>    from the PDF and index it along with
>>    the MyLibrary metadata using Solr. [3, 4]
>>
>
> Have you considered using Solr's ExtractingRequestHandler [1] for the
> PDFs? We're using it at NYPL with pretty great success.
>
> [1] http://wiki.apache.org/solr/ExtractingRequestHandler
>
> Mark A. Matienzo
> Applications Developer, Digital Experience Group
> The New York Public Library
>