There is a method named getActualText() in PDFBox, but some listserv
postings (circa 2012) indicate that the command-line PDFBox did not
support extraction of the ActualText contents at that time. That may have
changed. I'd like to know more.
Thank you Andrew for sending me scurrying to learn about ActualText. I
don't think we have any in any of the PDFs that I'm indexing, but I
wouldn't have known it existed without your posting.
On Mon, Feb 8, 2016 at 11:56 AM, Han, Yan - (yhan) <[log in to unmask]> wrote:
> Yes. Use iText or PDFBox.
> These are common PDF libraries.
> On 2/6/16, 2:24 PM, "Code for Libraries on behalf of Andrew Cunningham" <
> [log in to unmask] on behalf of [log in to unmask]> wrote:
> >Hi all,
> >I am working with PDF files in some South Asian and South East Asian
> >languages. Each PDF has ActualText added for each tag in the PDF. Each PDF
> >has ActualText as an alternative for the visible text layer in the PDF.
> >Is anyone aware of tools that will allow me to index and search PDFs based
> >on the ActualText content rather than the visible text layers in the PDF?
> >Andrew Cunningham
> >[log in to unmask]