Thanks Levy, will look at PDFBox and see what I can leverage from it.

Andrew

On 9 February 2016 at 04:33, Levy, Michael <[log in to unmask]> wrote:

> There is a method named getActualText() in PDFBox, but some listserv
> postings (circa 2012) indicate that the command-line PDFBox did not
> support extraction of the ActualText contents at that time. That may have
> changed. I'd like to know more.
>
> Thank you Andrew for sending me scurrying to learn about ActualText. I
> don't think we have any in any of the PDFs that I'm indexing, but I
> wouldn't have known it existed without your posting.
>
> On Mon, Feb 8, 2016 at 11:56 AM, Han, Yan - (yhan) <[log in to unmask]>
> wrote:
>
> > Yes. Use iText or PDFBox.
> >
> > These are common PDF libraries.
> >
> > On 2/6/16, 2:24 PM, "Code for Libraries on behalf of Andrew Cunningham" <
> > [log in to unmask] on behalf of [log in to unmask]> wrote:
> >
> > >Hi all,
> > >
> > >I am working with PDF files in some South Asian and South East Asian
> > >languages. Each PDF has ActualText added for each tag in the PDF. Each
> > >PDF has ActualText as an alternative for the visible text layer in the
> > >PDF.
> > >
> > >Is anyone aware of tools that will allow me to index and search PDFs
> > >based on the ActualText content rather than the visible text layers in
> > >the PDF?
> > >
> > >Andrew
> > >
> > >--
> > >Andrew Cunningham
> > >[log in to unmask]

--
Andrew Cunningham
[log in to unmask]