LISTSERV 16.5 - CODE4LIB Archives

Thanks, Levy. I will look at PDFBox and see what I can leverage from it.

Andrew


On 9 February 2016 at 04:33, Levy, Michael <[log in to unmask]> wrote:

> There is a method named getActualText() in PDFBox, but some listserv
> postings (circa 2012) indicate that the command-line PDFBox did not
> support extraction of the ActualText contents at that time. That may have
> changed. I'd like to know more.
>
> Thank you Andrew for sending me scurrying to learn about ActualText. I
> don't think we have any in any of the PDFs that I'm indexing, but I
> wouldn't have known it existed without your posting.
>
>
> On Mon, Feb 8, 2016 at 11:56 AM, Han, Yan - (yhan) <[log in to unmask]> wrote:
>
> > Yes. Use iText or PDFBox.
> >
> > These are common PDF libraries.
> >
> > On 2/6/16, 2:24 PM, "Code for Libraries on behalf of Andrew Cunningham" <
> > [log in to unmask] on behalf of [log in to unmask]> wrote:
> >
> > >Hi all,
> > >
> > >I am working with PDF files in some South Asian and South East Asian
> > >languages. Each PDF has ActualText added for each tag in the PDF. Each
> > >PDF has ActualText as an alternative for the visible text layer in the
> > >PDF.
> > >
> > >Is anyone aware of tools that will allow me to index and search PDFs
> > >based on the ActualText content rather than the visible text layers in
> > >the PDF?
> > >
> > >Andrew
> > >
> > >--
> > >Andrew Cunningham
> > >[log in to unmask]
> >
>



-- 
Andrew Cunningham
[log in to unmask]