LISTSERV 16.5 - CODE4LIB Archives

On 10/15/13 11:45 AM, Eric Lease Morgan wrote:
> On Oct 14, 2013, at 7:56 AM, Nicolas Franck wrote:
>
>> Could this also be done by Apache Tika? Or am I missing a crucial point?
>>
>> http://tika.apache.org/1.4/gettingstarted.html
>
>
> Nicolas, this looks VERY promising! It seemingly can extract the OCR from a PDF document as well as extract the text from a Word document. 'More experimenting, but thank you. code4lib++  --Eric Morgan

In case they are of use to anyone, here are links I've collected over
the years (some may be dead) to other tools that can extract text from
a vector PDF (as opposed to a raster PDF, which still needs to be
OCR'd):

* pdfx: http://pdfx.cs.man.ac.uk/

* LA-PDFText: https://code.google.com/p/lapdftext/

* pdf2htmlEX: https://github.com/coolwanglu/pdf2htmlEX

* Apache PDFBox: http://pdfbox.apache.org/

* pdf2txt.py, part of PDFMiner: http://www.unixuser.org/~euske/python/pdfminer/

* pdftotext (part of xpdf)
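
For anyone who wants to try one of these from a script, here is a rough Python sketch that shells out to pdftotext. It assumes the pdftotext binary is installed and on the PATH; the function names are made up for illustration, and it only helps with vector PDFs that already carry a text layer.

```python
import shutil
import subprocess
import sys


def build_pdftotext_command(pdf_path, layout=False):
    """Return the argument list for a pdftotext run that writes to stdout."""
    cmd = ["pdftotext", "-enc", "UTF-8"]
    if layout:
        cmd.append("-layout")  # ask pdftotext to keep the physical page layout
    cmd += [pdf_path, "-"]     # "-" as the output file means standard output
    return cmd


def extract_text(pdf_path, layout=False):
    """Return the text of a vector PDF, or None if pdftotext is not installed."""
    if shutil.which("pdftotext") is None:
        return None  # assumption: the caller falls back to another tool
    result = subprocess.run(
        build_pdftotext_command(pdf_path, layout),
        capture_output=True,
        check=True,  # raise CalledProcessError if pdftotext reports a failure
    )
    return result.stdout.decode("utf-8", errors="replace")


if __name__ == "__main__" and len(sys.argv) > 1:
    text = extract_text(sys.argv[1])
    print(text if text is not None else "pdftotext not found on PATH")
```

A raster PDF will most likely come back as an empty or near-empty string, which is a cheap way to spot the files that still need OCR.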

See also the list at http://scholrev.org/hackathon/ and this discussion 
of using Jade, Gemini, and Adobe Acrobat to extract text from a PDF: 
http://www.ncbi.nlm.nih.gov/books/NBK61837/ .

--Kevin