On 10/15/13 11:45 AM, Eric Lease Morgan wrote:
> On Oct 14, 2013, at 7:56 AM, Nicolas Franck <[log in to unmask]> wrote:
>
>> Could this also be done by Apache Tika? Or do I miss a crucial point?
>>
>> http://tika.apache.org/1.4/gettingstarted.html
>
>
> Nicolas, this looks VERY promising! It seemingly can extract the OCR from a PDF document as well as extract the text from a Word document. 'More experimenting, but thank you. code4lib++ --Eric Morgan

In case they are of use to anyone, here are links I've collected over
the years (some may be dead) to other tools that can extract text from
a vector PDF (not a raster one that still needs to be OCR'd):
* pdfx: http://pdfx.cs.man.ac.uk/
* LA-PDFText: https://code.google.com/p/lapdftext/
* pdf2htmlEX: https://github.com/coolwanglu/pdf2htmlEX
* Apache PDFBox: http://pdfbox.apache.org/
* pdf2txt.py, part of PDFMiner: http://www.unixuser.org/~euske/python/pdfminer/
* pdftotext (part of xpdf)
See also the list at http://scholrev.org/hackathon/ and this discussion
of using Jade, Gemini, and Adobe Acrobat to extract text from a PDF:
http://www.ncbi.nlm.nih.gov/books/NBK61837/ .
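
For anyone who wants to script one of the tools above, here is a minimal
sketch that calls pdftotext from Python. It assumes the pdftotext binary
is installed and on your PATH. The function names and the file name
paper.pdf are made up for illustration; only the pdftotext command, its
-layout option, and the "-" convention for writing to standard output
come from the tool itself. A raster-only PDF will typically come back
with little or no text.

```python
import shutil
import subprocess


def build_pdftotext_command(pdf_path, keep_layout=False):
    """Return the argument list for a pdftotext call that writes to stdout.

    Hypothetical helper; not part of xpdf or any library.
    """
    command = ["pdftotext"]
    if keep_layout:
        # -layout asks pdftotext to keep the physical layout of the page.
        command.append("-layout")
    # A "-" in place of the output file name sends the text to stdout.
    command.extend([pdf_path, "-"])
    return command


def extract_text(pdf_path, keep_layout=False):
    """Run pdftotext and return the text, or None if pdftotext is missing."""
    if shutil.which("pdftotext") is None:
        return None
    result = subprocess.run(
        build_pdftotext_command(pdf_path, keep_layout),
        capture_output=True,
        check=True,  # raise CalledProcessError if pdftotext reports failure
    )
    # Decoding as UTF-8 is an assumption; undecodable bytes are replaced.
    return result.stdout.decode("utf-8", errors="replace")


if __name__ == "__main__":
    # "paper.pdf" is a placeholder; substitute a real vector PDF.
    print(build_pdftotext_command("paper.pdf", keep_layout=True))
```

Keeping the command construction separate from the subprocess call makes
it easy to check the arguments without having a PDF at hand.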
--Kevin