On Oct 14, 2013, at 7:56 AM, Nicolas Franck wrote:
> Could this also be done by Apache Tika? Or am I missing a crucial point?
Nicolas, this looks VERY promising! It seems able to extract the OCR text from a PDF document as well as the text from a Word document. More experimenting to do, but thank you. code4lib++

--Eric Morgan