Tesseract is going to be slow, and there might not be much you can do about that. You can do a couple of things, like set up processes that run on AWS EC2 spot instances, so you can put a standing bid order on AWS instances and only run your OCR when the price drops. Or you can buy ABBYY, which is much faster.

b,chris.

On Tue, Dec 9, 2014 at 5:45 PM, Kyle Banerjee wrote:

> > I’m not quite sure if I understand the question, but if all you want to
> > do is pull the text out of an OCR’ed PDF file, then I have found both Tika
> > and PDFtotext to be useful tools....
> >
> > On the other hand, if you need to do the OCR itself, then employing
> > Tesseract is probably the way to go.
>
> For clarity, I have to do the OCR itself. I've been using CAM::PDF to
> extract existing text.
>
> Kyle
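
For what it's worth, here is a rough Python sketch of the "only run OCR when the price drops" idea from the reply above. It is an illustration under assumptions, not a tested deployment: `build_tesseract_command`, `should_run`, `run_batch`, and the `get_price` callback are hypothetical names, and how you actually look up the current spot price (for example from the EC2 spot price history) is left to the caller. The only external piece it relies on is the basic `tesseract <image> <outputbase>` command-line form.

```python
import subprocess
from pathlib import Path
from typing import Callable, Iterable, List


def build_tesseract_command(image: Path, out_dir: Path) -> List[str]:
    """Build `tesseract <image> <outputbase>`; Tesseract appends .txt itself."""
    return ["tesseract", str(image), str(out_dir / image.stem)]


def should_run(current_price: float, max_bid: float) -> bool:
    """Run only while the current spot price is at or below our bid."""
    return current_price <= max_bid


def run_batch(
    images: Iterable[Path],
    out_dir: Path,
    get_price: Callable[[], float],  # hypothetical: caller supplies the price lookup
    max_bid: float,
    runner=subprocess.run,
) -> List[Path]:
    """OCR images one at a time, stopping as soon as the price exceeds the bid.

    Returns the images that were processed, so the rest can be retried later.
    """
    done: List[Path] = []
    for image in images:
        if not should_run(get_price(), max_bid):
            break  # price went above the bid; leave the remainder for later
        runner(build_tesseract_command(image, out_dir), check=True)
        done.append(image)
    return done
```

With real spot instances AWS reclaims the machine when the price rises above your bid, so the useful part of the sketch is that each image is an independent unit of work and the list of finished images tells you where to resume.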