Art Rhyno talked about doing this with scans of old community newspapers a few years ago (https://www.youtube.com/watch?v=gcjCiS9pJ3A).

Yes, it's very compute intensive and slow. He set up Hadoop to farm jobs out to the PCs in the library's public lab while the library was closed at night.

- David

On 2014/12/11 03:59, Chris Fitzpatrick wrote:
> Tesseract is going to be slow, and there might not be much you can do about
> that.
>
> You can do a couple of things, like set up processes that run on AWS EC2
> spot instances, so you can put a standing bid order on AWS instances and
> only run your OCR when the price drops.
>
> Or you can buy ABBYY, which is much faster.
>
> b,chris.
>
> On Tue, Dec 9, 2014 at 5:45 PM, Kyle Banerjee wrote:
>
>>> I'm not quite sure if I understand the question, but if all you want to
>>> do is pull the text out of an OCR'ed PDF file, then I have found both Tika
>>> and PDFtotext to be useful tools....
>>>
>>> On the other hand, if you need to do the OCR itself, then employing
>>> Tesseract is probably the way to go.
>>
>> For clarity, I have to do the OCR itself. I've been using CAM::PDF to
>> extract existing text.
>>
>> Kyle
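
As a rough illustration of the "farm the jobs out" idea on a single machine, here is a minimal Python sketch that runs several Tesseract processes side by side. It is not the Hadoop setup described above, only a small-scale stand-in. It assumes the `tesseract` command-line tool is installed and on the PATH, and that the pages already exist as image files; the function names, the `workers` default, and the list of file extensions are hypothetical choices made for this example.

```python
from __future__ import annotations

import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

# Assumed set of page-image extensions; adjust to whatever your scans use.
IMAGE_SUFFIXES = {".png", ".tif", ".tiff"}


def tesseract_command(image: Path, out_dir: Path, lang: str = "eng") -> list[str]:
    """Build the CLI call. Tesseract appends '.txt' to the output base itself."""
    return ["tesseract", str(image), str(out_dir / image.stem), "-l", lang]


def ocr_page(image: Path, out_dir: Path, run=subprocess.run) -> Path:
    """OCR one page image and return the path of the text file it should produce."""
    run(tesseract_command(image, out_dir), check=True, capture_output=True)
    return out_dir / (image.stem + ".txt")


def ocr_directory(in_dir: Path, out_dir: Path, workers: int = 4,
                  run=subprocess.run) -> list[Path]:
    """OCR every page image in in_dir, several pages at a time."""
    out_dir.mkdir(parents=True, exist_ok=True)
    pages = sorted(p for p in in_dir.iterdir()
                   if p.suffix.lower() in IMAGE_SUFFIXES)
    # Threads are enough here: each worker only waits on an external
    # tesseract process, which does the CPU-heavy work.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda p: ocr_page(p, out_dir, run), pages))
```

Calling `ocr_directory(Path("scans"), Path("text"), workers=8)` would write one `.txt` file per page image into `text/`. This does not make any single page faster; it only keeps more cores busy, so throughput should scale roughly with the number of workers until the CPU is saturated.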