Simon Spero wrote:
> Another option is to use the ABBYY FineReader
> SDK<http://www.abbyy.com/ocr_sdk_linux/overview/>.
> Annoyingly, the Linux version is one release behind the Windows SDK (which
> has improved support for multi-core processing of a single document). Since
> Owen's problem is embarrassingly parallel, multi-core tuning isn't as
> useful as being able to run on a local cluster or regional grid. ABBYY
> software tends to be a little pricey, but the results are usually very good.

If you're going to OCR, Nuance OmniPage is also very good, and I believe
it costs about the same as FineReader. We also use tOCR, from Transym,
which is Windows-only but very accurate and cheap.

I have yet to see decent results on complicated pages (technical papers)
from either OCRopus or Tesseract with the default models they ship with;
I believe they're both still aimed at book-page OCR.

Bill