Thanks to all for the info and suggestions - we'll have a look at them. Via another route I've had http://snowtide.com/PDFTextStream recommended (commercial, but looks like they are generally open to offering academic licenses for free at least for a limited period) - anyone tried that?

Owen

Owen Stephens
Owen Stephens Consulting
Web: http://www.ostephens.com
Email: [log in to unmask]
Telephone: 0121 288 6936

On 22 Jun 2011, at 03:43, Bill Janssen wrote:

> Simon Spero <[log in to unmask]> wrote:
>
>> Another option is to use the ABBYY FineReader
>> SDK<http://www.abbyy.com/ocr_sdk_linux/overview/>.
>> Annoyingly, the linux version is one release behind the windows SDK (which
>> has improved support for multi core processing of single document). Since
>> Owen's problem is embarrassingly parallel, multi-core tuning isn't as
>> useful as being able to run on a local cluster or regional grid. ABBYY
>> software tends to be a little pricey, but the results are usually very good.
>
> If you're going to OCR, Nuance OmniPage is also very good, and I believe
> costs about the same as FineReader. We also use tOCR, from Transym,
> which is Windows-only, but very accurate and cheap. I have yet to see
> decent results on complicated pages (technical papers) from either
> OCRopus or Tesseract with the default models that they come with; I
> believe they're both still aimed at book page OCR.
>
> Bill