Thanks to all for the info and suggestions - we'll have a look at them.
Via another route I've had PDFTextStream (http://snowtide.com/PDFTextStream) recommended. It is commercial, but it looks like they are generally open to offering free academic licenses, at least for a limited period. Has anyone tried it?
Owen Stephens Consulting
Telephone: 0121 288 6936
On 22 Jun 2011, at 03:43, Bill Janssen wrote:
> Simon Spero wrote:
>> Another option is to use the ABBYY FineReader
>> Annoyingly, the Linux version is one release behind the Windows SDK (which
>> has improved support for multi-core processing of a single document). Since
>> Owen's problem is embarrassingly parallel, multi-core tuning isn't as
>> useful as being able to run on a local cluster or regional grid. ABBYY
>> software tends to be a little pricey, but the results are usually very good.
> If you're going to OCR, Nuance OmniPage is also very good, and I believe
> costs about the same as FineReader. We also use TOCR, from Transym,
> which is Windows-only, but very accurate and cheap. I have yet to see
> decent results on complicated pages (technical papers) from either
> OCRopus or Tesseract with their default models; I believe they're
> both still aimed at book page OCR.