LISTSERV 16.5 - CODE4LIB Archives

Thanks to all for the info and suggestions - we'll have a look at them.

Via another route, I've had PDFTextStream (http://snowtide.com/PDFTextStream) recommended. It's commercial, but they appear to be generally open to offering free academic licenses, at least for a limited period. Has anyone tried it?

Owen

Owen Stephens
Owen Stephens Consulting
Web: http://www.ostephens.com
Email: [log in to unmask]
Telephone: 0121 288 6936

On 22 Jun 2011, at 03:43, Bill Janssen wrote:

> Simon Spero <[log in to unmask]> wrote:
> 
>> Another option is to use the ABBYY FineReader SDK
>> <http://www.abbyy.com/ocr_sdk_linux/overview/>.
>> Annoyingly, the Linux version is one release behind the Windows SDK (which
>> has improved support for multi-core processing of a single document). Since
>> Owen's problem is embarrassingly parallel, multi-core tuning isn't as
>> useful as being able to run on a local cluster or regional grid. ABBYY
>> software tends to be a little pricey, but the results are usually very good.
> 
> If you're going to OCR, Nuance OmniPage is also very good, and I believe
> costs about the same as FineReader.  We also use tOCR, from Transym,
> which is Windows-only, but very accurate and cheap.  I have yet to see
> decent results on complicated pages (technical papers) from either
> OCRopus or Tesseract with the default models that they come with; I
> believe they're both still aimed at book page OCR.
> 
> Bill