The ABBYY SDK does have a Linux option that can be driven from the command
line. I'm not sure the Linux version is as actively developed as the
Windows one, since the Linux SDK is one major version behind, but the
Linux OCR engine for that version had some improvements over the
corresponding Windows one.
Demo versions are available for download for most systems, and I don't
think it should be too hard for NYPL to get hold of sales reps and arrange
loaner set-ups.
On Nov 5, 2011 4:01 PM, "Bill Janssen" <[log in to unmask]> wrote:
> Tesseract is free, but in my experience, to make it work you usually
> have to train up a model, though the model that comes with it seems to
> be set up for scanning English book pages, so may be appropriate for
> library use.
> OCRopus, from a research group in Germany, seems more powerful than
> Tesseract, and is also freeware, but is (IMO) currently a pain to set
> up. And, again, to get good results, you often need to train a model.
> But it seems to have much more functionality than Tesseract (which may
> or may not be a good thing :-).
> If you have Microsoft Office (versions including MS Office 2003 or later
> but prior to MS Office 2010) on a Windows machine, you also have (had)
> Microsoft's OCR package, which exposes its functionality through a COM
> interface, so you can call it from other programs. See
> Similarly, Google Docs 3.0 offers free OCR via the Google Docs API.
> I've also tried TOCR, a $100-per-machine Windows-only OCR library from
> www.transym.com, a British MOD spin-off. Comes as a DLL, plus a simple
> application, and you can build your own application to use the DLL.
> Pretty accurate, for printed English text -- gives bounding boxes and
> word and character confidences.
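
For anyone who wants to try the stock English model that ships with
Tesseract before training anything, here is a minimal sketch of driving
the command-line tool from Python. It assumes a `tesseract` executable is
on the PATH; the helper names `build_tesseract_command` and `run_ocr` are
made up for illustration and are not part of Tesseract itself.

```python
import shutil
import subprocess
from pathlib import Path


def build_tesseract_command(image_path, output_base, lang="eng"):
    """Return the argv list for: tesseract <image> <output_base> -l <lang>."""
    return ["tesseract", str(image_path), str(output_base), "-l", lang]


def run_ocr(image_path, output_base, lang="eng"):
    """Run Tesseract on one image and return the recognized text.

    Tesseract writes its plain-text result to <output_base>.txt.
    """
    # Assumption: the tesseract binary is installed and on the PATH.
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    subprocess.run(
        build_tesseract_command(image_path, output_base, lang),
        check=True,  # raise CalledProcessError if tesseract fails
    )
    return Path(f"{output_base}.txt").read_text(encoding="utf-8")
```

Calling `run_ocr("page-001.png", "page-001")` (a hypothetical file name)
would leave the text in `page-001.txt` and return it as a string. How good
the result is on a given collection will depend on the scans, so treat this
as a quick way to check whether the default model is adequate.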