Tesseract is free, but in my experience you usually have to train a
model to get it to work well.  That said, the model that comes with it
seems to be set up for scanning English book pages, so it may be
appropriate for library use as is.
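
For anyone who wants to try the stock English model before training
anything, here is a minimal Python sketch that shells out to the
tesseract command-line program.  It assumes tesseract is installed and
on your PATH, and that your version accepts "stdout" as the output base
(older versions only write to a .txt file).  The function names and the
image path are made up for illustration.

```python
import subprocess
import sys


def build_tesseract_command(image_path, lang="eng"):
    """Return the argument list for a Tesseract run that prints to stdout."""
    # Usage is: tesseract <image> <outputbase> [-l <lang>]
    # Passing "stdout" as the output base prints the recognized text
    # instead of writing <outputbase>.txt.
    return ["tesseract", image_path, "stdout", "-l", lang]


def ocr_image(image_path, lang="eng"):
    """Run Tesseract on one image and return the recognized text."""
    result = subprocess.run(
        build_tesseract_command(image_path, lang),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if tesseract fails
    )
    return result.stdout


if __name__ == "__main__" and len(sys.argv) > 1:
    # Example: python ocr_sketch.py scanned_page.png
    print(ocr_image(sys.argv[1]))
```

If the output is poor on your material, that is the point where
training a model comes in.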

OCRopus, from a research group in Germany, seems more powerful than
Tesseract, and is also free (open source), but is (IMO) currently a
pain to set up.  And, again, to get good results, you often need to
train a model.  It does seem to have much more functionality than
Tesseract (which may or may not be a good thing :-).

If you have Microsoft Office on a Windows machine (Office 2003 or
later, but earlier than Office 2010), you also have (or had)
Microsoft's OCR package, which exposes its functionality through a COM
interface, so you can call it from other programs.  See
http://msdn.microsoft.com/en-us/library/aa202819(v=office.11).aspx
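
As a rough idea of what calling it over COM looks like, here is a
Python sketch.  It assumes the package in question is Microsoft Office
Document Imaging (MODI), registered under the ProgID "MODI.Document",
and that the pywin32 package is installed; it will only run on a
Windows machine that still has MODI.  The function name is made up, and
I haven't checked this against every Office version, so treat it as a
starting point rather than a reference.

```python
MI_LANG_ENGLISH = 9  # numeric value of MODI's miLANG_ENGLISH constant


def modi_ocr(image_path):
    """OCR one image file (e.g. a TIFF) with MODI and return its text."""
    # pywin32 is imported here so this file still loads on machines
    # without it; the call itself needs Windows plus MODI.
    import win32com.client

    doc = win32com.client.Dispatch("MODI.Document")
    doc.Create(image_path)
    try:
        # Arguments: language, auto-orient the image, straighten the image.
        doc.OCR(MI_LANG_ENGLISH, True, True)
        # Each page image carries a Layout object holding the recognized text.
        return "\n".join(image.Layout.Text for image in doc.Images)
    finally:
        doc.Close(False)  # close without saving changes to the file
```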

Similarly, Google Docs offers free OCR via version 3.0 of the Google
Docs API.

I've also tried TOCR, a $100-per-machine Windows-only OCR library from
www.transym.com, a British MOD spin-off.  It comes as a DLL plus a
simple application, and you can build your own application on top of
the DLL.  It's pretty accurate for printed English text, and gives
bounding boxes along with word and character confidences.

Bill