I put together some patches for Tesseract, available on GitHub [1], for
determining the coordinates of bounding boxes. ABBYY provides these
coordinates as an extra feature, and they are invaluable for activities
like highlighting search terms on the original image. For many materials,
I think Tesseract is a serious rival
to ABBYY for accuracy. One of the big factors seems to be how much
contrast can be introduced into the source image to separate the
characters from the background. ABBYY has impressive options for enlisting
multiple machines for large quantities of scanned images, but that path is
fairly pricey and it is a very Windows-centric solution. Tesseract can fit
into a Hadoop framework, which would be one approach for large quantities
of materials and is more platform independent. ABBYY will probably come
close to delivering the best that OCR can offer straight out of the box,
but Tesseract is worth the extra hoops if you have a steady stream of
incoming material, especially if the material goes straight from the page
to the scanner rather than being an "image of an image", as is the case
with scans of microfilm reels.
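
To give a rough idea of why the bounding box coordinates matter for
highlighting, here is a minimal Python sketch. It assumes the OCR step has
already produced one box per word in pixel coordinates of the scanned
image. The WordBox type and the highlight_regions function are made-up
names for illustration only; they are not the output format of the patches
in [1] or of ABBYY.

```python
from typing import List, NamedTuple, Tuple


class WordBox(NamedTuple):
    """One recognized word and its pixel box on the scan (hypothetical layout)."""
    text: str
    left: int
    top: int
    right: int
    bottom: int


def highlight_regions(
    words: List[WordBox],
    term: str,
    scan_size: Tuple[int, int],
    display_size: Tuple[int, int],
) -> List[Tuple[int, int, int, int]]:
    """Return (left, top, right, bottom) boxes in display pixels for each match."""
    # Scale factors from the full-resolution scan to the image shown to the user.
    sx = display_size[0] / scan_size[0]
    sy = display_size[1] / scan_size[1]
    wanted = term.casefold()
    regions = []
    for w in words:
        # Compare case-insensitively, ignoring punctuation attached to the word.
        if w.text.strip(".,;:!?\"'()").casefold() == wanted:
            regions.append((
                round(w.left * sx),
                round(w.top * sy),
                round(w.right * sx),
                round(w.bottom * sy),
            ))
    return regions


# Example: a 2000x3000 scan displayed at 1000x1500.
words = [
    WordBox("The", 100, 200, 180, 240),
    WordBox("Scanner,", 200, 200, 400, 240),
    WordBox("scanner", 100, 300, 300, 340),
]
boxes = highlight_regions(words, "scanner", (2000, 3000), (1000, 1500))
```

Each returned box can then be drawn as a translucent rectangle over the
displayed page image. Without per-word coordinates from the OCR engine
there is nothing to scale, which is why the feature is worth patching in.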
art
---
1. https://github.com/artunit/ossocr