I put together some patches on github for determining the coordinates of 
bounding boxes with Tesseract [1]; that is an extra feature of ABBYY which 
is invaluable for activities like highlighting search terms on the 
original image. For many materials, I think Tesseract is a serious rival 
to ABBYY in accuracy; one of the big factors seems to be how much 
contrast can be introduced into the source image to separate the 
characters from the background. ABBYY has impressive options for enlisting 
multiple machines to process large quantities of scanned images, but that 
path is fairly pricey and it is a very Windows-centric solution. Tesseract 
can fit into a Hadoop framework, which would be one approach for large 
quantities of material and is more platform independent. ABBYY will 
probably come close to delivering the best OCR can offer straight out of 
the box, but Tesseract is worth the extra hoops if you have a steady 
stream of incoming material, especially if the material is going straight 
from the page to the scanner and does not represent the "image of an 
image" problem encountered with things like scans of microfilm reels.
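For anyone who wants to experiment with the general idea without the 
patches in [1], here is a minimal sketch, assuming the pytesseract and 
Pillow Python bindings rather than the code in the repository: it boosts 
contrast with a simple threshold and then asks Tesseract for word-level 
bounding boxes. The file name and threshold value are placeholders.

    # Sketch: binarize a scan for contrast, then pull word bounding boxes
    # out of Tesseract. Assumes pytesseract and Pillow are installed;
    # this is an illustration, not the code from [1].
    from PIL import Image, ImageOps
    import pytesseract

    def ocr_with_boxes(path, threshold=160):
        # Grayscale and threshold so the characters stand out from the
        # background; the cutoff would need tuning per collection.
        img = ImageOps.grayscale(Image.open(path))
        img = img.point(lambda px: 255 if px > threshold else 0)

        # image_to_data returns, for each recognized word, its text plus
        # the left/top/width/height of its box on the page image.
        data = pytesseract.image_to_data(img,
                                         output_type=pytesseract.Output.DICT)
        boxes = []
        for i, word in enumerate(data["text"]):
            if word.strip():
                boxes.append((word, data["left"][i], data["top"][i],
                              data["width"][i], data["height"][i]))
        return boxes

    if __name__ == "__main__":
        for word, left, top, width, height in ocr_with_boxes("page.png"):
            print(word, left, top, width, height)

Those coordinates are what make things like highlighting search terms on 
the original image possible.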

art
---
1. https://github.com/artunit/ossocr