LISTSERV 16.5 - CODE4LIB Archives

It depends on the language. A few years ago I tested many packages
on old Roman-script languages, mainly English, French, Dutch, and German.
In terms of accuracy, ABBYY was the best.

Karim Boughida


On Sat, Nov 5, 2011 at 5:08 PM, Art W Rhyno wrote:
> I put together some patches on GitHub for determining the coordinates of
> bounding boxes with Tesseract [1]. That's an extra feature of ABBYY which
> is invaluable for activities like highlighting search terms on the
> original image. For many materials, I think Tesseract is a serious rival
> to ABBYY for accuracy; one of the big factors seems to be how much
> contrast can be introduced into the source image to separate the
> characters from the background. ABBYY has impressive options for enlisting
> multiple machines for large quantities of scanned images, but that path is
> fairly pricey and it is a very Windows-centric solution. Tesseract can fit
> into a Hadoop framework, which would be one approach for large quantities
> of materials and is more platform independent. ABBYY will probably come
> close to delivering the best OCR can offer straight out of the box, but
> Tesseract is worth the extra hoops if you have a steady stream of incoming
> material, especially if the material is going straight from the page to
> the scanner and does not represent the "image of an image" situation
> found with things like scans of microfilm reels.
>
> art
> ---
> 1. https://github.com/artunit/ossocr
>
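
To make the bounding-box point concrete, here is a rough sketch of how
word coordinates could drive search-term highlighting. It is not the code
from the patches in [1]. It assumes the OCR step emits hOCR, where each
word is a span of class "ocrx_word" whose title attribute carries
"bbox x0 y0 x1 y1". The names WordBoxParser, find_term_boxes, and
SAMPLE_HOCR, and the sample markup itself, are made up for illustration.

```python
import re
from html.parser import HTMLParser

# hOCR stores pixel coordinates in the title attribute: "bbox x0 y0 x1 y1".
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class WordBoxParser(HTMLParser):
    """Collect (word, (x0, y0, x1, y1)) pairs from hOCR ocrx_word spans."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None   # bbox of the word span currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "span" and "ocrx_word" in (attrs.get("class") or ""):
            match = BBOX_RE.search(attrs.get("title") or "")
            if match:
                self._bbox = tuple(int(g) for g in match.groups())
                self._text = []

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        # Assumption: word spans do not contain nested spans.
        if tag == "span" and self._bbox is not None:
            word = "".join(self._text).strip()
            if word:
                self.words.append((word, self._bbox))
            self._bbox = None


def find_term_boxes(hocr, term):
    """Return the boxes of words matching term, ignoring case and punctuation."""
    parser = WordBoxParser()
    parser.feed(hocr)
    parser.close()
    needle = term.casefold()
    return [
        box
        for word, box in parser.words
        if word.strip(".,;:!?\"'()").casefold() == needle
    ]


# Hypothetical hOCR fragment, invented for this example.
SAMPLE_HOCR = """
<span class='ocr_line' title='bbox 30 90 400 120'>
<span class='ocrx_word' id='word_1' title='bbox 36 92 96 116; x_wconf 90'>Tesseract</span>
<span class='ocrx_word' id='word_2' title='bbox 104 92 150 116; x_wconf 88'>and</span>
<span class='ocrx_word' id='word_3' title='bbox 158 92 230 116; x_wconf 91'>ABBYY.</span>
</span>
"""

if __name__ == "__main__":
    print(find_term_boxes(SAMPLE_HOCR, "abbyy"))
```

The returned rectangles are in the pixel space of the scanned image, so a
viewer would still need to scale them to whatever size the page image is
displayed at before drawing the highlight.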


