LISTSERV 16.5 - CODE4LIB Archives

You might take a look at Tesseract [1]. On a typical Linux box:

$ tesseract input.tif outputName hocr

produces hOCR output: HTML with bounding-box coordinates for the recognized
text. You might be able to convert that output to ALTO.
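
A rough sketch of that conversion step, in Python with only the standard
library. It assumes each recognized word in the hOCR file is a span with class
"ocrx_word" whose title attribute contains "bbox x0 y0 x1 y1" in pixels; output
from your Tesseract version may differ. The names HocrWords and words_to_alto
are made up for this example. The result is a simplified ALTO-like skeleton
(every word in a single TextLine, no namespace or measurement-unit
declaration), not a schema-valid ALTO file.

```python
import re
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Assumption: the title attribute looks like "bbox 10 20 50 40; x_wconf 90".
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrWords(HTMLParser):
    """Collect (text, x0, y0, x1, y1) for every ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None   # bounding box of the word span currently open
        self._text = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        classes = (a.get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            m = BBOX_RE.search(a.get("title") or "")
            if m:
                self._bbox = tuple(int(v) for v in m.groups())
                self._text = []

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._bbox is not None:
            text = "".join(self._text).strip()
            if text:
                self.words.append((text,) + self._bbox)
            self._bbox = None


def words_to_alto(words):
    """Build a simplified ALTO-like tree with one String element per word."""
    alto = ET.Element("alto")
    layout = ET.SubElement(alto, "Layout")
    page = ET.SubElement(layout, "Page")
    space = ET.SubElement(page, "PrintSpace")
    block = ET.SubElement(space, "TextBlock")
    line = ET.SubElement(block, "TextLine")  # simplification: a single line
    for text, x0, y0, x1, y1 in words:
        ET.SubElement(
            line, "String",
            CONTENT=text,
            HPOS=str(x0), VPOS=str(y0),
            WIDTH=str(x1 - x0), HEIGHT=str(y1 - y0),
        )
    return alto


# Inline sample standing in for the contents of the hOCR output file.
sample = (
    "<span class='ocrx_word' title='bbox 10 20 50 40; x_wconf 90'>Hello</span> "
    "<span class='ocrx_word' title='bbox 60 20 110 40; x_wconf 88'>world</span>"
)
parser = HocrWords()
parser.feed(sample)
print(ET.tostring(words_to_alto(parser.words), encoding="unicode"))
```

For a real page you would feed the parser the text of the file Tesseract
wrote, and you would probably want to group words using the line and
paragraph elements in the hOCR rather than lumping everything into one
TextLine.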

Cheers,
Bridger
[1] http://code.google.com/p/tesseract-ocr/


On Thu, Sep 6, 2012 at 8:29 AM, Michael Beccaria wrote:

> I inadvertently purchased ABBYY Finereader 11 Corporate thinking that it
> would be capable of outputting to ALTO XML. I was wrong. ABBYY Finereader
> Engine does :/
>
> Ultimately, I want to OCR some newspaper images and export them to ALTO
> XML and, until the proof of concept is done, I want to try to do it on the
> cheap. My plan this morning was to write some scripts to OCR them using
> Microsoft Office Document Imaging (MODI) and then export the results to
> ALTO XML, which could be a big project. Has anyone done this before, or
> does anyone know of a quick and dirty way to get some OCR data?
> Thanks,
> Mike Beccaria
> Systems Librarian
> Paul Smith's College
> 518.327.6376
>