We also used the MS Office OCR engine to OCR student newspapers published
between 1914 and 2006. We ran the OCR overnight using a batch program. It
took a few weeks, but it worked well.
Sarah
-----Original Message-----
From: Code for Libraries [mailto:[log in to unmask]] On Behalf Of
Bill Janssen
Sent: Monday, August 02, 2010 2:09 PM
To: [log in to unmask]
Subject: Re: [CODE4LIB] Free/Open OCR solutions?
Michael Beccaria <[log in to unmask]> wrote:
> Andrew,
> If you have MS Office, Microsoft has an OCR engine built in. I used it
> to OCR some college yearbooks at MPOW. It's not ABBYY but it works
> pretty well! It's scriptable using VBScript or your MS language of
> choice.
>
> http://msdn.microsoft.com/en-us/library/aa167607(office.11).aspx
> Notice the "OCR" method in the document.
>
> I can send you the scripts I have (they're short and simple) if you're
> interested in some working code. Let me know.
> Mike
Yes, I second that, it works pretty well. UpLib uses that by default
when you install on Windows. In fact, if you install one of the older
UpLib releases on a machine running Windows and having a recent Office,
it will create a Windows service that's a network OCR server callable
via HTTP from other machines. Even if you don't use UpLib for anything
else, you can still use the installed OCR service, though the output
format is somewhat UpLib-specific. Perhaps an "uplib-ocr-document"
command-line tool with hOCR output would be a good add to UpLib. It's a
one-liner in a UNIX environment:
  cat `uplib-add-document --verbosity=0 --ocr --noupload foo.pdf | awk '{ print $2; }'`/contents.txt | tail -n +3
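For readability, the same pipeline can be unrolled into a small shell function. This is only a sketch: the function name uplib_ocr_text is hypothetical, and it assumes, as the awk step above implies, that uplib-add-document prints a line whose second field is the document folder, and that the first two lines of contents.txt are header lines to be skipped.

```shell
#!/bin/sh
# Hypothetical wrapper around the one-liner above; the name is an assumption.
uplib_ocr_text() {
    pdf="$1"
    # OCR the file locally without uploading it. The second field of the
    # output is assumed to be the document folder (as in the awk step above).
    folder=$(uplib-add-document --verbosity=0 --ocr --noupload "$pdf" | awk '{ print $2; }')
    if [ -z "$folder" ]; then
        echo "no document folder reported for $pdf" >&2
        return 1
    fi
    # Print the extracted text starting at line 3, as tail -n +3 does above.
    tail -n +3 "$folder/contents.txt"
}
```

Calling uplib_ocr_text foo.pdf should then print the same text as the one-liner, with an error message instead of a confusing cat failure if no folder comes back.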
My longer-term plan with UpLib is to move to OCRopus when it's out of
alpha.
Bill