LISTSERV 16.5 - CODE4LIB Archives

Michael Beccaria wrote:

> Andrew, 
> If you have MS Office, Microsoft has an OCR engine built in. I used it
> to OCR some college yearbooks at MPOW. It's not ABBYY but it works
> pretty well! It's scriptable using VBScript or your MS language of
> choice.
> 
> http://msdn.microsoft.com/en-us/library/aa167607(office.11).aspx
> Notice the "OCR" method in the document.
> 
> I can send you the scripts I have (they're short and simple) if you're
> interested in some working code. Let me know.
> Mike

Yes, I second that; it works pretty well.  UpLib uses it by default
when you install on Windows.  In fact, if you install one of the older
UpLib releases on a Windows machine with a recent version of Office,
it will create a Windows service that acts as a network OCR server,
callable via HTTP from other machines.  Even if you don't use UpLib for
anything else, you can still use the installed OCR service, though the
output format is somewhat UpLib-specific.  Perhaps an "uplib-ocr-document"
command-line tool with hOCR output would be a good addition to UpLib.
Getting the plain OCR text is already a one-liner in a UNIX environment:

  cat `uplib-add-document --verbosity=0 --ocr --noupload foo.pdf  | awk '{ print $2; }'`/contents.txt | tail -n +3
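If the one-liner is hard to read, here is the same pipeline split into
steps, as a rough sketch.  The function name uplib_ocr_text is made up
for illustration.  It assumes, as the awk step implies, that
uplib-add-document prints a line whose second field is the directory
holding the processed document, and that the first two lines of
contents.txt are header lines to be skipped.

```sh
#!/bin/sh
# Sketch only: uplib_ocr_text is a hypothetical helper, not part of UpLib.
# It uses exactly the flags from the one-liner above.

uplib_ocr_text() {
    # Run OCR without uploading; field 2 of the output is assumed to be
    # the directory where UpLib left the processed document.
    doc_dir=$(uplib-add-document --verbosity=0 --ocr --noupload "$1" | awk '{ print $2; }')

    # Stop if no directory was reported.
    [ -n "$doc_dir" ] || return 1

    # Print the extracted text, skipping the first two lines as the
    # original one-liner does.
    tail -n +3 "$doc_dir/contents.txt"
}
```

Something like "uplib_ocr_text foo.pdf > foo.txt" would then save the
text to a file.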

My longer-term plan with UpLib is to move to OCRopus when it's out of
alpha.

Bill