And beyond Tesseract is Ocropus (http://code.google.com/p/ocropus/),
which uses Tesseract (and eventually other OCR engines) to generate
positional OCR in an HTML format. I wonder if you could process that
HTML slightly to put the TIFF in the background, then use an HTML to PDF
tool to generate your final PDF. Or something like that. Googling
"ocropus pdf" finds a few projects and discussions that might be
helpful.
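
A minimal sketch of the "put the image in the background" step, assuming
the HTML output wraps each page in a div with class "ocr_page" (the hOCR
convention). The function name and the file name "page1.png" are made up
for illustration, and the TIFF may need converting to a format your HTML
to PDF tool can render. It also assumes the page div has no style
attribute already.

```python
import re
from html import escape

# Matches the opening tag of a page container, e.g.
# <div class="ocr_page" id="page_1" title="bbox 0 0 100 100">
# Assumption: the class attribute is exactly "ocr_page".
_PAGE_TAG = re.compile(r"""<div\b[^>]*\bclass=["']ocr_page["'][^>]*>""")


def add_page_background(hocr_html, image_url):
    """Return hocr_html with image_url set as the CSS background of each page div."""
    style = ' style="background-image: url(\'{}\'); background-repeat: no-repeat;"'.format(
        escape(image_url, quote=True)
    )

    def inject(match):
        tag = match.group(0)
        # Drop the closing ">" and re-add it after the new style attribute.
        return tag[:-1] + style + ">"

    return _PAGE_TAG.sub(inject, hocr_html)


if __name__ == "__main__":
    sample = (
        '<div class="ocr_page" id="page_1" title="bbox 0 0 100 100">'
        '<span class="ocr_line">Hello</span></div>'
    )
    print(add_page_background(sample, "page1.png"))
```

The result could then be handed to whatever HTML to PDF tool you prefer.
Getting the text to line up with the image would still depend on that
tool honoring the positional information in the HTML.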

Peter 

> -----Original Message-----
> From: Code for Libraries [mailto:[log in to unmask]] On 
> Behalf Of Bridger Dyson-Smith
> Sent: Friday, October 17, 2008 6:56 AM
> To: [log in to unmask]
> Subject: Re: [CODE4LIB] OCR PDFs
> 
> If you haven't already, take a look at Tesseract
> (http://code.google.com/p/tesseract-ocr/). There's some
> discussion of using Tesseract and shell scripting to work
> with TIFFs to PDFs to OCR'd text, which isn't exactly what
> you're wanting to do, I know, but may prove helpful
> (http://www.groklaw.net/articlebasic.php?story=20061210115516438).
> Cheers!
> Bridger Dyson-Smith
> 
> 
> On Fri, Oct 17, 2008 at 8:28 AM, Terry Harrison 
> <[log in to unmask]> wrote:
> 
> > You might want to look at ABBYY Fine Reader 9.0 Professional, which
> > can be driven from the command line. Fine Reader is used at the
> > Library of Congress. Here is an info link to get you started
> > (search "command"):
> >
> > http://www.scanstore.com/Scanning/Document_Imaging/Software/OCR_Software/Nuance/omnipage_review.asp
> >
> > Regards,
> > Terry
> >
> > ------------------------------------
> > Terry Harrison
> > Project Manager
> > CACI
> > 5505 Robin Hood Road, Suite F
> > Norfolk, Va. 23508
> > Ph: 757.321.9120 x232
> > Fax: 757.321.8797
> > [log in to unmask]
> >
> 
>