LISTSERV 16.5 - CODE4LIB Archives

Hi Eric,

On Thu, Oct 17, 2013 at 09:43:04AM -0400, Eric Lease Morgan wrote:
> Robert, can you outline the process you used to get Tesseract to do
> OCR against PDF documents? I installed Tesseract a few months ago,
> but I couldn't figure out how to get it to work against PDFs, only some
> image files. Any pointers would be greatly appreciated. (Hmmm. Maybe
> Tesseract doesn't do PDF files, only image files, and I need to
> convert my PDFs to images, and then feed them to Tesseract.) --Eric Morgan

Once you have Tesseract installed, the easiest way (IMHO) to add an
OCR text layer to PDF files is this Ruby script:
https://github.com/gkovacs/pdfocr
Geza Kovacs wrote it for Cuneiform and an old version of OCRopus.
I added Tesseract support later.

If you cannot use Ruby for some reason, I could upload a Bash script
that does the same thing.
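
For illustration only, here is a rough shell sketch of the general
idea Eric guesses at: convert the PDF pages to images, OCR each image,
and merge the results. It is not pdfocr, and it is not the script
offered above. It assumes that pdftoppm and pdfunite (both from
Poppler) are installed and that your Tesseract build supports the
"pdf" output config, which older versions may lack. The function name
ocr_pdf and the 300 dpi setting are made up for this example.

```shell
#!/bin/sh
# Sketch only, under the assumptions stated above; not the pdfocr script.
# Usage: ocr_pdf INPUT.pdf OUTPUT.pdf [LANG]

ocr_pdf() (
    # The body runs in a subshell, so variables and the trap stay local.
    input=$1
    output=$2
    lang=${3:-eng}

    workdir=$(mktemp -d) || exit 1
    trap 'rm -rf "$workdir"' EXIT

    # 1. Rasterize every page to a PNG at 300 dpi (an assumed resolution).
    #    pdftoppm names the files page-1.png, page-2.png, ... and zero-pads
    #    the numbers, so the glob below yields them in page order.
    pdftoppm -r 300 -png "$input" "$workdir/page" || exit 1

    # 2. OCR each page image. The "pdf" config makes Tesseract write
    #    <outputbase>.pdf containing the image plus an invisible text layer.
    set --
    for page in "$workdir"/page-*.png; do
        [ -e "$page" ] || continue
        tesseract "$page" "${page%.png}" -l "$lang" pdf || exit 1
        set -- "$@" "${page%.png}.pdf"
    done

    # 3. Merge the per-page PDFs into one file.
    case $# in
        0) echo "no pages produced from $input" >&2; exit 1 ;;
        1) cp "$1" "$output" ;;
        *) pdfunite "$@" "$output" ;;
    esac
)
```

Example call: ocr_pdf scan.pdf scan-ocr.pdf deu
The temporary page images are removed when the function returns.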

Cheers,
Christian

-- 
  Christian Pietsch · http://purl.org/net/pietsch
  LibTec · Library Technology and Knowledge Management
  Bielefeld University Library, Bielefeld, Germany