LISTSERV 16.5 - CODE4LIB Archives

On Oct 16, 2013, at 10:56 AM, Robert Haschart wrote:

> The abstract extraction routine I have been working on does use 
> tesseract internally for doing OCR when it encounters a document that 
> doesn't have usable full-text.  I agree that tesseract is not that easy 
> to install, especially if (as in my case) you do not have root/sudo 
> access to the machine.  Since I have gone through installing tesseract 
> quite recently, perhaps my experience can be helpful to you.


Robert, can you outline the process you used to get Tesseract to do OCR against PDF documents? I installed Tesseract a few months ago, but I couldn't figure out how to get it to work against PDFs, only some image files. Any pointers would be greatly appreciated. (Hmmm. Maybe Tesseract doesn't do PDF files, only image files, and I need to convert my PDFs to images, and then feed them to Tesseract.) --Eric Morgan
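
For what it's worth, the convert-then-OCR idea in the parenthetical can be sketched roughly as follows. This is only a minimal sketch, not Robert's actual routine: it assumes the `pdftoppm` tool (from Poppler) and the `tesseract` command are both installed and on the PATH, and the helper names (`pdftoppm_command`, `tesseract_command`, `page_number`, `ocr_pdf`) are made up for illustration.

```python
import subprocess
import tempfile
from pathlib import Path


def pdftoppm_command(pdf_path, prefix, dpi=300):
    # Assumption: Poppler's pdftoppm is installed. It renders each PDF page
    # to an image named "<prefix>-<page number>.png".
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf_path), str(prefix)]


def tesseract_command(image_path):
    # Passing "stdout" as the output base makes tesseract print the
    # recognized text instead of writing a .txt file.
    return ["tesseract", str(image_path), "stdout"]


def page_number(image_path):
    # Page images are named "<prefix>-<n>.png", possibly zero-padded,
    # so sort on the numeric suffix rather than on the raw file name.
    return int(Path(image_path).stem.rsplit("-", 1)[1])


def ocr_pdf(pdf_path, dpi=300):
    # Hypothetical helper: render every page to PNG, OCR each page in
    # order, and return the combined text.
    with tempfile.TemporaryDirectory() as tmp:
        prefix = Path(tmp) / "page"
        subprocess.run(pdftoppm_command(pdf_path, prefix, dpi), check=True)
        pages = sorted(Path(tmp).glob("page-*.png"), key=page_number)
        texts = []
        for image in pages:
            result = subprocess.run(
                tesseract_command(image),
                check=True,
                capture_output=True,
                text=True,
            )
            texts.append(result.stdout)
    return "\n".join(texts)
```

Calling `ocr_pdf("scan.pdf")` (the file name is a placeholder) would return the recognized text of all pages as one string. Rendering at around 300 dpi is a common starting point for OCR, but the right value depends on the scans.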