LISTSERV 16.5 - CODE4LIB Archives

On 10/17/2013 9:43 AM, Eric Lease Morgan wrote:
> On Oct 16, 2013, at 10:56 AM, Robert Haschart wrote:
>
>> The abstract extraction routine I have been working on does use
>> tesseract internally for doing OCR when it encounters a document that
>> doesn't have usable full-text.  I agree that tesseract is not that easy
>> to install, especially if (as in my case) you do not have root/sudo
>> access to the machine.  Since I have gone through installing tesseract
>> quite recently, perhaps my experience can be helpful to you.
>
> Robert, can you outline the process you used to get Tesseract to do OCR against PDF documents? I installed Tesseract a few months ago, but I couldn't figure out how to get it to work against PDF, only some image files. Any pointers would be greatly appreciated. (Hmmm. Maybe Tesseract doesn't do PDF files, only image files, and I need to convert my PDFs to images, and then feed them to Tesseract.) --Eric Morgan
That's correct.  I use Ghostscript to print the PDF to a series of .tiff
files, and then use tesseract to perform OCR on the individual .tiff
images, producing a .txt file for each page.  Since I'm only looking to
extract the abstract, I limit Ghostscript to the first 5 pages, and
then do post-processing and apply various heuristics to find and fix the
abstract.  One particular issue I've found is that tesseract is fond of
detecting ligatures such as "fi", "fl", "ff", "ffl", and "ffi" but doesn't
seem to be very good at selecting the correct one (at least for my data),
so one of the post-processing steps is to expand each ligature to individual
characters and do a dictionary look-up to help select the correct expansion.
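
For anyone who wants to try the same approach, here is a rough Python
sketch of the steps described above. It is not the actual code from the
abstract extraction routine. The function names, the tiffgray device, the
300 dpi resolution, the output file naming, and the tiny example dictionary
are all assumptions made for illustration, and it assumes the ligatures
show up in the OCR text as the Unicode presentation-form characters
U+FB00 through U+FB04.

```python
import itertools
import re
import subprocess
from pathlib import Path

# Assumption: tesseract emits ligatures as Unicode presentation forms.
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
}
EXPANSIONS = sorted(set(LIGATURES.values()))


def gs_command(pdf, out_dir, last_page=5, dpi=300):
    """Build a Ghostscript command that renders the first pages to TIFF."""
    pattern = Path(out_dir) / "page-%03d.tif"
    return [
        "gs", "-q", "-dNOPAUSE", "-dBATCH",
        "-sDEVICE=tiffgray", f"-r{dpi}",
        "-dFirstPage=1", f"-dLastPage={last_page}",
        f"-sOutputFile={pattern}", str(pdf),
    ]


def tesseract_command(tiff):
    """Build a tesseract command; it writes <output base>.txt next to the image."""
    tiff = Path(tiff)
    return ["tesseract", str(tiff), str(tiff.with_suffix(""))]


def ocr_first_pages(pdf, out_dir, last_page=5):
    """Run both tools (they must be on PATH) and return the per-page text."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(gs_command(pdf, out_dir, last_page), check=True)
    pages = []
    for tiff in sorted(Path(out_dir).glob("page-*.tif")):
        subprocess.run(tesseract_command(tiff), check=True)
        pages.append(tiff.with_suffix(".txt").read_text(encoding="utf-8"))
    return pages


def fix_word(word, dictionary):
    """Expand ligatures in one word, preferring an expansion found in the dictionary."""
    slots = [ch for ch in word if ch in LIGATURES]
    if not slots:
        return word
    # First try the expansion tesseract actually chose.
    default = "".join(LIGATURES.get(ch, ch) for ch in word)
    if default.lower() in dictionary:
        return default
    # Otherwise try every other combination of expansions.
    for combo in itertools.product(EXPANSIONS, repeat=len(slots)):
        choice = iter(combo)
        candidate = "".join(
            next(choice) if ch in LIGATURES else ch for ch in word
        )
        if candidate.lower() in dictionary:
            return candidate
    # Nothing matched; fall back to the literal expansion.
    return default


def fix_text(text, dictionary):
    """Apply fix_word to every word in a block of OCR text."""
    return re.sub(r"\w+", lambda m: fix_word(m.group(0), dictionary), text)
```

With a real word list loaded into a set of lowercase words, something like
fix_text(page, words) could be applied to each string returned by
ocr_first_pages("paper.pdf", "out") before hunting for the abstract. Words
that match no dictionary entry are left with the expansion tesseract picked.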