On 10/17/2013 9:43 AM, Eric Lease Morgan wrote:
> On Oct 16, 2013, at 10:56 AM, Robert Haschart<[log in to unmask]> wrote:
>
>> The abstract extraction routine I have been working on does use
>> tesseract internally for doing OCR when it encounters a document that
>> doesn't have usable full-text. I agree that tesseract is not that easy
>> to install, especially if (as in my case) you do not have root/sudo
>> access to the machine. Since I have gone through installing tesseract
>> quite recently, perhaps my experience can be helpful to you.
>
> Robert, can you outline the process you used to get Tesseract to do OCR against PDF documents? I installed Tesseract a few months ago, but I couldn't figure out how to get it to work against PDF, only some image files. Any pointers would be greatly appreciated. (Hmmm. Maybe Tesseract doesn't do PDF files, only image files, and I need to convert my PDFs to images, and then feed them to Tesseract.) --Eric Morgan
That's correct. I use Ghostscript to print the PDF to a series of .tiff
files, and then use tesseract to perform OCR on the individual .tiff
images, producing a .txt file for each page. Since I'm only looking to
extract the abstract, I limit Ghostscript to the first 5 pages, and
then do post-processing and various heuristics to find and fix the
abstract. One particular issue I've found is that tesseract is fond of
detecting ligatures such as "fi", "fl", "ff", "ffl", and "ffi", but doesn't
seem to be very good at selecting the correct one (at least for my data),
so one of the post-processing steps expands each ligature to individual
characters and does a dictionary look-up to help select the correct
expansion.
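
For anyone wanting to try the PDF-to-TIFF-to-text steps, here is a rough
sketch in Python. It is not the actual routine, only an illustration under
some assumptions: the `gs` and `tesseract` executables are on the PATH, the
`tiffgray` device at 300 dpi is an acceptable rendering choice, and the
function names and the `page-NNN.tiff` naming are made up for this example.

```python
import subprocess
from pathlib import Path


def ghostscript_command(pdf_path, out_dir, last_page=5, dpi=300):
    """Build the gs command that renders pages 1..last_page as TIFF files."""
    return [
        "gs", "-q", "-dNOPAUSE", "-dBATCH",
        "-sDEVICE=tiffgray",          # assumption: grayscale TIFF is good enough
        f"-r{dpi}",                   # assumption: 300 dpi
        "-dFirstPage=1",
        f"-dLastPage={last_page}",    # only the first few pages are needed
        f"-sOutputFile={Path(out_dir) / 'page-%03d.tiff'}",
        str(pdf_path),
    ]


def tesseract_command(tiff_path):
    """Build the tesseract command; it writes its text to <output base>.txt."""
    tiff_path = Path(tiff_path)
    return ["tesseract", str(tiff_path), str(tiff_path.with_suffix(""))]


def ocr_first_pages(pdf_path, out_dir, last_page=5):
    """Render the first pages of a PDF and OCR each one; return page texts."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(ghostscript_command(pdf_path, out_dir, last_page), check=True)
    texts = []
    for tiff in sorted(out_dir.glob("page-*.tiff")):
        subprocess.run(tesseract_command(tiff), check=True)
        texts.append(tiff.with_suffix(".txt").read_text(encoding="utf-8"))
    return texts
```

Calling `ocr_first_pages("paper.pdf", "ocr-out")` would leave one .tiff and
one .txt file per page in `ocr-out` and return the page texts in order.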
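
The ligature clean-up could look something like the sketch below. Again
this is only an illustration, not the actual code: it assumes the OCR text
contains the Unicode ligature characters U+FB00 through U+FB04, that
`dictionary` is a set of lowercase words you supply, and the function names
are invented for the example. Each ligature is first expanded literally; if
the result is not a known word, the other expansions are tried until one
matches.

```python
import re
from itertools import product

# Unicode ligature characters and their literal expansions.
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
}
CANDIDATES = list(LIGATURES.values())


def fix_word(word, dictionary):
    """Expand ligatures in one word, preferring an expansion found in dictionary."""
    positions = [i for i, ch in enumerate(word) if ch in LIGATURES]
    if not positions:
        return word
    # First try what the OCR engine actually reported.
    literal = "".join(LIGATURES.get(ch, ch) for ch in word)
    if literal.lower() in dictionary:
        return literal
    # Otherwise try every combination of expansions at the ligature positions.
    for combo in product(CANDIDATES, repeat=len(positions)):
        chars = list(word)
        for pos, expansion in zip(positions, combo):
            chars[pos] = expansion
        candidate = "".join(chars)
        if candidate.lower() in dictionary:
            return candidate
    # Nothing matched; fall back to the literal expansion.
    return literal


def fix_text(text, dictionary):
    """Apply fix_word to every run of letters, leaving punctuation untouched."""
    return re.sub(r"[^\W\d_]+", lambda m: fix_word(m.group(0), dictionary), text)
```

For example, with a dictionary containing "office", a word read as "o" +
the "ffl" ligature + "ce" comes back as "office". A real word list and
smarter tie-breaking between several valid expansions are left out here.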