Howdy all,

I've just started a project that involves harvesting large numbers of
scanned PDFs and extracting information from their OCR text.
The process I've started with (ImageMagick to convert each PDF to TIFF,
then Tesseract to run the OCR) is more system-intensive than I'd hoped it
would be.
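
For reference, a minimal sketch of the kind of two-step pipeline described above. It is written in Python purely for illustration (the same two commands can be shelled out from Perl). The 300 DPI density, the grayscale conversion, the file layout, and the helper names `build_commands` and `ocr_pdf` are all assumptions made for the example, not tuned or recommended settings.

```python
import subprocess
from pathlib import Path


def build_commands(pdf_path, work_dir, dpi=300):
    """Build the ImageMagick and Tesseract command lines for one PDF.

    The dpi default and the grayscale step are illustrative assumptions.
    """
    pdf = Path(pdf_path)
    work = Path(work_dir)
    tiff = work / (pdf.stem + ".tif")
    out_base = work / pdf.stem  # tesseract appends ".txt" to this base name

    convert_cmd = [
        "convert",
        "-density", str(dpi),    # rasterization resolution for the PDF
        str(pdf),
        "-colorspace", "Gray",   # drop color to shrink the intermediate TIFF
        str(tiff),
    ]
    tesseract_cmd = ["tesseract", str(tiff), str(out_base)]
    return convert_cmd, tesseract_cmd


def ocr_pdf(pdf_path, work_dir, dpi=300):
    """Run both steps and return the path of the resulting text file."""
    convert_cmd, tesseract_cmd = build_commands(pdf_path, work_dir, dpi)
    subprocess.run(convert_cmd, check=True)
    subprocess.run(tesseract_cmd, check=True)
    return Path(work_dir) / (Path(pdf_path).stem + ".txt")
```

Lowering `dpi` is the most direct knob for trading OCR quality against CPU time and intermediate file size in a setup like this.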

Is there an easier or faster process that I'm missing? Perl-friendly
solutions are preferred, because this fits in as part of a larger process.
If I'm already using my best option, what image parameters are recommended
if I want to stop at the point of diminishing returns rather than aim for
the best possible OCR quality? Thanks,

kyle