On 2014-12-09 14:25, Kyle Banerjee wrote:
> Howdy all,
>
> I've just started a project that involves harvesting large numbers of
> scanned PDF's and extracting information from the text from the OCR output.
> The process I've started with -- use imagemagick to convert to tiff and
> tesseract to pull out the OCR -- is more system intensive than I hoped it
> would be.
>

I asked around the office, and the process seems sensible overall.

One suggestion was to use pdfimages instead of imagemagick, as that should be faster. However, I would guess that most of the processing time is actually spent in tesseract, so I don't know how much this suggestion will improve the overall performance.

Regards.

-- 
Mads Villadsen
Statsbiblioteket
IT developer
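
To check whether extraction or OCR dominates the runtime, the two stages can be timed separately. The following is only a rough sketch, not a tested pipeline: it assumes pdfimages (from poppler) and tesseract are on the PATH, that the installed pdfimages supports the -tiff option, and that it names its output files after the given prefix. The function names and the work directory layout are made up for illustration.

```python
import subprocess
import time
from pathlib import Path


def pdfimages_cmd(pdf, prefix):
    # -tiff asks pdfimages for TIFF output (assumes a poppler build that supports it).
    return ["pdfimages", "-tiff", str(pdf), str(prefix)]


def tesseract_cmd(image, out_base):
    # tesseract appends ".txt" to the output base name itself.
    return ["tesseract", str(image), str(out_base)]


def ocr_pdf(pdf, workdir):
    """Extract the embedded images, OCR them, and return (extract_seconds, ocr_seconds)."""
    workdir = Path(workdir)
    workdir.mkdir(parents=True, exist_ok=True)
    stem = Path(pdf).stem
    prefix = workdir / stem

    start = time.perf_counter()
    subprocess.run(pdfimages_cmd(pdf, prefix), check=True)
    extract_seconds = time.perf_counter() - start

    # Collect the extracted images before OCR starts writing .txt files next to them.
    images = sorted(
        p for p in workdir.iterdir()
        if p.name.startswith(stem + "-") and p.suffix in (".tif", ".tiff")
    )

    start = time.perf_counter()
    for image in images:
        subprocess.run(tesseract_cmd(image, image.with_suffix("")), check=True)
    ocr_seconds = time.perf_counter() - start

    return extract_seconds, ocr_seconds
```

Calling ocr_pdf("scan.pdf", "work") on a few representative files and comparing the two numbers should show whether swapping imagemagick for pdfimages is worth the effort, or whether tesseract is the real bottleneck.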