On 10/15/2013 12:25 PM, Eric Lease Morgan wrote:
> On Oct 14, 2013, at 4:49 PM, Robert Haschart wrote:
>
>>> For a limited period of time I am making publicly available a Web-based program called PDF2TXT -- http://bit.ly/1bJRyh8
>> Although based on some subsequent messages where you mention tesseract
>> maybe I misunderstood and your tool only handles pdfs that have already
>> been OCR'ed which would explain why the second document (which only
>> contains page images) fails.
> Robert, that's correct. As of right now the document needs to have been previously OCRed. --Eric

The abstract extraction routine I have been working on does use tesseract internally for doing OCR when it encounters a document that doesn't have usable full-text. I agree that tesseract is not that easy to install, especially if (as in my case) you do not have root/sudo access to the machine. Since I have gone through installing tesseract quite recently, perhaps my experience can be helpful to you.

-Bob Haschart
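
For anyone who wants to try the same fallback idea (use the embedded text when it is there, run OCR only when it is not), here is a rough Python sketch. It is not the abstract extraction routine mentioned above. It assumes the Poppler tools pdftotext and pdftoppm are on the PATH alongside tesseract, and the function names and the 20-word threshold are made up for illustration.

```python
import subprocess
import tempfile
from pathlib import Path


def has_usable_text(text, min_words=20):
    """Heuristic (an assumption): the text layer counts as usable
    if it contains at least min_words tokens with a letter in them."""
    words = [w for w in text.split() if any(c.isalpha() for c in w)]
    return len(words) >= min_words


def embedded_text(pdf_path):
    """Read the embedded text layer with Poppler's pdftotext.
    The '-' argument sends the text to stdout instead of a file."""
    result = subprocess.run(
        ["pdftotext", str(pdf_path), "-"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def ocr_text(pdf_path, dpi=300):
    """Render each page to PNG with pdftoppm, then OCR each image
    with the tesseract command-line tool."""
    pages = []
    with tempfile.TemporaryDirectory() as tmp:
        prefix = Path(tmp) / "page"
        subprocess.run(
            ["pdftoppm", "-r", str(dpi), "-png", str(pdf_path), str(prefix)],
            check=True,
        )
        # pdftoppm zero-pads page numbers, so a plain sort keeps page order.
        for image in sorted(Path(tmp).glob("page*.png")):
            result = subprocess.run(
                ["tesseract", str(image), "stdout"],
                capture_output=True, text=True, check=True,
            )
            pages.append(result.stdout)
    return "\n".join(pages)


def extract_text(pdf_path, read_embedded=embedded_text, run_ocr=ocr_text):
    """Prefer the embedded text; fall back to OCR when it is not usable."""
    text = read_embedded(pdf_path)
    if has_usable_text(text):
        return text
    return run_ocr(pdf_path)
```

Calling extract_text("scan.pdf") on a document that contains only page images would take the OCR branch. The two helper arguments exist so the fallback logic can be exercised without any of the external tools installed.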