> For a limited period of time I am making publicly available a Web-based program called PDF2TXT -- http://bit.ly/1bJRyh8

Looks very good, and thanks for sharing it. (It's certainly not the first piece of software called pdf2txt, but that probably doesn't matter.)

> PDF2TXT extracts the text from an OCRed PDF document

The file I tried was digital native (probably from Word) so perhaps outside your intended scope. The text output was fairly similar to that from pdftotext (in Ubuntu poppler-utils package), perhaps better in losing the arbitrary line breaks, but fell over on macrons. There were a lot of Māori words and the vowels with macrons disappeared - e.g. Pākehā => Pkeh.

I assume Unicode issues were also at the heart of %3Cunknown%3E being one of the "most frequent verbs". The link for this [1] gives a regex error.

Cheers
David

[1] http://dh.crc.nd.edu/sandbox/pdf2txt/pdf2txt.cgi?cmd=verbs&id=1381700598&lemma=%3Cunknown%3E
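
P.S. In case it helps with debugging, here is a small Python sketch that reproduces the symptom I saw. It only illustrates one possible cause (text being forced into ASCII with unencodable characters dropped). I have not seen the PDF2TXT source, so this is a guess, not a diagnosis. The variable names are mine.

```python
import unicodedata
from urllib.parse import unquote

# "Pākehā", written with escapes so the example does not depend on file encoding.
word = "P\u0101keh\u0101"

# Assumed failure mode: encode to ASCII and silently drop what does not fit.
# The macron vowels vanish entirely, matching the output I observed.
lossy = word.encode("ascii", errors="ignore").decode("ascii")

# A gentler fallback if ASCII really is required: decompose each character
# (NFD), then remove only the combining marks so the base vowels survive.
decomposed = unicodedata.normalize("NFD", word)
stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# The lemma in the link is just a percent-encoded placeholder.
lemma = unquote("%3Cunknown%3E")

print(lossy)     # Pkeh
print(stripped)  # Pakeha
print(lemma)     # <unknown>
```

Keeping the text as UTF-8 end to end would of course be better than either variant, since the macrons carry meaning in Māori.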