On Oct 13, 2013, at 6:21 PM, David Friggens wrote:

>>> For a limited period of time I am making publicly available a Web-based program called PDF2TXT -- http://bit.ly/1bJRyh8
>>
>> PDF2TXT extracts the text from an OCRed PDF document
>
> The file I tried was digital native (probably from Word) so perhaps
> outside your intended scope. The text output was fairly similar to
> that from pdftotext (in Ubuntu poppler-utils package), perhaps better
> in losing the arbitrary line breaks, but fell over on macrons. There
> were a lot of Māori words and the vowels with macrons disappeared -
> e.g. Pākehā => Pkeh.
>
> I assume Unicode issues were also at the heart of %3Cunknown%3E being
> one of the "most frequent verbs". The link for this [1] gives a regex
> error.

David, yes, there is a misconception that the program does OCR, and I hope to resolve that soon. (Wish me luck. The problem is not the programming, but the installation of Tesseract.) When it comes to encoding, I think I can fix that as well.

--Eric
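
For reference, the macron loss reported in this thread (Pākehā => Pkeh) is what one would see if the text were forced into ASCII and the characters that do not fit were silently dropped. The thread does not confirm that this is what PDF2TXT does, so the Python sketch below is only an illustration under that assumption. The function names ascii_drop and utf8_roundtrip are hypothetical and are not taken from the program. The last line also decodes %3Cunknown%3E, which is simply the percent-encoded form of <unknown>.

```python
from urllib.parse import unquote


def ascii_drop(text: str) -> str:
    """Force text to ASCII, silently discarding every character that does not fit.

    Assumption: this mimics the suspected bug; it is not PDF2TXT's actual code.
    """
    return text.encode("ascii", errors="ignore").decode("ascii")


def utf8_roundtrip(text: str) -> str:
    """Encode and decode as UTF-8, which can represent every Unicode character."""
    return text.encode("utf-8").decode("utf-8")


# "Pākehā" written with precomposed a-with-macron (U+0101) so the input is unambiguous.
word = "P\u0101keh\u0101"

print(ascii_drop(word))           # macron vowels are dropped
print(utf8_roundtrip(word))       # macron vowels survive
print(unquote("%3Cunknown%3E"))   # percent-decoding the reported "verb"
```

If the vowels had been stored in decomposed form (a plain "a" followed by a combining macron), the same ASCII step would keep the base letter and lose only the macron, so the exact symptom depends on how the PDF text was encoded.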