LISTSERV 16.5 - CODE4LIB Archives

On Oct 13, 2013, at 6:21 PM, David Friggens wrote:

>>> For a limited period of time I am making publicly available a Web-based program called PDF2TXT -- http://bit.ly/1bJRyh8
>> 
>> PDF2TXT extracts the text from an OCRed PDF document
> 
> The file I tried was digital native (probably from Word) so perhaps
> outside your intended scope. The text output was fairly similar to
> that from pdftotext (in Ubuntu poppler-utils package), perhaps better
> in losing the arbitrary line breaks, but fell over on macrons. There
> were a lot of Māori words and the vowels with macrons disappeared -
> e.g. Pākehā => Pkeh.
> 
> I assume Unicode issues were also at the heart of %3Cunknown%3E being
> one of the "most frequent verbs".  The link for this [1] gives a regex
> error.


David, yes, there is a misconception that the program does OCR, and I hope to resolve that soon. (Wish me luck. The problem is not the programming, but the installation of Tesseract.) When it comes to encoding, I think I can fix that as well. --Eric
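
For anyone curious about the macron symptom David reported, here is a minimal Python sketch of one way it can arise. It is only a guess at the cause: how PDF2TXT actually handles encoding is not stated in this thread, and the function names below (drop_non_ascii, fold_to_ascii, decode_token) are hypothetical, made up for illustration. The sketch shows that forcing text to ASCII and ignoring errors reproduces "Pkeh", that decomposing the characters first at least keeps the base vowels, and that %3Cunknown%3E is simply the percent-encoded form of <unknown>.

```python
import unicodedata
from urllib.parse import unquote


def drop_non_ascii(text):
    # Hypothetical lossy step: encode to ASCII and silently discard
    # anything that does not fit. Vowels with macrons vanish entirely.
    return text.encode("ascii", "ignore").decode("ascii")


def fold_to_ascii(text):
    # Decompose each character into base letter + combining mark (NFKD),
    # then discard the marks. The macron is lost but the vowel survives.
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def decode_token(token):
    # Undo percent-encoding, e.g. of a token taken from a URL.
    return unquote(token)


if __name__ == "__main__":
    word = "Pākehā"
    print(drop_non_ascii(word))           # lossy: macron vowels removed
    print(fold_to_ascii(word))            # base vowels kept, macrons removed
    print(word.encode("utf-8").decode("utf-8"))  # round-trips unchanged
    print(decode_token("%3Cunknown%3E"))
```

Keeping the text as UTF-8 from extraction through to output avoids the loss altogether; folding to ASCII is only a fallback when a downstream tool cannot accept Unicode.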