On Oct 11, 2013, at 6:39 PM, Mark Pernotto wrote:
> Putting my devil's advocate hat on, it doesn't parse foreign documents well
> (I got it to break!). I also got inconsistent results feeding it PDF files
> with tables embedded (but haven't been able to figure out what it is about
> them it doesn't like).
Mark, foreign documents are a good point. It is possible to guess the language of a text using a (Perl) module called… Well, I can't find it right now. It works by counting how many stop words from each of various languages appear in the document. Once the language is determined, a language-specific stop word list can be applied to the document, and the results ought to be better.
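To illustrate the idea, here is a minimal sketch in Python. It is not the Perl module mentioned above, just the same general technique as I understand it. The function names and the tiny stop word lists are made up for the example; real lists would be far longer and cover more languages.

```python
import re
from collections import Counter

# Hypothetical, heavily abbreviated stop word lists (assumption: real
# lists would contain hundreds of words per language).
STOP_WORDS = {
    "english": {"the", "and", "of", "to", "in", "is", "that", "it", "was", "for"},
    "french": {"le", "la", "les", "et", "de", "des", "un", "une", "est", "que"},
    "german": {"der", "die", "das", "und", "ist", "nicht", "ein", "eine", "zu", "mit"},
    "spanish": {"el", "los", "las", "y", "de", "que", "en", "es", "una", "por"},
}


def stop_word_scores(text):
    """Tabulate how many stop words of each language occur in the text."""
    # Split into lowercase runs of letters (Unicode-aware, no digits).
    words = re.findall(r"[^\W\d_]+", text.lower())
    counts = Counter(words)
    return {
        language: sum(counts[word] for word in stops)
        for language, stops in STOP_WORDS.items()
    }


def guess_language(text):
    """Return the language with the most stop word hits, or None if no hits."""
    scores = stop_word_scores(text)
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


if __name__ == "__main__":
    sample = "Le chat est sur la table et il mange une pomme."
    language = guess_language(sample)
    print(language, stop_word_scores(sample))
    # The matching list can then be used to filter the document's words.
    if language is not None:
        words = re.findall(r"[^\W\d_]+", sample.lower())
        print([w for w in words if w not in STOP_WORDS[language]])
```

With lists this short the guess is only reliable on text of a sentence or more, and ties go to whichever language comes first in the dictionary.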
Also, please remember, how well a document parses into sentences and words depends directly on the quality of the underlying OCR. That is a limitation I am not able to overcome.