Hi Mark,

I suspect the tool will only be able to handle select languages, and I very much doubt you could develop a tool to handle non-LCG text.

For a fully internationalised tool, you would have to ignore all text layers in a PDF and run every PDF through OCR to generate text. Then you'd need to apply very sophisticated word boundary identification routines.

A.

On 12/10/2013 9:40 AM, "Mark Pernotto" wrote:

> Very cool tool, thank you!
>
> Putting my devil's advocate hat on, it doesn't parse foreign documents well
> (I got it to break!). I also got inconsistent results feeding it PDF files
> with tables embedded (but haven't been able to figure out what it is about
> them it doesn't like).
>
> Just from a curiosity standpoint, what encoding is being utilized? I know
> nothing about Perl. It seemed to have no problem parsing a dash (-) if it
> was up against another character (2007-2012), but barfs when it's by itself
> (2007 – 2012). I'm only referring to 'extracted text' mode.
>
> If it helps, I can send along *most* of my test PDF files used.
>
> Thank you!
> .m
>
> On Fri, Oct 11, 2013 at 10:58 AM, Eric Lease Morgan wrote:
>
> > On Oct 11, 2013, at 1:49 PM, Matthew Sherman wrote:
> >
> > >> For a limited period of time I am making publicly available a Web-based
> > >> program called PDF2TXT -- http://bit.ly/1bJRyh8
> > >
> > > Very slick, good work. I can see where this tool can be very helpful. It
> > > does have some issues with some characters, but this is rather common with
> > > most systems.
> >
> > Again, thank you for the support. Yes, there are some escaping issues to
> > be resolved. "Release early. Release often." I need help with the graphic
> > design in general.
> >
> > Here's an enhancement I thought of:
> >
> >   1. allow readers to authenticate
> >   2. allow readers to upload documents
> >   3. documents get saved in readers' cache
> >   4. allow interface to list documents in the cache
> >   5. provide text mining services against reader-selected documents
> >   6. go to Step #1
> >
> > It would also be cool if I could figure out how to finish the installation
> > of Tesseract to enable OCRing. [1]
> >
> > [1] OCRing -
> > http://serials.infomotions.com/code4lib/archive/2013/201303/1554.html
> >
> > --
> > Eric Morgan