You may want to consider how best to handle PDF files where the text would contain ligatures and glyph ids rather than the underlying characters.

A.

On 12/10/2013 4:58 AM, "Eric Lease Morgan" wrote:

> On Oct 11, 2013, at 1:49 PM, Matthew Sherman wrote:
>
> >> For a limited period of time I am making publicly available a Web-based
> >> program called PDF2TXT -- http://bit.ly/1bJRyh8
> >
> > Very slick, good work. I can see where this tool can be very helpful. It
> > does have some issues with some characters, but this is rather common with
> > most systems.
>
> Again, thank you for the support. Yes, there are some escaping issues to
> be resolved. "Release early. Release often." I need help with the graphic
> design in general.
>
> Here's an enhancement I thought of:
>
>   1. allow readers to authenticate
>   2. allow readers to upload documents
>   3. documents get saved in readers' cache
>   4. allow interface to list documents in the cache
>   5. provide text mining services against reader-selected documents
>   6. go to Step #1
>
> It would also be cool if I could figure out how to finish the installation
> of Tesseract to enable OCRing. [1]
>
> [1] OCRing - http://serials.infomotions.com/code4lib/archive/2013/201303/1554.html
>
> --
> Eric Morgan
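
On the ligature point raised at the top of this message, here is a minimal sketch of one possible post-processing step, assuming the extracted text is already a Unicode string. It is not how PDF2TXT works; the function names `expand_ligatures` and `has_unmapped_glyphs` are hypothetical. NFKC normalization folds presentation-form ligatures such as U+FB01 back into plain letters, but it also rewrites other compatibility characters (superscripts, full-width forms), so it may be too aggressive for some collections. Raw glyph ids cannot be recovered this way at all; the second function only flags text that probably contains them.

```python
import unicodedata


def expand_ligatures(text: str) -> str:
    """Fold compatibility characters, including the Latin ligatures
    U+FB00..U+FB06, into their plain-letter equivalents via NFKC."""
    return unicodedata.normalize("NFKC", text)


def has_unmapped_glyphs(text: str) -> bool:
    """Heuristic only: private-use code points (category 'Co') or the
    replacement character U+FFFD often mean the PDF font had no usable
    mapping from glyph ids to Unicode."""
    return any(
        unicodedata.category(ch) == "Co" or ch == "\ufffd"
        for ch in text
    )


if __name__ == "__main__":
    sample = "\ufb01nd the \ufb02ow"      # begins with the 'fi' ligature
    print(expand_ligatures(sample))
    print(has_unmapped_glyphs("abc\ue000"))
```

Documents flagged by the second check would likely need a different route, such as OCR, since the underlying characters are simply not present in the extracted text.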