You may want to consider how best to handle PDF files whose extracted text
contains ligatures and glyph IDs rather than the underlying characters.
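
For the ligature half of that, one option might be Unicode compatibility
normalization. Here is a minimal sketch in Python, assuming the extracted text
is already a Unicode string. The name expand_ligatures is a hypothetical
helper for illustration, not something PDF2TXT provides:

```python
import unicodedata


def expand_ligatures(text: str) -> str:
    """Hypothetical helper: expand ligature code points into plain letters.

    NFKC normalization replaces compatibility characters such as
    U+FB01 (LATIN SMALL LIGATURE FI) with their plain equivalents.
    """
    return unicodedata.normalize("NFKC", text)


# "\ufb01" is the single-character "fi" ligature; "\ufb02" is "fl".
sample = "\ufb01le and \ufb02ow"
print(expand_ligatures(sample))
```

Note that NFKC also folds other compatibility characters (superscripts,
full-width forms, and so on), so it may change more than ligatures. It does
nothing for raw glyph IDs from fonts that lack a Unicode mapping; those would
probably need OCR or a per-font lookup.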
On 12/10/2013 4:58 AM, "Eric Lease Morgan" <[log in to unmask]> wrote:
> On Oct 11, 2013, at 1:49 PM, Matthew Sherman <[log in to unmask]> wrote:
> >> For a limited period of time I am making publicly available a Web-based
> >> program called PDF2TXT -- http://bit.ly/1bJRyh8
> > Very slick, good work. I can see where this tool can be very helpful. It
> > does have some issues with some characters, but this is rather common
> > with most systems.
> Again, thank you for the support. Yes, there are some escaping issues to
> be resolved. "Release early. Release often." I need help with the graphic
> design in general.
> Here's an enhancement I thought of:
> 1. allow readers to authenticate
> 2. allow readers to upload documents
> 3. documents get saved in readers' cache
> 4. allow interface to list documents in the cache
> 5. provide text mining services against reader-selected documents
> 6. go to Step #1
> It would also be cool if I could figure out how to finish the installation
> of Tesseract to enable OCRing. 
> Eric Morgan