You might want to consider using http://www.documentcloud.org to host
your users' documents. That would also take care of
privacy/authentication concerns. I know of a project in the journalism
domain (http://overview.ap.org/) that does that.
As far as I remember, they provide an API and do some named
entity recognition as well.
From: Code for Libraries [mailto:[log in to unmask]] On Behalf Of
Eric Lease Morgan
Sent: 11 October 2013 18:58
To: [log in to unmask]
Subject: Re: [CODE4LIB] pdf2txt
On Oct 11, 2013, at 1:49 PM, Matthew Sherman <[log in to unmask]> wrote:
>> For a limited period of time I am making publicly available a
>> Web-based program called PDF2TXT -- http://bit.ly/1bJRyh8
> Very slick, good work. I can see where this tool can be very helpful.
> It does have some issues with some characters, but this is rather
> common with most systems.
Again, thank you for the support. Yes, there are some escaping issues to
be resolved. "Release early. Release often." I need help with the
graphic design in general.
Here's an enhancement I thought of:
1. allow readers to authenticate
2. allow readers to upload documents
3. documents get saved in readers' cache
4. allow interface to list documents in the cache
5. provide text mining services against reader-selected documents
6. go to Step #1
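A rough sketch of steps 2 through 5, just to make the idea concrete. This is not how PDF2TXT is written. The class name, the method names, and the word-count "mining" are all made up for illustration, and authentication (step 1) is assumed to have already produced a reader id.

```python
from collections import Counter
import re


class ReaderCache:
    """In-memory stand-in for per-reader document storage (hypothetical)."""

    def __init__(self):
        # reader id -> {document name -> plain text}
        self._docs = {}

    def upload(self, reader, name, text):
        # Steps 2-3: accept a document and save it in the reader's cache.
        self._docs.setdefault(reader, {})[name] = text

    def list_documents(self, reader):
        # Step 4: list what the reader has stored, sorted by name.
        return sorted(self._docs.get(reader, {}))

    def top_words(self, reader, names, n=5):
        # Step 5: a deliberately simple text mining service, counting
        # word frequencies across the reader-selected documents.
        counts = Counter()
        for name in names:
            text = self._docs[reader][name]
            counts.update(re.findall(r"[a-z]+", text.lower()))
        return counts.most_common(n)
```

A real version would presumably persist the cache to disk or a database and swap the word counter for whatever text mining services are on offer.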
It would also be cool if I could figure out how to finish the
installation of Tesseract to enable OCRing. 
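For what it is worth, once Tesseract is installed, calling it from a script could look roughly like the sketch below. It assumes the `tesseract` command is on the PATH and that pages have already been turned into image files. The function names are hypothetical and nothing here comes from PDF2TXT itself.

```python
import shutil
import subprocess


def ocr_command(image_path, language="eng"):
    # Passing "stdout" as the output base makes tesseract print the
    # recognized text instead of writing it to a file.
    return ["tesseract", image_path, "stdout", "-l", language]


def ocr_image(image_path, language="eng"):
    # Fail early with a readable message if the install is incomplete.
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract is not installed or not on PATH")
    result = subprocess.run(
        ocr_command(image_path, language),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```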