Eric,

You might want to consider using http://www.documentcloud.org to host your users' documents. That would also take care of the privacy/authentication concerns. I know of a project in the journalism domain (http://overview.ap.org/) that does that. As far as I remember, they provide an API and do some named entity recognition as well.

Regards,
Arash

-----Original Message-----
From: Code for Libraries [mailto:[log in to unmask]] On Behalf Of Eric Lease Morgan
Sent: 11 October 2013 18:58
To: [log in to unmask]
Subject: Re: [CODE4LIB] pdf2txt

On Oct 11, 2013, at 1:49 PM, Matthew Sherman <[log in to unmask]> wrote:

>> For a limited period of time I am making publicly available a
>> Web-based program called PDF2TXT -- http://bit.ly/1bJRyh8
>
> Very slick, good work. I can see where this tool can be very helpful.
> It does have some issues with some characters, but this is rather
> common with most systems. Again, thank you for the support.

Yes, there are some escaping issues to be resolved. "Release early. Release often." I need help with the graphic design in general. Here's an enhancement I thought of:

  1. allow readers to authenticate
  2. allow readers to upload documents
  3. documents get saved in readers' cache
  4. allow interface to list documents in the cache
  5. provide text mining services against reader-selected documents
  6. go to Step #1

It would also be cool if I could figure out how to finish the installation of Tesseract to enable OCRing. [1]

[1] OCRing - http://serials.infomotions.com/code4lib/archive/2013/201303/1554.html

--
Eric Morgan
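For illustration only, the enhancement steps listed in the quoted message (authenticate, upload, cache, list, text-mine) might be sketched in Python roughly as follows. This is not PDF2TXT's code: the class `ReaderCache` and all of its method names are hypothetical, the "cache" is just an in-memory dictionary, passwords are compared in plain text purely to keep the sketch short, and the "text mining" is a simple word count over text that is assumed to have been extracted already.

```python
import re
from collections import Counter


class ReaderCache:
    """Hypothetical in-memory store: one document cache per reader."""

    def __init__(self):
        self._passwords = {}   # reader -> password (plain text, sketch only)
        self._documents = {}   # reader -> {document name: extracted text}

    def register(self, reader, password):
        """Create an account with an empty cache."""
        self._passwords[reader] = password
        self._documents[reader] = {}

    def authenticate(self, reader, password):
        """Step 1: check the reader's credentials."""
        return reader in self._passwords and self._passwords[reader] == password

    def upload(self, reader, name, text):
        """Steps 2-3: accept a document and save it in the reader's cache."""
        self._documents[reader][name] = text

    def list_documents(self, reader):
        """Step 4: list the names of the documents in the reader's cache."""
        return sorted(self._documents[reader])

    def term_frequencies(self, reader, names, top=10):
        """Step 5: count words across the reader-selected documents."""
        counts = Counter()
        for name in names:
            text = self._documents[reader][name]
            counts.update(re.findall(r"[a-z']+", text.lower()))
        return counts.most_common(top)


if __name__ == "__main__":
    cache = ReaderCache()
    cache.register("reader1", "secret")
    if cache.authenticate("reader1", "secret"):
        cache.upload("reader1", "notes.txt", "PDF text extracted from a PDF")
        print(cache.list_documents("reader1"))
        print(cache.term_frequencies("reader1", ["notes.txt"], top=3))
```

A real service would need hashed passwords, persistent storage, and a Web front end; the sketch only shows how the five steps relate to one another.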