Software Developer, Digital Services Unit
Atlanta University Center, Robert W. Woodruff Library
email: [log in to unmask]; office: 1 404 978 2057
On 10/15/13 4:23 PM, "Arash.Joorabchi" <[log in to unmask]> wrote:
>You might want to consider using http://www.documentcloud.org to host
>your users' documents. That would also take care of
>privacy/authentication concerns. I know of a project in the journalism
>domain (http://overview.ap.org/) that does that.
>As far as I remember, they provide an API and do some named
>entity recognition as well.
>From: Code for Libraries [mailto:[log in to unmask]] On Behalf Of
>Eric Lease Morgan
>Sent: 11 October 2013 18:58
>To: [log in to unmask]
>Subject: Re: [CODE4LIB] pdf2txt
>On Oct 11, 2013, at 1:49 PM, Matthew Sherman <[log in to unmask]> wrote:
>>> For a limited period of time I am making publicly available a
>>> Web-based program called PDF2TXT -- http://bit.ly/1bJRyh8
>> Very slick, good work. I can see how this tool could be very helpful.
>> It does have issues with some characters, but that is rather
>> common with most systems.
>Again, thank you for the support. Yes, there are some escaping issues to
>be resolved. "Release early. Release often." I need help with the
>graphic design in general.
>Here's an enhancement I thought of:
> 1. allow readers to authenticate
> 2. allow readers to upload documents
> 3. documents get saved in readers' cache
> 4. allow interface to list documents in the cache
> 5. provide text mining services against reader-selected documents
> 6. go to Step #1
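The loop described in the list above can be sketched in a few lines. This is only a minimal in-memory illustration, not how PDF2TXT is built: the class name `ReaderCache`, its method names, the plain-text password check, and the word-frequency count standing in for "text mining services" are all assumptions made for the example.

```python
from collections import Counter
import re


class ReaderCache:
    """Hypothetical in-memory stand-in for per-reader document storage."""

    def __init__(self, credentials):
        # credentials maps reader name -> password. Plain-text passwords are
        # used only to keep the sketch short; a real service would hash them.
        self._credentials = dict(credentials)
        self._documents = {}  # reader name -> {title: extracted text}

    def authenticate(self, reader, password):
        # Step 1: allow readers to authenticate.
        stored = self._credentials.get(reader)
        return stored is not None and stored == password

    def upload(self, reader, title, text):
        # Steps 2 and 3: accept a document and save it in the reader's cache.
        self._documents.setdefault(reader, {})[title] = text

    def list_documents(self, reader):
        # Step 4: list the titles held in the reader's cache.
        return sorted(self._documents.get(reader, {}))

    def word_frequencies(self, reader, titles, top=5):
        # Step 5: a very simple text mining service, counting the most
        # common words across the reader-selected documents.
        counts = Counter()
        for title in titles:
            text = self._documents[reader][title]
            counts.update(re.findall(r"[a-z']+", text.lower()))
        return counts.most_common(top)
```

A reader would authenticate, upload one or more documents, list them, pick a subset, and request word frequencies; Step 6 is simply the next pass through the same calls.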
>It would also be cool if I could figure out how to finish the
>installation of Tesseract to enable OCRing. 
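Once Tesseract is installed, one way to drive it is through its command-line program. The sketch below assumes the `tesseract` executable is on the PATH and that each page has already been rendered to an image, since Tesseract reads images rather than PDF files. The function names `tesseract_command` and `ocr_image` are made up for this example.

```python
import shutil
import subprocess
from pathlib import Path


def tesseract_command(image_path, output_base, lang="eng"):
    # Build the argument list: tesseract <image> <output base> -l <language>.
    # Tesseract appends ".txt" to the output base itself.
    return ["tesseract", str(image_path), str(output_base), "-l", lang]


def ocr_image(image_path, output_base, lang="eng"):
    # Fail early with a clear message if the installation is incomplete.
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    subprocess.run(tesseract_command(image_path, output_base, lang), check=True)
    # Read back the text file that Tesseract wrote.
    return Path(str(output_base) + ".txt").read_text(encoding="utf-8")
```

Keeping the command construction separate from the call makes it possible to check the arguments without having Tesseract installed.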