On Oct 15, 2013, at 10:44 AM, Eric Lease Morgan <[log in to unmask]> wrote:
> For a limited period of time I am making publicly available a Web-based program called PDF2TXT -- http://bit.ly/1bJRyh8
On Oct 14, 2013, at 7:56 AM, Nicolas Franck <[log in to unmask]> wrote:
> Could this also be done by Apache Tika? Or am I missing a crucial point?
>
> http://tika.apache.org/1.4/gettingstarted.html
I have largely replaced the text extraction routine in my PDF2TXT script with Tika, which allows the tool to read a much wider range of document types (PDF, Word, Mac Pages, possibly PowerPoint, etc.). Thank you, Nicolas. I have also created the barest of Git repositories to host the (Perl) code:
* PDF2TXT - http://bit.ly/1bJRyh8
* Git repository - https://github.com/ericleasemorgan/pdf2txt
As a reminder, PDF2TXT extracts plain text from a file and does some rudimentary text mining against the result.
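For anyone who wants a concrete picture of that two-step process, here is a minimal Python sketch. It is not the Perl code in the repository, only an illustration: `extract_text`, `top_words`, the tiny `STOPWORDS` list, and the `tika-app.jar` path are all hypothetical names chosen for this example, and word counting merely stands in for "rudimentary text mining". It assumes Java is installed and that a Tika app jar has been downloaded locally.

```python
import re
import subprocess
from collections import Counter

# Hypothetical, deliberately tiny stop word list; a real one would be much longer.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "it"}


def extract_text(path, tika_jar="tika-app.jar"):
    """Return the plain text of a document by shelling out to the Tika app jar.

    Assumption: `tika_jar` points at a locally downloaded Tika app jar.
    The --text option asks Tika to write plain text to standard output.
    """
    result = subprocess.run(
        ["java", "-jar", tika_jar, "--text", path],
        capture_output=True,
        text=True,
        check=True,  # raise an error if Tika exits with a failure status
    )
    return result.stdout


def top_words(text, n=10):
    """Return the n most frequent non-stop words as (word, count) pairs."""
    # Lowercase the text and keep only runs of the letters a-z as words.
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    return counts.most_common(n)
```

Calling `top_words(extract_text("some-file.pdf"))` would, under those assumptions, list the most frequently used words in the document. Keeping extraction and counting in separate functions means the counting step works on any plain text, regardless of how it was extracted.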
--
Eric Lease Morgan
University of Notre Dame