LISTSERV 16.5 - CODE4LIB Archives

On Oct 15, 2013, at 10:44 AM, Eric Lease Morgan wrote:

> For a limited period of time I am making publicly available a Web-based program called PDF2TXT -- http://bit.ly/1bJRyh8


On Oct 14, 2013, at 7:56 AM, Nicolas Franck wrote:

> Could this also be done by Apache Tika? Or am I missing a crucial point?
> 
> http://tika.apache.org/1.4/gettingstarted.html



To a great degree, I have replaced the text extraction routine in my PDF2TXT script with Tika, allowing the tool to read a much wider range of document types (PDF, Word, Mac Pages, PowerPoint (maybe), etc.). Thank you, Nicolas. I have also created the barest of Git repositories hosting the (Perl) code:

  * PDF2TXT - http://bit.ly/1bJRyh8
  * Git repository - https://github.com/ericleasemorgan/pdf2txt

Just a reminder: PDF2TXT extracts plain text from a file and does some rudimentary text mining against the result.
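
For anyone curious what that two-step flow looks like, here is a minimal sketch in Python rather than the Perl in the repository. It is not the actual PDF2TXT code. It assumes a local copy of the Tika application jar (the file name tika-app.jar is a placeholder), uses Tika's --text option to get plain text, and stands in a simple word-frequency count for the "rudimentary text mining", since the real analysis may well differ. The stop-word list is likewise only illustrative.

```python
import re
import subprocess
from collections import Counter

# Hypothetical location of the Tika application jar; adjust to your install.
TIKA_JAR = "tika-app.jar"

# A tiny illustrative stop-word list, not the one PDF2TXT uses.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "it", "that"}


def tika_command(path, jar=TIKA_JAR):
    """Build the command line asking Tika to emit plain text for one file."""
    return ["java", "-jar", jar, "--text", path]


def extract_text(path, jar=TIKA_JAR):
    """Run Tika and return the extracted text (needs Java and the jar)."""
    result = subprocess.run(tika_command(path, jar),
                            capture_output=True, check=True)
    return result.stdout.decode("utf-8", errors="replace")


def top_words(text, n=10):
    """Return the n most frequent words, ignoring case and stop words."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in STOPWORDS).most_common(n)
```

With Java and the jar in place, something like top_words(extract_text("report.pdf")) would list the most common words in a document, where report.pdf is just an example file name.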

—
Eric Lease Morgan
University of Notre Dame