Very slick, good work. I can see how this tool could be very helpful. It
does have trouble with some characters, but that is rather common with
most systems.
On Fri, Oct 11, 2013 at 11:16 AM, Eric Lease Morgan <[log in to unmask]> wrote:
>
> For a limited period of time I am making publicly available a Web-based
> program called PDF2TXT -- http://bit.ly/1bJRyh8
>
> PDF2TXT extracts the text from an OCRed PDF document and then does some
> rudimentary "distant reading" against the text in the form of word clouds,
> readability scores, concordance features, and "maps" (histograms)
> illustrating where terms appear in a text.
>
> Here is the idea behind the application:
>
> 1. In the Libraries I see people scanning, scanning, and
> scanning. I suppose these people then go home and read the
> document. They might even print it. These documents are long.
> Moreover, I'll bet they have multiple documents.
>
> 2. Text mining requires digitized text, but PDF documents are
> frequently full of formatting. At the same time, they often
> have the text underneath. Our scanning software does OCR.
>
> 3. By extracting the text from PDF documents, I can facilitate
> a different -- additional -- type of analysis against sets of
> one or more documents. PDF2TXT is the first step in this
> process.
>
> What is really cool is that PDF2TXT works for many of the articles
> downloadable from the Libraries' article indexes. Search an article index.
> Download a full-text PDF version of the article. Feed it to PDF2TXT. Get
> more out of your article.
>
> PDF2TXT currently has "creeping featuritis" -- meaning that it is growing
> in weird directions. Your feedback is more than welcome. (I know. The
> output is ugly.) Also, please be gentle with it because it does not process
> things the size of the Bible.
>
> --
>
> Eric Lease Morgan
> Digital Initiatives Librarian
>
> University of Notre Dame
> Room 131, Hesburgh Libraries
> Notre Dame, IN 46556
> o: 574-631-8604
> e: [log in to unmask]
>