Very cool.
But, why only for a limited period of time?
-Sean
On 10/11/13 11:16 AM, "Eric Lease Morgan" <[log in to unmask]> wrote:
>
>For a limited period of time I am making publicly available a Web-based
>program called PDF2TXT -- http://bit.ly/1bJRyh8
>
>PDF2TXT extracts the text from an OCRed PDF document and then does some
>rudimentary "distant reading" against the text in the form of word
>clouds, readability scores, concordance features, and "maps" (histograms)
>illustrating where terms appear in a text.
>
>Here is the idea behind the application:
>
> 1. In the Libraries I see people scanning, scanning, and
> scanning. I suppose these people then go home and read the
> document. They might even print it. These documents are long.
> Moreover, I'll bet they have multiple documents.
>
> 2. Text mining requires digitized text, but PDF documents are
> frequently full of formatting. At the same time, they often
> have the text underneath. Our scanning software does OCR.
>
> 3. By extracting the text from PDF documents, I can facilitate
> a different -- additional -- type of analysis against sets of
> one or more documents. PDF2TXT is the first step in this
> process.
>
>What is really cool is that PDF2TXT works for many of the articles
>downloadable from the Libraries' article indexes. Search an article
>index. Download a full-text PDF version of the article. Feed it to
>PDF2TXT. Get more out of your article.
>
>PDF2TXT currently has "creeping featuritis" -- meaning that it is growing
>in weird directions. Your feedback is more than welcome. (I know. The
>output is ugly.) Also, please be gentle with it because it does not
>process things the size of the Bible.
>
>--
>
>Eric Lease Morgan
>Digital Initiatives Librarian
>
>University of Notre Dame
>Room 131, Hesburgh Libraries
>Notre Dame, IN 46556
>o: 574-631-8604
>e: [log in to unmask]
>
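For anyone curious what the concordance and the term "maps" (histograms) mentioned above could look like in code, here is a minimal sketch in Python. It is not PDF2TXT's actual implementation, which the message does not describe. It assumes the text has already been extracted from the PDF, and the function names (tokenize, concordance, term_map) are hypothetical, chosen only for illustration.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def concordance(tokens, term, width=3):
    """Return each occurrence of `term` with up to `width` tokens of context per side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == term:
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            lines.append(" ".join(left + [tok] + right))
    return lines


def term_map(tokens, term, bins=10):
    """Count occurrences of `term` in each of `bins` roughly equal slices of the text."""
    counts = [0] * bins
    n = len(tokens)
    for i, tok in enumerate(tokens):
        if tok == term:
            # Integer arithmetic maps position i (0 <= i < n) to a bin in 0..bins-1.
            counts[i * bins // n] += 1
    return counts


if __name__ == "__main__":
    # Stand-in for text already extracted from an OCRed PDF (an assumption).
    sample = "the cat sat on the mat and the dog sat on the log"
    words = tokenize(sample)
    for line in concordance(words, "sat", width=2):
        print(line)
    # Print one bar per slice of the text; a longer bar means more hits there.
    for slice_number, count in enumerate(term_map(words, "the", bins=2)):
        print(slice_number, "#" * count)
```

The histogram divides the token stream into equal slices by position rather than by page or paragraph, which keeps the sketch independent of any PDF layout information.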