Very cool. But, why only for a limited period of time?

-Sean

On 10/11/13 11:16 AM, "Eric Lease Morgan" <[log in to unmask]> wrote:

>
>For a limited period of time I am making publicly available a Web-based
>program called PDF2TXT -- http://bit.ly/1bJRyh8
>
>PDF2TXT extracts the text from an OCRed PDF document and then does some
>rudimentary "distant reading" against the text in the form of word
>clouds, readability scores, concordance features, and "maps" (histograms)
>illustrating where terms appear in a text.
>
>Here is the idea behind the application:
>
>  1. In the Libraries I see people scanning, scanning, and
>     scanning. I suppose these people then go home and read the
>     document. They might even print it. These documents are long.
>     Moreover, I'll bet they have multiple documents.
>
>  2. Text mining requires digitized text, but PDF documents are
>     frequently full of formatting. At the same time, they often
>     have the text underneath. Our scanning software does OCR.
>
>  3. By extracting the text from PDF documents, I can facilitate
>     a different -- additional -- type of analysis against sets of
>     one or more documents. PDF2TXT is the first step in this
>     process.
>
>What is really cool is that PDF2TXT works for many of the articles
>downloadable from the Libraries's article indexes. Search an article
>index. Download a full text, PDF version of the article. Feed it to
>PDF2TXT. Get more out of your article.
>
>PDF2TXT currently has "creeping featuritis" -- meaning that it is growing
>in weird directions. Your feedback is more than welcome. (I know. The
>output is ugly.) Also, please be gentle with it because it does not
>process things the size of the Bible.
>
>--
>Eric Lease Morgan
>Digital Initiatives Librarian
>
>University of Notre Dame
>Room 131, Hesburgh Libraries
>Notre Dame, IN 46556
>o: 574-631-8604
>e: [log in to unmask]
>
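
For anyone curious what the "concordance" and "maps" features in the quoted announcement might look like in miniature, here is a rough sketch in plain Python. It is not PDF2TXT's code, which is not shown in this thread; it assumes the text has already been extracted from the PDF, and the function names (tokenize, concordance, dispersion, render_map) are made up for illustration.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def concordance(tokens, term, window=3):
    """Return keyword-in-context lines: `window` tokens either side of each hit."""
    term = term.lower()
    lines = []
    for i, token in enumerate(tokens):
        if token == term:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append(f"{left} [{token}] {right}".strip())
    return lines


def dispersion(tokens, term, bins=10):
    """Count hits for `term` in each of `bins` equal slices of the text."""
    term = term.lower()
    counts = [0] * bins
    if not tokens:
        return counts
    for i, token in enumerate(tokens):
        if token == term:
            # Map the token position onto a slice index in 0..bins-1.
            counts[i * bins // len(tokens)] += 1
    return counts


def render_map(counts):
    """Draw one row of '#' marks per slice, a crude text histogram."""
    return "\n".join(f"{n:2d} {'#' * c}" for n, c in enumerate(counts, start=1))


if __name__ == "__main__":
    sample = ("Text mining requires digitized text, but PDF documents are "
              "frequently full of formatting. At the same time, they often "
              "have the text underneath.")
    words = tokenize(sample)
    for line in concordance(words, "text"):
        print(line)
    print(render_map(dispersion(words, "text", bins=4)))
```

The histogram here is just a count of hits per equal-sized slice of the token stream, which is one simple way to show where a term appears in a text; the real application may well do it differently.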