Hi Penelope,

Of the document you wrote, the key part for this discussion seems to be
this:

> Some suggested ways to make the scanned information accessible in a
> seamless manner are:
> - The catalogue records to have two links, one to the actual document
> and the other to a search page that enables searching all the documents.
> The second link could be something like “Click here to go to the full-text
> search for departmental reports”.
> - An easy to use (user friendly) full-text search interface.

So are you asking how to make a full-text search interface using the OCR
results from Eric's tool?

This doesn't at all answer your question, but gives a pointer to OCR
quality control:
"Case Study: Using Perl and CGI Scripts to Automate a Quality Control
Workflow for Scanned Congressional Documents"
http://journal.code4lib.org/articles/6731

I think if you ask a more particular question (that doesn't rely on
reading your draft), you might get a better answer.

-Jodi

On Mon, Oct 14, 2013 at 6:48 AM, Penelope Campbell <[log in to unmask]> wrote:

> Dear Eric,
>
> Thanks for this.
>
> As a small special library (solo librarian) in an Australian State
> Government Department I use DB/TextWorks, which has a feature for
> importing documents so that the full text can be read. However, it only
> imports the full text; it does not do what you have done, which is
> really great. I wrote a small piece (see attached) explaining what I am
> in the process of doing. I am using the library catalogue records as
> metadata. But I am hoping for something more. I do really want to open
> up the collection and make the information discoverable through more
> than just the Library catalogue. I had contacted Jaume Nualart, who
> wrote a paper on a way to present terms called Texty:
> http://informationr.net/ir/18-2/paper581.html
> But it is not a piece of software. I am quite interested in what you
> have done.
> I am just trying to work out a way to show relevancy, and this may be
> something I could integrate into the Library catalogue.
>
> I hope you can take the time to reply to me.
>
> Thank you
>
> Penelope Campbell | Library Manager
> Department of Family and Community Services | Housing NSW
> T 02 8753 8732 | F 02 8753 8734
> A Ground Floor, 223-239 Liverpool Road Ashfield NSW, 2131
> A Locked bag 4001 Ashfield BC NSW, 1800
> E [log in to unmask]
> W www.housing.nsw.gov.au
>
> -----Original Message-----
> From: Code for Libraries [mailto:[log in to unmask]] On Behalf Of
> Eric Lease Morgan
> Sent: Saturday, 12 October 2013 2:16 AM
> To: [log in to unmask]
> Subject: [CODE4LIB] pdf2txt
>
> For a limited period of time I am making publicly available a Web-based
> program called PDF2TXT -- http://bit.ly/1bJRyh8
>
> PDF2TXT extracts the text from an OCRed PDF document and then does some
> rudimentary "distant reading" against the text in the form of word
> clouds, readability scores, concordance features, and "maps"
> (histograms) illustrating where terms appear in a text.
>
> Here is the idea behind the application:
>
>   1. In the Libraries I see people scanning, scanning, and
>      scanning. I suppose these people then go home and read the
>      document. They might even print it. These documents are long.
>      Moreover, I'll bet they have multiple documents.
>
>   2. Text mining requires digitized text, but PDF documents are
>      frequently full of formatting. At the same time, they often
>      have the text underneath. Our scanning software does OCR.
>
>   3. By extracting the text from PDF documents, I can facilitate
>      a different -- additional -- type of analysis against sets of
>      one or more documents. PDF2TXT is the first step in this
>      process.
>
> What is really cool is that PDF2TXT works for many of the articles
> downloadable from the Libraries' article indexes. Search an article
> index. Download a full text, PDF version of the article. Feed it to
> PDF2TXT.
> Get more out of your article.
>
> PDF2TXT currently has "creeping featuritis" -- meaning that it is
> growing in weird directions. Your feedback is more than welcome. (I
> know. The output is ugly.) Also, please be gentle with it because it
> does not process things the size of the Bible.
>
> --
> Eric Lease Morgan
> Digital Initiatives Librarian
>
> University of Notre Dame
> Room 131, Hesburgh Libraries
> Notre Dame, IN 46556
> o: 574-631-8604
> e: [log in to unmask]
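
For anyone curious what the concordance and "maps" (histogram) features
described in the quoted announcement might look like in practice, here is
a minimal sketch in Python. It assumes the text has already been extracted
from the OCRed PDF into a plain string. The function names (tokenize,
concordance, term_map, top_terms) are hypothetical, and this is not a
description of how PDF2TXT itself is implemented.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def concordance(text, term, width=3):
    """Return each occurrence of `term` with up to `width` words of
    context on each side; the matched term is shown in upper case."""
    words = tokenize(text)
    term = term.lower()
    hits = []
    for i, word in enumerate(words):
        if word == term:
            left = words[max(0, i - width):i]
            right = words[i + 1:i + 1 + width]
            hits.append(" ".join(left + [word.upper()] + right))
    return hits


def term_map(text, term, bins=10):
    """Count occurrences of `term` in each of `bins` equal slices of
    the text, giving a rough picture of where the term appears."""
    words = tokenize(text)
    counts = [0] * bins
    if not words:
        return counts
    term = term.lower()
    for i, word in enumerate(words):
        if word == term:
            # Map the word position onto a bin index in 0..bins-1.
            counts[i * bins // len(words)] += 1
    return counts


def top_terms(text, n=5, min_length=4):
    """Return the `n` most frequent words of at least `min_length`
    characters, as a crude input for a word cloud."""
    words = [w for w in tokenize(text) if len(w) >= min_length]
    return Counter(words).most_common(n)
```

Calling term_map(extracted_text, "housing", bins=20) would give twenty
counts that could be drawn as a bar chart, and concordance() gives the
keyword-in-context lines. Short words are filtered by length in
top_terms() only as a stand-in for a proper stop-word list.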