Hi Penelope,

Of the document you wrote, the key part for this discussion seems to be:

> Some suggested ways to make the scanned information accessible in a
> seamless manner are:

>    - The catalogue records to have two links, one to the actual document
>    and the other to a search page that enables searching all the documents.
>    The second link could be something like “Click here to go to the full-text
>    search for departmental reports”.
>    - An easy to use (user friendly) full-text search interface.

So are you asking how to make a full-text search interface using the OCR
results from Eric's tool?
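
If that is the question, the core of a full-text search is an inverted
index over the extracted text. Below is a minimal sketch in Python, not a
description of how Eric's tool or your catalogue works. It assumes the OCR
text for each report is already available as a plain string; the record
ids and sample sentences are made up for illustration.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the sorted ids of documents containing every query term."""
    terms = tokenize(query)
    if not terms:
        return []
    hits = set(index.get(terms[0], set()))
    for term in terms[1:]:
        hits &= index.get(term, set())
    return sorted(hits)


# Hypothetical OCR output keyed by a made-up catalogue record id.
docs = {
    "report-1": "Annual report on public housing tenancy.",
    "report-2": "Review of community housing providers.",
}
index = build_index(docs)
print(search(index, "housing tenancy"))
```

A search page would call something like search() and link each returned
id back to its catalogue record. For a real collection, an existing
search engine is likely a better choice than a hand-rolled index.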

This doesn't answer your question at all, but here is a pointer on OCR
quality control:
"Case Study: Using Perl and CGI Scripts to Automate a Quality Control
Workflow for Scanned Congressional Documents"

I think if you ask a more specific question (one that doesn't rely on
reading your draft), you might get a better answer.


On Mon, Oct 14, 2013 at 6:48 AM, Penelope Campbell <
[log in to unmask]> wrote:

> Dear Eric,
> Thanks for this.
> As the solo librarian of a small special library in an Australian State
> Government Department, I use DB/TextWorks, which has a feature for
> importing documents so that the full text can be read. However, it only
> imports the full text; it does not do what you have done, which is
> really great. I wrote a small piece (see attached) explaining what I am
> in the process of doing. I am using the library catalogue records as
> metadata, but I am hoping for something more. I really do want to open
> up the collection and make the information discoverable through more
> than just the library catalogue. I had contacted Jaume Nualart, who
> wrote a paper on some ways to present terms, called Texty, but it is
> not a piece of software. I am quite interested in what you have done. I
> am just trying to work out a way to show relevancy, and this may be
> something I could integrate into the library catalogue.
> I hope you can take the time to reply to me.
> Thank you
> Penelope Campbell | Library Manager
> Department of Family and Community Services | Housing NSW
> T 02 8753 8732 | F 02 8753 8734
> A Ground Floor, 223-239 Liverpool Road Ashfield NSW, 2131
> A Locked bag 4001 Ashfield BC NSW, 1800
> E [log in to unmask]
> -----Original Message-----
> From: Code for Libraries [mailto:[log in to unmask]] On Behalf Of
> Eric Lease Morgan
> Sent: Saturday, 12 October 2013 2:16 AM
> To: [log in to unmask]
> Subject: [CODE4LIB] pdf2txt
> For a limited period of time I am making publicly available a Web-based
> program called PDF2TXT --
> PDF2TXT extracts the text from an OCRed PDF document and then does some
> rudimentary "distant reading" against the text in the form of word
> clouds, readability scores, concordance features, and "maps"
> (histograms) illustrating where terms appear in a text.
> Here is the idea behind the application:
>   1. In the Libraries I see people scanning, scanning, and
>      scanning. I suppose these people then go home and read the
>      document. They might even print it. These documents are long.
>      Moreover, I'll bet they have multiple documents.
>   2. Text mining requires digitized text, but PDF documents are
>      frequently full of formatting. At the same time, they often
>      have the text underneath. Our scanning software does OCR.
>   3. By extracting the text from PDF documents, I can facilitate
>      a different -- additional -- type of analysis against sets of
>      one or more documents. PDF2TXT is the first step in this
>      process.
> What is really cool is that PDF2TXT works for many of the articles
> downloadable from the Libraries' article indexes. Search an article
> index. Download a full text, PDF version of the article. Feed it to
> PDF2TXT. Get more out of your article.
> PDF2TXT currently has "creeping featuritis" -- meaning that it is
> growing in weird directions. Your feedback is more than welcome. (I
> know. The output is ugly.) Also, please be gentle with it because it
> does not process things the size of the Bible.
> --
> Eric Lease Morgan
> Digital Initiatives Librarian
> University of Notre Dame
> Room 131, Hesburgh Libraries
> Notre Dame, IN 46556
> o: 574-631-8604
> e: [log in to unmask]