Could this also be done by Apache Tika? Or am I missing a crucial point?
Apparently it has a command-line utility that extracts metadata and content from
various document formats and prints them to standard output. The output
can then be supplied to text-analysing tools like Solr.
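
As a rough illustration of the Tika route, here is a minimal Python sketch that shells out to the Tika command-line app and captures what it prints to standard output. It assumes Java is installed and that the Tika app jar has been downloaded; the jar path "tika-app.jar" and the file name "report.pdf" are placeholders, not anything from this thread. The --text, --metadata and --json options are Tika app options.

```python
import subprocess


def build_tika_command(jar_path, document_path, mode="text"):
    """Return the argument list for one Tika app invocation."""
    # Map a friendly mode name to the corresponding Tika app option.
    flags = {"text": "--text", "metadata": "--metadata", "json": "--json"}
    if mode not in flags:
        raise ValueError(f"unsupported mode: {mode!r}")
    return ["java", "-jar", str(jar_path), flags[mode], str(document_path)]


def run_tika(jar_path, document_path, mode="text"):
    """Run Tika and return whatever it prints to standard output."""
    result = subprocess.run(
        build_tika_command(jar_path, document_path, mode),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if Tika exits with an error
    )
    return result.stdout


if __name__ == "__main__":
    # Placeholder paths: adjust to wherever the jar and the PDF actually live.
    print(run_tika("tika-app.jar", "report.pdf", mode="text"))
```

The returned string could then be placed in a field of a document sent to Solr for indexing; that step is left out here because it depends on the local Solr schema.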
From: Code for Libraries [[log in to unmask]] On Behalf Of Jodi Schneider [[log in to unmask]]
Sent: Monday, October 14, 2013 11:22 AM
To: [log in to unmask]
Subject: Re: [CODE4LIB] pdf2txt
Of the document you wrote, the key part for this discussion seems to be:
> Some suggested ways to make the scanned information accessible in a
> seamless manner are:
> - The catalogue records to have two links one to the actual document
> and the other to a search page that enables searching all the documents.
> The second link could be something like “Click here to go to the full-text
> search for departmental reports”.
> - An easy to use (user friendly) full-text search interface.
So are you asking how to make a full-text search interface using the OCR
results from Eric's tool?
This doesn't at all answer your question, but gives a pointer to OCR
"Case Study: Using Perl and CGI Scripts to Automate a Quality Control
Workflow for Scanned Congressional Documents"
I think if you ask a more particular question (that doesn't rely on reading
your draft), you might get a better answer.
On Mon, Oct 14, 2013 at 6:48 AM, Penelope Campbell <
[log in to unmask]> wrote:
> Dear Eric,
> Thanks for this.
> As a small special library (solo librarian) in an Australian State
> Government Department I use DB/Text works which has a feature of
> importing documents so that the full text can be read. However, it only
> imports the full text; it does not do what you have done, which is really great. I
> wrote a small piece (see attached) explaining what I am in the process
> of doing. I am using the library catalogue records as metadata. But I
> am hoping for something more. I do really want to open up the
> collection and make the information discoverable more than just the
> Library catalogue. I had contacted Jaume Nualart, who wrote a paper on
> some ways to present terms, called Texty.
> http://informationr.net/ir/18-2/paper581.html But it is not a piece of
> software. I am quite interested in what you have done. I am just trying
> to work out a way to show relevancy and this may be something I could
> integrate into the Library catalogue.
> I hope you can take the time to reply to me.
> Thank you
> Penelope Campbell | Library Manager
> Department of Family and Community Services | Housing NSW
> T 02 8753 8732 | F 02 8753 8734
> A Ground Floor, 223-239 Liverpool Road Ashfield NSW, 2131
> A Locked bag 4001 Ashfield BC NSW, 1800
> E [log in to unmask]
> W www.housing.nsw.gov.au
> -----Original Message-----
> From: Code for Libraries [mailto:[log in to unmask]] On Behalf Of
> Eric Lease Morgan
> Sent: Saturday, 12 October 2013 2:16 AM
> To: [log in to unmask]
> Subject: [CODE4LIB] pdf2txt
> For a limited period of time I am making publicly available a Web-based
> program called PDF2TXT -- http://bit.ly/1bJRyh8
> PDF2TXT extracts the text from an OCRed PDF document and then does some
> rudimentary "distant reading" against the text in the form of word
> clouds, readability scores, concordance features, and "maps"
> (histograms) illustrating where terms appear in a text.
> Here is the idea behind the application:
> 1. In the Libraries I see people scanning, scanning, and
> scanning. I suppose these people then go home and read the
> document. They might even print it. These documents are long.
> Moreover, I'll bet they have multiple documents.
> 2. Text mining requires digitized text, but PDF documents are
> frequently full of formatting. At the same time, they often
> have the text underneath. Our scanning software does OCR.
> 3. By extracting the text from PDF documents, I can facilitate
> a different -- additional -- type of analysis against sets of
> one or more documents. PDF2TXT is the first step in this
> What is really cool is that PDF2TXT works for many of the articles
> downloadable from the Libraries' article indexes. Search an article
> index. Download a full text, PDF version of the article. Feed it to
> PDF2TXT. Get more out of your article.
> PDF2TXT currently has "creeping featuritis" -- meaning that it is
> growing in weird directions. Your feedback is more than welcome. (I
> know. The output is ugly.) Also, please be gentle with it because it
> does not process things the size of the Bible.
> Eric Lease Morgan
> Digital Initiatives Librarian
> University of Notre Dame
> Room 131, Hesburgh Libraries
> Notre Dame, IN 46556
> o: 574-631-8604
> e: [log in to unmask]<mailto:[log in to unmask]>
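
The concordance and "map" (histogram) features described in the original announcement can be sketched in a few lines of Python. This is only an illustration of the general technique, operating on text that has already been extracted from a PDF; it is not PDF2TXT's actual code, and the function names and the simple tokenizer are assumptions made for the example.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def concordance(tokens, term, width=3):
    """Return each occurrence of term with up to `width` tokens of context per side."""
    term = term.lower()
    lines = []
    for i, token in enumerate(tokens):
        if token == term:
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            lines.append(" ".join(left + [token] + right))
    return lines


def term_map(tokens, term, bins=10):
    """Count occurrences of term in each of `bins` equal slices of the text."""
    term = term.lower()
    counts = [0] * bins
    if not tokens:
        return counts
    for i, token in enumerate(tokens):
        if token == term:
            # Integer arithmetic maps position i to a slice index in 0..bins-1.
            counts[i * bins // len(tokens)] += 1
    return counts


if __name__ == "__main__":
    sample = "The library scans reports. The reports are long."
    words = tokenize(sample)
    for line in concordance(words, "reports", width=1):
        print(line)
    print(term_map(words, "reports", bins=2))
```

Printing the counts from term_map as rows of asterisks, one row per slice, gives a crude text histogram of where a term appears in the document.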