Dear Eric,
Thanks for this.
As a solo librarian in a small special library in an Australian State
Government Department, I use DB/TextWorks, which has a feature for
importing documents so that the full text can be read. However, it only
imports the full text; it does not do what you have done, which is
really great. I
wrote a small piece (see attached) explaining what I am in the process
of doing. I am using the library catalogue records as metadata, but I
am hoping for something more. I really do want to open up the
collection and make the information discoverable beyond just the
library catalogue. I had contacted Jaume Nualart, who wrote a paper on
a way to present terms, called Texty:
http://informationr.net/ir/18-2/paper581.html However, it is not a piece
of software. I am quite interested in what you have done. I am just
trying to work out a way to show relevancy, and this may be something I
could integrate into the library catalogue.
I hope you can take the time to reply to me.
Thank you
Penelope Campbell | Library Manager
Department of Family and Community Services | Housing NSW
T 02 8753 8732 | F 02 8753 8734
A Ground Floor, 223-239 Liverpool Road Ashfield NSW, 2131
A Locked bag 4001 Ashfield BC NSW, 1800
E [log in to unmask]
W www.housing.nsw.gov.au
-----Original Message-----
From: Code for Libraries [mailto:[log in to unmask]] On Behalf Of
Eric Lease Morgan
Sent: Saturday, 12 October 2013 2:16 AM
To: [log in to unmask]
Subject: [CODE4LIB] pdf2txt
For a limited period of time I am making publicly available a Web-based
program called PDF2TXT -- http://bit.ly/1bJRyh8
PDF2TXT extracts the text from an OCRed PDF document and then does some
rudimentary "distant reading" against the text in the form of word
clouds, readability scores, concordance features, and "maps"
(histograms) illustrating where terms appear in a text.
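As a rough illustration of the concordance and "map" features just
described, here is a minimal sketch in Python. It is not PDF2TXT's
actual code; the function names (tokenize, concordance, term_map) and
the choice of ten bins are assumptions made only for this example, and
the text is assumed to have been extracted from the PDF already.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def concordance(text, term, width=3):
    """List each occurrence of term with `width` words of context per side."""
    words = tokenize(text)
    term = term.lower()
    lines = []
    for i, word in enumerate(words):
        if word == term:
            left = words[max(0, i - width):i]
            right = words[i + 1:i + 1 + width]
            # Upper-case the matched term so it stands out in the output.
            lines.append(" ".join(left + [word.upper()] + right))
    return lines

def term_map(text, term, bins=10):
    """Count occurrences of term in each of `bins` equal slices of the text."""
    words = tokenize(text)
    term = term.lower()
    counts = [0] * bins
    for i, word in enumerate(words):
        if word == term:
            # Map the word position onto a bin index from 0 to bins - 1.
            counts[i * bins // len(words)] += 1
    return counts
```

Printing the list returned by term_map as rows of asterisks gives a
crude histogram of where a term clusters in a document.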
Here is the idea behind the application:
1. In the Libraries I see people scanning, scanning, and
scanning. I suppose these people then go home and read the
document. They might even print it. These documents are long.
Moreover, I'll bet they have multiple documents.
2. Text mining requires digitized text, but PDF documents are
frequently full of formatting. At the same time, they often
have the text underneath. Our scanning software does OCR.
3. By extracting the text from PDF documents, I can facilitate
a different -- additional -- type of analysis against sets of
one or more documents. PDF2TXT is the first step in this
process.
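To make the third point concrete, the sketch below counts the most
frequent words across a set of already-extracted texts, which is the
kind of tally a word cloud is drawn from. It is an illustration only,
not how PDF2TXT is implemented; the name word_frequencies and the tiny
stop-word list are assumptions made for this example.

```python
import re
from collections import Counter

# A deliberately tiny stop-word list, assumed here for illustration only.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "that"}

def word_frequencies(texts, top=5):
    """Return the `top` most common non-stop-words across all texts."""
    counts = Counter()
    for text in texts:
        words = re.findall(r"[a-z']+", text.lower())
        counts.update(w for w in words if w not in STOPWORDS)
    return counts.most_common(top)
```

Passing a single-item list gives the counts for one document; passing
several gives a combined view of the whole set.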
What is really cool is that PDF2TXT works for many of the articles
downloadable from the Libraries' article indexes. Search an article
index. Download a full-text PDF version of the article. Feed it to
PDF2TXT. Get more out of your article.
PDF2TXT currently has "creeping featuritis" -- meaning that it is
growing in weird directions. Your feedback is more than welcome. (I
know. The output is ugly.) Also, please be gentle with it because it
does not process things the size of the Bible.
--
Eric Lease Morgan
Digital Initiatives Librarian
University of Notre Dame
Room 131, Hesburgh Libraries
Notre Dame, IN 46556
o: 574-631-8604
e: [log in to unmask]