On Jul 1, 2014, at 9:12 AM, Katie <[log in to unmask]> wrote:

> Does anyone here have experience in the world of natural language processing (while applying information retrieval techniques)? 
> 
> I'm currently trying to develop a tool that will:
> 
>   1. take a pdf and extract the text (paying no attention to images or formatting)
>   2. analyze the text via term weighting, inverse document frequency, and other natural language processing techniques
>   3. assemble a list of suggested terms and concepts that are weighted heavily in that document
> 
> Step 1 is straightforward and I've had much success there. Step 2 is the problem child. I've played around with a few APIs (like AlchemyAPI) but they have character length limitations or other shortcomings that keep me looking. 
> 
> The background behind this project is that I work for a digital library with a large pre-existing collection of pdfs with rudimentary metadata. The aforementioned tool will be used to classify and group the pdfs according to the themes of the library. Our CMS is Drupal so depending on my level of ambition, this *might* develop into a module.  
> 
> Does this sound like a project that has been done/attempted before? Any suggested tools or reading materials?


You have, more or less, just described my job. Increasingly, I:

  * create or am given a list of citations
  * save the citations as a computer-readable list (database)
  * harvest the full text of each cited item
  * extract the plain text from the harvested PDF file
  * clean up / post-process the text, maybe
  * do analysis against individual texts or the entire corpus
  * provide interfaces to “read” the corpus from “a distance”
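
For what it is worth, a minimal Python sketch of the harvest-and-extract steps might look like the following. It assumes the requests library and the pdftotext command-line tool (from Poppler) are installed, and the URL and file names are only placeholders:

  # harvest a PDF and extract its plain text
  import subprocess
  import requests

  url = 'http://example.org/article.pdf'            # placeholder

  # harvest the full text of a cited item
  response = requests.get(url)
  with open('article.pdf', 'wb') as handle:
      handle.write(response.content)

  # extract the plain text, ignoring images and layout
  subprocess.run(['pdftotext', 'article.pdf', 'article.txt'], check=True)

  # rudimentary clean-up / post-processing
  with open('article.txt') as handle:
      text = ' '.join(handle.read().split())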

The analysis is akin to descriptive statistics but for “bags of words”. I create lists of both frequently used and statistically significant words/phrases. I do parts-of-speech (POS) analysis and create lists of nouns, verbs, adjectives, etc. I then create more lists of the frequently used and significant POS. I sometimes do sentiment analysis (alternatively called “opinion mining”) against the corpus. Sometimes I index the whole lot and provide a search interface. Through named-entity extraction I pull out names of people, places, and things. The meaning of these things can then be elaborated upon through Wikipedia look-ups. The dates can then be plotted on a timeline. I’m beginning to get into classification and clustering, but I haven’t seen any really exciting things come out of topic modeling, yet. Through all of these processes, I am able to supplement the original lists of citations with value-added services. What I’m weak at is the visualizations. 
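
By way of illustration only, the frequency, POS, and named-entity steps can be approximated with Python’s NLTK toolkit (not my actual Perl code); the file name is a placeholder, and the usual NLTK data packages (tokenizer, tagger, and chunker models) are assumed to have been downloaded:

  # frequency, part-of-speech, and named-entity sketch using NLTK
  import nltk

  text   = open('article.txt').read()
  tokens = nltk.word_tokenize(text)

  # frequently used words ("bags of words" descriptive statistics)
  words = [t.lower() for t in tokens if t.isalpha()]
  print(nltk.FreqDist(words).most_common(25))

  # parts-of-speech; keep and count only the nouns
  tagged = nltk.pos_tag(tokens)
  nouns  = [word for word, tag in tagged if tag.startswith('NN')]
  print(nltk.FreqDist(nouns).most_common(25))

  # named entities: pull out the names of people
  entities = nltk.ne_chunk(tagged)
  people   = [' '.join(word for word, tag in subtree.leaves())
              for subtree in entities.subtrees()
              if subtree.label() == 'PERSON']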

Example projects have included:

  * harvesting “poverty tourism” websites, and learning
    how & why people are convinced to visit slums

  * collecting as many articles from the history of science
    literature as possible, and analyzing how the use of
    the word “practice” has changed over time

  * similarly, collecting as many articles as possible from the
    business section of the New York Times to determine how the
    words “tariff” and “trade” have changed over time

  * analyzing how people’s perceptions of culture have
    changed based on pre- and post-descriptions of China

  * collecting and analyzing the transcripts of trials during
    the 17th century to see how religion affected commerce

  * finding the common themes in a set of 4th century Catholic
    hymns

  * looking for alternative genres in a corpus of mediaeval
    literature

Trying to determine the significant words of a single document in isolation is difficult. Instead, it is much easier to identify a set of significant words for a single document when the document is part of a corpus. There seem to be never-ending and ever-subtler differences in how to do this, but exploiting TF/IDF is probably one of the more common approaches. [1] Consider also using the cosine similarity measure to compare documents for “sameness”. [2] The folks at Stanford have a very nice suite of natural language processors. [3] I have also created a tiny library of routines and corresponding programs (albeit written in Perl) that do much of this work from the command line of my desktop computer. [4]
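
As a sketch of the TF/IDF and cosine similarity ideas, here is roughly the same thing done with Python’s scikit-learn rather than my Perl routines; the file names are placeholders:

  # TF/IDF weighting and cosine similarity across a small corpus
  from sklearn.feature_extraction.text import TfidfVectorizer
  from sklearn.metrics.pairwise import cosine_similarity

  # each file is one document in the corpus
  files  = ['article-01.txt', 'article-02.txt', 'article-03.txt']
  corpus = [open(f).read() for f in files]

  # words weighted heavily in one document but rare in the corpus
  # as a whole receive the highest TF/IDF scores
  vectorizer = TfidfVectorizer(stop_words='english')
  matrix     = vectorizer.fit_transform(corpus)
  terms      = vectorizer.get_feature_names_out()

  # list the ten most significant terms of the first document
  scores = matrix[0].toarray().flatten()
  print([terms[i] for i in scores.argsort()[::-1][:10]])

  # compare the documents to each other for "sameness"
  print(cosine_similarity(matrix))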

[1] TF/IDF - http://en.wikipedia.org/wiki/Tf–idf
[2] similarity - http://en.wikipedia.org/wiki/Cosine_similarity
[3] Stanford tools - http://www-nlp.stanford.edu
[4] tiny library - https://github.com/ericleasemorgan/Tiny-Text-Mining-Tools

—
Eric “Librarians Love Lists” Morgan