Eric,

Shlomo Argamon and Mark Olsen have done some related work on text classification.  You may have been at DHCS for their paper analyzing differences in word use by male and female authors, for example.[1]  There are bibliographies from the IIT Linguistic Cognition Laboratory and the ARTFL project which may give you some ideas for additional experiments. [2,3] 

-Tod

[1] http://digitalhumanities.org/dhq/vol/3/2/000042/000042.html
[2] http://lingcog.iit.edu/pub_year.xml
[3] http://code.google.com/p/philomine/wiki/Bibliography

Tod Olson
Systems Librarian
University of Chicago Library

On Feb 22, 2011, at 6:02 PM, Eric Lease Morgan wrote:

> On Feb 22, 2011, at 9:02 AM, Cindy Harper wrote:
> 
>> It's not ironic - my post was a musing inspired by your work.  I guess I wasn't sure I understood your results. You were looking at the overall POS usage in the entire texts as a possible way of ranking the texts. I was wondering about the POS of particular search terms - those that could take on several POS...
> 
> 
> Initially I wanted to see if I could classify works based on their POS usage. [1] I was hoping to find lots of action verbs in one work and call it an action story. I was hoping to find lots of nouns in another story and call it... I don't know, something else. Instead, after rudimentary investigation, I discovered that all of the works I analyzed had the same relative percentage of nouns, pronouns, verbs, adverbs, adjectives, etc. Maybe such a thing is indicative of the English language.
> 
> On the other hand, I did notice a difference in the use of particular pronouns between works. In Walden by Thoreau, a story about an individual living on the banks of a "pond", there was a lot of use of the word "I", but in a different story, where the author and his brother canoe down a river, the word "we" predominated. Similarly, three Jane Austen stories have many words like "she" and "her" where those words are less frequent in the works by Thoreau. While my analysis was trivial and thin, I think we might be able to classify some works by gender or speaking voice. 
> 
> Similar things may be possible with other parts-of-speech, like adjectives, specifically colors. For example, 214 of the 117,540 words in Walden (0.18%) are color words. [2] But only 13 of the 121,917 words in Pride and Prejudice (0.01%) are color words. [3] Despite the similar lengths of the works, Walden is roughly 17 times more "colorful" than Pride. Interesting? This only raises other questions. Is 0.18% a high value or a low value? Is the relative use of colors similar within a particular author or not? Has the use of color changed over time, or is it indicative of genre? Does the use of specific colors actually denote mood?
> 
> In the past, libraries did not have a whole lot of full text with which to evaluate content. That is not true nowadays. It is now possible to literally count and measure a book's characteristics. Since this metadata is numeric in nature, it lends itself to visualization. (Think Karen C's presentation at Code4Lib.) And this whole thing is good fodder for search, discovery, and evaluation. Too much of our metadata is qualitative.
> 
> 
> [1] forays into POS - http://bit.ly/aM2eZx
> [2] color words in Walden - http://t.co/hlg5ibL
> [3] color words in Pride - http://t.co/VflNf3n
> 
> -- 
> Eric Lease Morgan
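
For anyone who wants to try the kind of counting Eric describes, here is a minimal sketch in Python. It is not his code: the tokenizer, the function name word_set_share, and the COLOR_WORDS list are all assumptions made for illustration, since the thread does not say which color words were counted or how the texts were split into words. The last lines recompute the Walden-to-Pride ratio from the raw counts quoted above rather than from the rounded percentages.

```python
import re
from collections import Counter

# Hypothetical word list: the thread does not say which color words were counted.
COLOR_WORDS = {"red", "orange", "yellow", "green", "blue", "purple",
               "brown", "black", "white", "gray", "grey"}

def tokenize(text):
    """Lowercase the text and split it into runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())

def word_set_share(text, word_set):
    """Return (matches, total, share) for the words of `text` found in `word_set`."""
    tokens = tokenize(text)
    counts = Counter(tokens)
    matches = sum(counts[w] for w in word_set)
    total = len(tokens)
    share = matches / total if total else 0.0
    return matches, total, share

# A made-up sample sentence, standing in for a full text.
sample = "The blue pond lay under a white sky, and the green woods stood around it."
matches, total, share = word_set_share(sample, COLOR_WORDS)
print(f"{matches} of {total} words ({share:.2%}) are color words")

# Ratio computed from the raw counts quoted in the thread.
walden_share = 214 / 117540
pride_share = 13 / 121917
print(f"Walden/Pride ratio: {walden_share / pride_share:.1f}")
```

The same function could be pointed at a pronoun set such as {"i", "we", "she", "her"} to compare speaking voice between works. Note that a simple word list like this ignores part of speech, so it would count "orange" the fruit as a color.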