One of the difficulties with your surface analysis of Thoreau vs. Austen is that Thoreau wrote a memoir and Austen wrote fictional narrative. If the texts were available, it might be interesting to see how something like Bridget Jones compares. It will clearly have a lot of female third person in it, but it will also have a lot of first person.

On Tue, Feb 22, 2011 at 19:02, Eric Lease Morgan wrote:

> On Feb 22, 2011, at 9:02 AM, Cindy Harper wrote:
>
> > It's not ironic - my post was a musing inspired by your work. I guess I
> > wasn't sure if I understood your results. You were looking at the overall
> > POS usage in the entire texts as a possible way of ranking the texts. I was
> > wondering about POS of particular search terms - those that could take on
> > several POS....
>
> Initially I wanted to see if I could classify works based on their POS
> usage. [1] I was hoping to find lots of action verbs in one work and call it
> an action story. I was hoping to find lots of nouns in another story and
> call it... I don't know, something else. Instead, after rudimentary
> investigation, I discovered that all of the works I analyzed had the same
> relative percentage of nouns, pronouns, verbs, adverbs, adjectives, etc.
> Maybe such a thing is indicative of the English language.
>
> On the other hand, I did notice a difference in the use of particular
> pronouns between works. In Walden by Thoreau, a story about an individual
> living on the banks of a "pond", there was a lot of use of the word "I", but
> in a different story, where the author and his brother canoe down a river,
> the word "we" predominated. Similarly, three Jane Austen stories have many
> words like "she" and "her" where those words are less frequent in the works
> by Thoreau. While my analysis was trivial and thin, I think we might be able
> to classify some works by gender or speaking voice.
>
> Similar things may be possible with other parts-of-speech, like adjectives,
> specifically colors. For example, 214 of the 117,540 words in Walden (0.18%)
> are color words. [2] But only 13 of the 121,917 words in Pride and Prejudice
> (0.01%) are color words. [3] Despite the similar lengths of the works, Walden
> is about 17 times more "colorful" than Pride. Interesting? This only begs
> other questions. Is 0.18% a high value or a low value? Is the relative use of
> colors similar within a particular author or not? Has the use of color
> changed over time, or is it indicative of genre? Does the use of specific
> colors actually denote mood?
>
> In the past libraries did not have a whole lot of full text in order to
> evaluate content. That is not true nowadays. It is now possible to
> literally count and measure a book's characteristics. Since this metadata is
> numeric in nature, it lends itself to visualization. (Think Karen C's
> presentation at Code4Lib.) And this whole thing is good fodder for search,
> discovery, and evaluation. Too much of our metadata is qualitative.
>
>
> [1] forays into POS - http://bit.ly/aM2eZx
> [2] color words in Walden - http://t.co/hlg5ibL
> [3] color words in Pride - http://t.co/VflNf3n
>
> --
> Eric Lease Morgan
>
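
As a rough check on the color-word arithmetic quoted above, here is a minimal Python sketch. The counts (214 of 117,540 and 13 of 121,917) come from the quoted message; the COLOR_WORDS list and the color_share helper are assumptions for illustration only, since the thread does not say which words were counted or how the texts were tokenized.

```python
import re

# Hypothetical color list; the quoted message does not give the actual one.
COLOR_WORDS = {"red", "orange", "yellow", "green", "blue",
               "purple", "black", "white", "brown", "gray", "grey"}


def percentage(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


def color_share(text, colors=COLOR_WORDS):
    """Return (color_hits, total_words, percent) for a plain-text string.

    Tokenization here is a simple lowercase word match, which is an
    assumption and may differ from the counts reported in the thread.
    """
    words = re.findall(r"[a-z']+", text.lower())
    hits = sum(1 for word in words if word in colors)
    pct = percentage(hits, len(words)) if words else 0.0
    return hits, len(words), pct


# Counts taken from the quoted message.
walden_pct = percentage(214, 117_540)
pride_pct = percentage(13, 121_917)
ratio = walden_pct / pride_pct

print(f"Walden: {walden_pct:.2f}%  Pride: {pride_pct:.2f}%  ratio: {ratio:.1f}")
```

Working from the unrounded counts rather than the rounded percentages, the ratio comes out near 17. The color_share function could be pointed at any plain-text edition to try the same measurement on other titles, Bridget Jones included, if the text were available.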