LISTSERV 16.5 - CODE4LIB Archives

On Jul 22, 2017, at 8:47 AM, Andreas Orphanides <[log in to unmask]> wrote:

>> How might I write an SQL query to find all the descriptions of good kings
>> and bad kings in my database?
>> 
>> I have a table that looks something like this:
>> 
>>  key  word       part_of_speech  position
>>  ---  ---------  --------------  --------
>>  w1   The        article         1
>>  w2   righteous  adjective       2
>>  w3   king       noun            3
>>  w4   solved     verb            4
>>  w5   all        adjective       5
>>  w6   the        article         6
>>  w7   problems   noun            7
>>  w8   Your       noun            8
>>  w9   evil       adjective       9
>>  w10  king       noun            10
>>  w11  over       adverb          11
>>  w12  taxes      verb            12
>>  w13  us         pronoun         13
> 
> In a perfect world, your parser would include the referents for the
> adjectives, which would solve the problem lickety split, but I assume
> that's not an option here. Are you ok with only discovering adjective-noun
> pairs?...
> 
> Add a sequentially numbered autogen column that's pure integer. (Unless you
> have string manipulation -- in which case you could just work on the "key"
> column).
> 
> Then something like:
> 
> SELECT * FROM table WHERE autogen IN (SELECT (autogen-1) FROM table WHERE
> word = "king") AND part_of_speech=“adjective"


Resolved, and the secret sauce was the SQL qualifier IN.

To find all positions of the word “king”, I can do this:

  SELECT position FROM words WHERE word=“king”;

To find all the positions prior to the word “king”, I can do this:

  SELECT position-1 FROM words WHERE word=“king”;

I can then use the IN qualifier to use the results of the previous query as input for a subsequent query. Thus I can find all the words prior to the word “king”;

  SELECT word
  FROM words
  WHERE position IN (SELECT position-1 FROM words WHERE word=“king”);

I can then further limit my query to only list adjectives:

  SELECT word
  FROM words
  WHERE part_of_speach=“adjective”
  AND position IN (SELECT position-1 FROM words WHERE word=“king”);

I can then count and group the results as an ordered list:

  SELECT COUNT(word) AS c, word
  FROM words
  WHERE part_of_speach=“adjective”
  AND position IN (SELECT position-1 FROM words WHERE word=“king”)
  GROUP BY word
  ORDER BY c DESC;

Since my database includes the words from my entire corpus (60,000,000 words), I need to further limit my query to document identifiers. Moreover, my parts-of-speech are not as simple as “noun”, “verb”, and “adjective”. Without going into too much detail, my cool query ultimately looks like this:

  SELECT COUNT(word) AS c, word
  FROM words
  WHERE did='A05406' AND pos LIKE 'j%'
  AND (position IN (SELECT position-1 FROM words WHERE (word='king' AND did='A05406')))
  GROUP by word
  ORDER by c DESC;

In this particular case, kings are described as both earthly and good. 

I can then search my index for documents of interest, and then I can feed my SQL query an identifier as well as a few words. For example, a humorous document on the topic of love describes men and women: [1]

  SELECT COUNT(word) AS c, word FROM words WHERE did='A13520' AND
  pos LIKE 'j%' AND (position IN (SELECT position-1 FROM words
  WHERE((LOWER(word)='man' OR LOWER(word)='men' OR
  LOWER(word)='woman' OR LOWER(word)='women') AND did='A13520')))
  GROUP BY word ORDER BY c DESC;

  c  word
  =  =========
  5  old
  5  young
  4  poore
  3  good
  2  Old
  1  Good
  1  Great
  1  Rich
  1  Seruing
  1  best
  1  great
  1  labouring
  1  little
  1  painfull
  1  proper
  1  singing
  1  wise

Apparently, when it comes to love, age doesn’t matter, men & women are poor, and at the same time, they are not very wise.  ;-)

Fun with SQL. Thank you. Code4lib++

[1] humorous document - http://cds.crc.nd.edu/freebo/A13/A13520/A13520.htm

—
Eric Morgan