LISTSERV 16.5 - CODE4LIB Archives

On Oct 1, 2012, at 7:55 PM, Bess Sadler wrote:

> For a full-text search system we're prototyping, we are being asked to provide term co-occurrence analysis. I'm not very familiar with this concept, so maybe someone on the list can describe it better, but I believe that what is wanted is to be able to query a text corpus for a given word, and to receive in return a list of words that co-occur with the search term, along with some indication of how often those words co-occur. Something like this IBM Many Eyes demo: http://ibm.co/PT5N59 (but we're not necessarily looking for a visualization, just a way to do the query).


Interesting, but alas, I do not have a Solr recipe.

Yes, co-occurrence -- along with ngram -- is a term used to denote and infer the distance between different tokens (usually words) in a text. Successful phrase searching assumes some sort of underlying co-occurrence algorithm because the indexing process assigns positions to its indexed tokens (words).

An algorithm for creating a list of co-occurrences is almost trivial:

  1. read text

  2. parse text into individual tokens
     (words)

  3. read a token and the next token

  4. update a list (hash or "associative
     array") with the pair of tokens

  5. update a list of the number of times
     this particular pair of tokens occurs

  6. go to step #3 for each token

  7. done
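
For illustration, the steps above can be sketched in a few lines of Python. This is only a minimal sketch under my own assumptions: the function names (tokenize, count_bigrams) are made up for the example, and the tokenizer is deliberately simplistic (lowercase, split on anything that is not a letter, digit, or apostrophe). A Counter keyed by the token pair covers steps #4 and #5 at once.

```python
import re
from collections import Counter

def tokenize(text):
    """Step 2: split text into lowercase word tokens (a simplistic assumption)."""
    return re.findall(r"[a-z0-9']+", text.lower())

def count_bigrams(text):
    """Steps 3 through 6: count every pair of adjacent tokens."""
    tokens = tokenize(text)
    # zip() pairs each token with the token that follows it
    return Counter(zip(tokens, tokens[1:]))

sample = "The cat sat on the mat, and the cat slept."
bigrams = count_bigrams(sample)

# List the pairs, most frequent first
for pair, count in bigrams.most_common():
    print(count, " ".join(pair))
```

A real corpus would want a better tokenizer (stop words, stemming, punctuation rules), but the counting itself does not change.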

If one wants to read more than bigrams (two-word phrases), then change Step #3 to include the next token and the token after that one -- trigrams.
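
A hedged sketch of that generalization in Python, assuming the text has already been split into a list of tokens. The function name count_ngrams and its parameter n are hypothetical, chosen just for this example; n=2 gives bigrams, n=3 gives trigrams, and so on.

```python
from collections import Counter

def count_ngrams(tokens, n=3):
    """Count every run of n consecutive tokens."""
    # Build n copies of the token list, each shifted one position further,
    # and zip them together so each tuple holds n adjacent tokens.
    shifted = [tokens[i:] for i in range(n)]
    return Counter(zip(*shifted))

tokens = "to be or not to be or not".split()
trigrams = count_ngrams(tokens, 3)

for gram, count in trigrams.most_common():
    print(count, " ".join(gram))
```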

Given a particular token, listing the co-occurrences of that token and other words is "simply" a matter of searching the list for the token and returning each word paired with it, along with the number of times the pair occurs.
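
One way to make that lookup cheap is to key the list by single token rather than by pair. The Python below is a sketch under my own assumptions, not a recipe from any library: build_index and cooccurrences are made-up names, only immediately adjacent tokens are counted, and a neighbor counts whether it comes before or after the search term.

```python
from collections import Counter, defaultdict

def build_index(tokens):
    """Map each token to a Counter of the tokens found right next to it."""
    index = defaultdict(Counter)
    for left, right in zip(tokens, tokens[1:]):
        index[left][right] += 1   # right-hand neighbor of 'left'
        index[right][left] += 1   # left-hand neighbor of 'right'
    return index

def cooccurrences(index, token):
    """Return (neighbor, count) pairs for token, most frequent first."""
    return index.get(token, Counter()).most_common()

tokens = "the cat sat on the mat and the cat slept".split()
index = build_index(tokens)

for word, count in cooccurrences(index, "cat"):
    print(word, count)
```

Widening the window (say, any word within five positions) is a matter of pairing each token with more than just its immediate neighbor.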

How to do this in Solr? After re-reading the documentation I believe the ord function may be of some use, but I'm not sure:

  ord(myfield) returns the ordinal of the indexed field value
  within the indexed list of terms for that field in lucene index
  order (lexicographically ordered by unicode value), starting at
  1. In other words, for a given field, all values are ordered
  lexicographically; this function then returns the offset of a
  particular value in that ordering. The field must have a maximum
  of one value per document (not multiValued). 0 is returned for
  documents without a value in the field.
  
    * Example: If there were only three values for a particular field:
      "apple","banana","pear", then ord("apple")=1, ord("banana")=2,
      ord("pear")=3
    * Example Syntax: ord(myIndexedField)
    * Example SolrQuerySyntax: _val_:"ord(myIndexedField)"

  http://bit.ly/SiL8eM
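
To make that definition concrete, here is a plain-Python imitation of the behavior described in the quoted text. It is not Solr code and calls no Solr API; ordinals and ord_of are names I made up, and Python's default string sort (by code point) stands in for the index ordering.

```python
def ordinals(values):
    """Map each distinct value to its 1-based position in sorted order."""
    return {value: i for i, value in enumerate(sorted(set(values)), start=1)}

# One field value per document; None marks a document with no value
field_values = ["pear", "apple", None, "banana", "apple"]
ords = ordinals(v for v in field_values if v is not None)

def ord_of(value):
    """Imitate the documented behavior: 0 when the document has no value."""
    return ords.get(value, 0)

for value in field_values:
    print(value, ord_of(value))
```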

I have a hammer (Perl) and everything to me looks like a nail. Consequently I would use a module I wrote to do this sort of thing -- http://bit.ly/bgmhXM -- specifically the ngram method.

-- 
HTH, ELM