LISTSERV 16.5 - CODE4LIB Archives

Eric Lease Morgan, you are awesome. Thank you so much for your enlightening explanation, with source code, even! 

code4lib is the BEST!

Bess

On Oct 1, 2012, at 7:39 PM, Eric Lease Morgan wrote:

> On Oct 1, 2012, at 7:55 PM, Bess Sadler wrote:
> 
>> For a full-text search system we're prototyping, we are being asked to provide term co-occurrence analysis. I'm not very familiar with this concept, so maybe someone on the list can describe it better, but I believe that what is wanted is to be able to query a text corpus for a given word, and to receive in return a list of words that co-occur with the search term, along with some indication of how often those words co-occur. Something like this IBM Many Eyes demo: http://ibm.co/PT5N59 (but we're not necessarily looking for a visualization, just a way to do the query).
> 
> 
> Interesting, but alas, I do not have a Solr recipe.
> 
> Yes, co-occurrence and ngram are both terms used to denote and infer the distance between different tokens (usually words) in a text. Successful phrase searching assumes some sort of underlying co-occurrence algorithm because the indexing process assigns positions to its indexed tokens (words).
> 
> An algorithm for creating a list of co-occurrences is almost trivial:
> 
>  1. read text
> 
>  2. parse text into individual tokens
>     (words)
> 
>  3. read a token and the next token
> 
>  4. update a list (hash or "associative
>     array") with the pair of tokens
> 
>  5. update a list of the number of times
>     this particular pair of tokens occurs
> 
>  6. go to step #3 for each token
> 
>  7. done
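
A minimal sketch of the seven steps above in Python rather than Perl. This is not the module Eric links to; the function names `tokenize` and `count_bigrams` and the naive tokenizing rule are assumptions made for illustration.

```python
import re
from collections import Counter

def tokenize(text):
    # Step 2: lowercase the text and split it into word tokens.
    # The regular expression is a deliberately naive, assumed rule.
    return re.findall(r"[a-z0-9']+", text.lower())

def count_bigrams(text):
    tokens = tokenize(text)
    # Steps 3-6: pair each token with the one after it and tally the pairs.
    # Counter plays the role of the hash / associative array.
    return Counter(zip(tokens, tokens[1:]))

counts = count_bigrams("the cat sat on the mat and the cat slept")
print(counts.most_common(1))  # [(('the', 'cat'), 2)]
```

A single `Counter` covers steps 4 and 5 at once, since the pair is the key and its tally is the value.
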
> 
> If one wants to read more than bigrams (two-word phrases), then change Step #3 to include the next token and the token after that one -- trigrams.
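
The change to Step #3 might look like the sketch below, where the window size `n` is a parameter. The name `count_ngrams` is hypothetical, and the tokens are assumed to be already parsed into a list.

```python
from collections import Counter

def count_ngrams(tokens, n=2):
    # Slide a window of n tokens across the list:
    # n=2 yields bigrams, n=3 yields trigrams, and so on.
    windows = zip(*(tokens[i:] for i in range(n)))
    return Counter(windows)

tokens = "to be or not to be".split()
bigrams = count_ngrams(tokens, n=2)
trigrams = count_ngrams(tokens, n=3)
print(bigrams[("to", "be")])  # 2
print(len(trigrams))          # 4
```
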
> 
> Given a particular token, listing the co-occurrences of that token and other words is "simply" a matter of searching the list for the token and returning it and its co-occurrences.
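
The lookup described above could be sketched as follows, assuming the pair counts are held in a Python `Counter` keyed by (left, right) tuples. The function name `cooccurrences` is hypothetical, and treating a neighbour on either side as a co-occurrence is an assumption of this sketch.

```python
from collections import Counter

def cooccurrences(pair_counts, term):
    # Gather every neighbour of `term`, whether the term is the
    # first or the second member of a stored pair.
    neighbours = Counter()
    for (left, right), n in pair_counts.items():
        if left == term:
            neighbours[right] += n
        elif right == term:
            neighbours[left] += n
    # Most frequent neighbours first.
    return neighbours.most_common()

tokens = "the cat sat on the mat and the cat slept".split()
pair_counts = Counter(zip(tokens, tokens[1:]))
result = cooccurrences(pair_counts, "cat")
print(result[0])  # ('the', 2)
```

Scanning every pair is fine for a prototype; a larger corpus would likely want an index keyed by token instead.
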
> 
> How to do this in Solr? After re-reading the documentation I believe the ord function may be of some use, but I'm not sure:
> 
>  ord(myfield) returns the ordinal of the indexed field value
>  within the indexed list of terms for that field in lucene index
>  order (lexicographically ordered by unicode value), starting at
>  1. In other words, for a given field, all values are ordered
>  lexicographically; this function then returns the offset of a
>  particular value in that ordering. The field must have a maximum
>  of one value per document (not multiValued). 0 is returned for
>  documents without a value in the field.
> 
>    * Example: If there were only three values for a particular field:
>      "apple","banana","pear", then ord("apple")=1, ord("banana")=2,
>      ord("pear")=3
>    * Example Syntax: ord(myIndexedField)
>    * Example SolrQuerySyntax: _val_:"ord(myIndexedField)"
> 
>  http://bit.ly/SiL8eM
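
To make the quoted description concrete, here is a plain-Python imitation of the ordering it describes. It does not call Solr, and `ordinals` is a hypothetical helper.

```python
def ordinals(values):
    # Sort the distinct values by Unicode code point, which is
    # Python's default ordering for strings.
    ordered = sorted(set(values))
    # Number from 1, because the quoted documentation reserves 0
    # for documents that have no value in the field.
    return {value: position for position, value in enumerate(ordered, start=1)}

ords = ordinals(["pear", "apple", "banana", "apple"])
print(ords)  # {'apple': 1, 'banana': 2, 'pear': 3}
```

As described, the ordinal reflects where a whole field value falls in sorted order, not where a word sits within a text, which may explain the uncertainty about using it for co-occurrence.
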
> 
> I have a hammer (Perl) and everything to me looks like a nail. Consequently I would use a module I wrote to do this sort of thing -- http://bit.ly/bgmhXM -- specifically the ngram method.
> 
> -- 
> HTH, ELM