Eric Lease Morgan, you are awesome. Thank you so much for your enlightening explanation, with source code, even! code4lib is the BEST!

Bess

On Oct 1, 2012, at 7:39 PM, Eric Lease Morgan wrote:

> On Oct 1, 2012, at 7:55 PM, Bess Sadler wrote:
>
>> For a full-text search system we're prototyping, we are being asked to provide term co-occurrence analysis. I'm not very familiar with this concept, so maybe someone on the list can describe it better, but I believe that what is wanted is to be able to query a text corpus for a given word, and to receive in return a list of words that co-occur with the search term, along with some indication of how often those words co-occur. Something like this IBM Many Eyes demo: http://ibm.co/PT5N59 (but we're not necessarily looking for a visualization, just a way to do the query).
>
> Interesting, but alas, I do not have a Solr recipe.
>
> Yes, co-occurrence -- along with ngram -- are terms used to denote and infer the distance between different tokens (usually words) in a text. Successful phrase searching assumes some sort of underlying co-occurrence algorithm because the indexing process assigns positions to its indexed tokens (words).
>
> An algorithm for creating a list of co-occurrences is almost trivial:
>
>   1. read text
>   2. parse text into individual tokens (words)
>   3. read a token and the next token
>   4. update a list (hash or "associative array") with the pair of tokens
>   5. update a list of the number of times this particular pair of tokens exists
>   6. go to step #3 for each token
>   7. done
>
> If one wants to read more than bigrams (two-word phrases), then change Step #3 to include the next token and the token after that one -- trigrams.
>
> Given a particular token, listing the co-occurrences of that token and other words is "simply" a matter of searching the list for the token and returning it and its co-occurrences.
>
> How to do this in Solr? After re-reading the documentation I believe the ord function may be of some use, but I'm not sure:
>
>   ord(myfield) returns the ordinal of the indexed field value
>   within the indexed list of terms for that field in lucene index
>   order (lexicographically ordered by unicode value), starting at
>   1. In other words, for a given field, all values are ordered
>   lexicographically; this function then returns the offset of a
>   particular value in that ordering. The field must have a maximum
>   of one value per document (not multiValued). 0 is returned for
>   documents without a value in the field.
>
>   * Example: If there were only three values for a particular field:
>     "apple","banana","pear", then ord("apple")=1, ord("banana")=2,
>     ord("pear")=3
>   * Example Syntax: ord(myIndexedField)
>   * Example SolrQuerySyntax: _val_:"ord(myIndexedField)"
>
>   http://bit.ly/SiL8eM
>
> I have a hammer (Perl) and everything to me looks like a nail. Consequently I would use a module I wrote to do this sort of thing -- http://bit.ly/bgmhXM -- specifically the ngram method.
>
> --
> HTH, ELM
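
For anyone following along, the seven steps quoted above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not ELM's Perl module and not a Solr recipe: the function names (tokenize, count_ngrams, cooccurrences) are hypothetical, tokenization is a naive lowercase regex, and the whole text is assumed to fit in memory.

```python
import re
from collections import Counter


def tokenize(text):
    """Step 2: split text into lowercase word tokens (naive regex, an assumption)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def count_ngrams(tokens, n=2):
    """Steps 3-6: slide a window of n tokens over the text and tally each n-gram.

    n=2 gives bigrams; n=3 gives the trigram variant mentioned in the message.
    """
    return Counter(zip(*(tokens[i:] for i in range(n))))


def cooccurrences(counts, term):
    """Tally the words that share an n-gram with term, most frequent first."""
    neighbours = Counter()
    for ngram, count in counts.items():
        if term in ngram:
            for word in ngram:
                if word != term:
                    neighbours[word] += count
    return neighbours.most_common()


if __name__ == "__main__":
    text = "the cat sat on the mat and the cat slept"
    bigrams = count_ngrams(tokenize(text))
    print(cooccurrences(bigrams, "cat"))
```

With the sample sentence, "cat" is paired with "the" twice and with "sat" and "slept" once each. A Counter keyed by token tuples plays the role of the hash in steps 4 and 5; looking up a term is a linear scan here, which is fine for a sketch but would want an index keyed by token for a real corpus.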