Eric Lease Morgan, you are awesome. Thank you so much for your enlightening explanation, with source code, even!
code4lib is the BEST!
Bess
On Oct 1, 2012, at 7:39 PM, Eric Lease Morgan wrote:
> On Oct 1, 2012, at 7:55 PM, Bess Sadler wrote:
>
>> For a full-text search system we're prototyping, we are being asked to provide term co-occurrence analysis. I'm not very familiar with this concept, so maybe someone on the list can describe it better, but I believe that what is wanted is to be able to query a text corpus for a given word, and to receive in return a list of words that co-occur with the search term, along with some indication of how often those words co-occur. Something like this IBM Many Eyes demo: http://ibm.co/PT5N59 (but we're not necessarily looking for a visualization, just a way to do the query).
>
>
> Interesting, but alas, I do not have a Solr recipe.
>
> Yes, co-occurrence -- along with ngram -- are terms used to denote and infer the distance between different tokens (usually words) in a text. Successful phrase searching assumes some sort of underlying co-occurrence algorithm because the indexing process assigns positions to its indexed tokens (words).
>
> An algorithm for creating a list of co-occurrences is almost trivial:
>
> 1. read text
>
> 2. parse text into individual tokens
> (words)
>
> 3. read a token and the next token
>
> 4. update a list (hash or "associative
> array") with the pair of tokens
>
> 5. update a list of the number of times
> this particular pair of tokens exist
>
> 6. go to step #3 for each token
>
> 7. done
>
> If one wants to read more than bigrams (two-word phrases), then change Step #3 to include the next token and the token after that one -- trigrams.
>
> Given a particular token, listing the co-occurrences of that token and other words is "simply" a matter of searching the list for the token and returning each pair that contains it, along with that pair's count.
>
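For anyone who wants to try the steps above outside of Perl, here is a minimal sketch in Python. It is only one reading of the algorithm described in the message, not the code from the module mentioned later. The function names (tokenize, count_ngrams, cooccurrences), the regular expression used for tokenizing, and the sample sentence are all assumptions made for illustration.

```python
from collections import Counter
import re


def tokenize(text):
    # Steps 1-2: lowercase the text and split it into word tokens.
    # The pattern is an assumption; real corpora may need a better tokenizer.
    return re.findall(r"[a-z0-9']+", text.lower())


def count_ngrams(tokens, n=2):
    # Steps 3-6: slide a window of n tokens over the list and tally each tuple.
    # n=2 gives bigrams, n=3 gives trigrams, and so on.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def cooccurrences(counts, term):
    # Search the tallies for n-grams containing `term` and add up how often
    # each other token appears alongside it.
    neighbours = Counter()
    for gram, freq in counts.items():
        if term in gram:
            for token in gram:
                if token != term:
                    neighbours[token] += freq
    return neighbours.most_common()


text = "the cat sat on the mat and the cat slept"  # made-up sample text
bigrams = count_ngrams(tokenize(text), n=2)
print(cooccurrences(bigrams, "cat"))
```

With n=2 this only counts immediately adjacent words. Raising n widens the window, at the cost of a larger table to search.
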
> How to do this in Solr? After re-reading the documentation I believe the ord function may be of some use, but I'm not sure:
>
> ord(myfield) returns the ordinal of the indexed field value
> within the indexed list of terms for that field in lucene index
> order (lexicographically ordered by unicode value), starting at
> 1. In other words, for a given field, all values are ordered
> lexicographically; this function then returns the offset of a
> particular value in that ordering. The field must have a maximum
> of one value per document (not multiValued). 0 is returned for
> documents without a value in the field.
>
> * Example: If there were only three values for a particular field:
> "apple","banana","pear", then ord("apple")=1, ord("banana")=2,
> ord("pear")=3
> * Example Syntax: ord(myIndexedField)
> * Example SolrQuerySyntax: _val_:"ord(myIndexedField)"
>
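To make the quoted description of ord concrete, here is a small imitation of the ordering in plain Python. It is not Solr code and calls no Solr API. The helper name ord_table, the "fruit" field, and the sample documents are hypothetical. It suggests that ord ranks whole field values rather than positions of words within a text, which may be why it is not an obvious fit for co-occurrence.

```python
def ord_table(values):
    # Hypothetical helper: rank each distinct value by its sorted order,
    # starting at 1. Python sorts strings by Unicode code point.
    return {value: rank for rank, value in enumerate(sorted(set(values)), start=1)}


# Made-up documents; the last one has no value in the "fruit" field.
docs = [
    {"id": 1, "fruit": "pear"},
    {"id": 2, "fruit": "apple"},
    {"id": 3, "fruit": "banana"},
    {"id": 4},
]

ranks = ord_table(d["fruit"] for d in docs if "fruit" in d)

# 0 stands in for documents without a value, as in the quoted description.
ords = [ranks.get(d.get("fruit"), 0) for d in docs]
print(ords)
```
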
> http://bit.ly/SiL8eM
>
> I have a hammer (Perl) and everything to me looks like a nail. Consequently I would use a module I wrote to do this sort of thing -- http://bit.ly/bgmhXM -- specifically the ngram method.
>
> --
> HTH, ELM