Thanks! I saw this post earlier, but was curious if there were other solutions for this problem.
I guess I need to dig further into other implementations of the class Similarity.
________________________________________
From: Code for Libraries [[log in to unmask]] On Behalf Of Chris Fitzpatrick [[log in to unmask]]
Sent: Wednesday, September 25, 2013 7:57 PM
To: [log in to unmask]
Subject: Re: [CODE4LIB] solr computation field norm problem

Yeah...I think you're running into this:

http://lucene.472066.n3.nabble.com/field-length-normalization-tp495308p495311.html

TL;DR:
Jay Hill says fields with 3 terms and 4 terms both score at .5 in the
lengthNorm.
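
To make the collision concrete, here is a rough Python port of the one-byte norm encoding. It assumes the index uses Lucene's DefaultSimilarity, where lengthNorm is 1/sqrt(numTerms) with no index-time boost and the result is stored via SmallFloat.floatToByte315. The Python function names are made up for this sketch, and it ignores any extra tokens the synonym filter might add.

```python
import math
import struct


def float_to_byte315(f):
    """Sketch of Lucene's SmallFloat.floatToByte315 (returns an unsigned 0..255)."""
    # Reinterpret the 32-bit float as a signed int, as Float.floatToRawIntBits does.
    bits = struct.unpack(">i", struct.pack(">f", f))[0]
    smallfloat = bits >> 21          # drop the low 21 bits of the float
    fzero = (63 - 15) << 3           # smallest exponent the byte can represent
    if smallfloat <= fzero:
        return 0 if bits <= 0 else 1  # zero/negative -> 0, underflow -> smallest value
    if smallfloat >= fzero + 0x100:
        return 255                    # overflow -> largest value
    return smallfloat - fzero


def byte315_to_float(b):
    """Sketch of Lucene's SmallFloat.byte315ToFloat."""
    if b == 0:
        return 0.0
    bits = ((b & 0xFF) << 21) + ((63 - 15) << 24)
    return struct.unpack(">f", struct.pack(">i", bits))[0]


def stored_norm(num_terms):
    """lengthNorm as it comes back out of the index (assumes boost = 1.0)."""
    exact = 1.0 / math.sqrt(num_terms)
    return byte315_to_float(float_to_byte315(exact))


if __name__ == "__main__":
    for n in range(1, 6):
        print(n, 1.0 / math.sqrt(n), stored_norm(n))
```

Under those assumptions a 3-term title (exact norm about 0.577) and a 4-term title (exact norm 0.5) both decode to 0.5, which matches the fieldNorm in the explain output quoted below.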







On Wed, Sep 25, 2013 at 4:21 PM, Nicolas Franck <[log in to unmask]> wrote:

> Hi there,
>
> I have a question about the way Lucene computes the length norm (the field
> norm) of its documents.
> My documents are indexed using Solr.
> These are the documents that were indexed (ignore 'score', which is not
> part of the document itself):
>
> <doc>
>   <float name="score">1.00711</float>
>   <str name="_id">ejn01:2560000000075596</str>
>   <str name="title">Journal of neurology research</str>
> </doc>
> <doc>
>   <float name="score">1.00711</float>
>   <str name="_id">ejn01:954925518616</str>
>   <str name="title">Journal of neurology</str>
> </doc>
>
>
> The field "title" has the following definition in schema.xml:
>
> <fieldType name="utf8text" class="solr.TextField"
> positionIncrementGap="100" omitNorms="false">
>   <analyzer type="index">
>     <tokenizer class="solr.StandardTokenizerFactory"
> maxTokenLength="1024"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.ASCIIFoldingFilterFactory"/>
>     <filter class="solr.SynonymFilterFactory"
> synonyms="index_synonyms.txt" format="solr" ignoreCase="false"
> expand="true" tokenizerFactory="solr.WhitespaceTokenizerFactory"/>
>   </analyzer>
>   <analyzer type="query">
>     <tokenizer class="solr.StandardTokenizerFactory"
> maxTokenLength="1024"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.ASCIIFoldingFilterFactory"/>
>     <filter class="solr.SynonymFilterFactory"
> synonyms="index_synonyms.txt" format="solr" ignoreCase="false"
> expand="true" tokenizerFactory="solr.WhitespaceTokenizerFactory"/>
>   </analyzer>
> </fieldType>
>
>
> If I use the query "journal of neurology", both documents have the same
> score, although the second document is more exact. Supplying a phrase query
> does not fix the issue. I also see that the computed fieldNorm is "0.5" for
> both documents. Does this have something to do with the loss of precision
> when storing the length norm into one byte?
>
> These are all the supplied parameters (defaults in solrconfig.xml):
>
> <str name="lowercaseOperators">false</str>
> <str name="mm">-10%</str>
> <str name="pf">author^3 title^2</str>
> <str name="sort">score desc</str>
> <arr name="bq">
>   <str>source:ser01^10</str>
>   <str>source:ejn01^10</str>
>  <str>(*:* -type:article)^999</str>
> </arr>
> <str name="echoParams">all</str>
> <str name="df">all</str>
> <str name="tie">0</str>
> <str name="qf">
> author^15 title^10 subject^1 summary^1 library^1 location^1 publisher^1
> place_published^1 issn^1 isbn^1
> </str>
> <str name="q.alt">*:*</str>
> <str name="ps">2</str>
> <str name="defType">edismax</str>
> <str name="q">journal of neurology</str>
> <str name="echoParams">all</str>
> <str name="sort">score desc</str>
>
> Looking at the computation of the score, I see no difference at all between
> them (see below).
> Any idea why the fieldNorm is the same for both documents?
>
>
> Thanks in advance!
>
> Greetings,
>
> Nicolas
>
>
>
>
> <str name="ejn01:2560000000075596">
> 1.0071099 = (MATCH) sum of:
>   0.0053001107 = (MATCH) sum of:
>     0.0017667036 = (MATCH) max of:
>       0.0017667036 = (MATCH) weight(title:journal^10.0 in 0), product of:
>         0.005943145 = queryWeight(title:journal^10.0), product of:
>           10.0 = boost
>           0.5945349 = idf(docFreq=2, maxDocs=2)
>           9.996294E-4 = queryNorm
>         0.29726744 = (MATCH) fieldWeight(title:journal in 0), product of:
>           1.0 = tf(termFreq(title:journal)=1)
>           0.5945349 = idf(docFreq=2, maxDocs=2)
>           0.5 = fieldNorm(field=title, doc=0)
>     0.0017667036 = (MATCH) max of:
>       0.0017667036 = (MATCH) weight(title:of^10.0 in 0), product of:
>         0.005943145 = queryWeight(title:of^10.0), product of:
>           10.0 = boost
>           0.5945349 = idf(docFreq=2, maxDocs=2)
>           9.996294E-4 = queryNorm
>         0.29726744 = (MATCH) fieldWeight(title:of in 0), product of:
>           1.0 = tf(termFreq(title:of)=1)
>           0.5945349 = idf(docFreq=2, maxDocs=2)
>           0.5 = fieldNorm(field=title, doc=0)
>     0.0017667036 = (MATCH) max of:
>       0.0017667036 = (MATCH) weight(title:neurology^10.0 in 0), product of:
>         0.005943145 = queryWeight(title:neurology^10.0), product of:
>           10.0 = boost
>           0.5945349 = idf(docFreq=2, maxDocs=2)
>           9.996294E-4 = queryNorm
>         0.29726744 = (MATCH) fieldWeight(title:neurology in 0), product of:
>           1.0 = tf(termFreq(title:neurology)=1)
>           0.5945349 = idf(docFreq=2, maxDocs=2)
>           0.5 = fieldNorm(field=title, doc=0)
>   0.0031800664 = (MATCH) max of:
>     0.0031800664 = (MATCH) weight(title:"journal of neurology"~2^2.0 in 0), product of:
>       0.0035658872 = queryWeight(title:"journal of neurology"~2^2.0), product of:
>         2.0 = boost
>         1.7836046 = idf(title: journal=2 of=2 neurology=2)
>         9.996294E-4 = queryNorm
>       0.8918023 = fieldWeight(title:"journal of neurology" in 0), product of:
>         1.0 = tf(phraseFreq=1.0)
>         1.7836046 = idf(title: journal=2 of=2 neurology=2)
>         0.5 = fieldNorm(field=title, doc=0)
>   0.99862975 = (MATCH) sum of:
>     0.99862975 = (MATCH) MatchAllDocsQuery, product of:
>       0.99862975 = queryNorm
> </str>
> <str name="ejn01:954925518616">
> 1.0071099 = (MATCH) sum of:
>   0.0053001107 = (MATCH) sum of:
>     0.0017667036 = (MATCH) max of:
>       0.0017667036 = (MATCH) weight(title:journal^10.0 in 1), product of:
>         0.005943145 = queryWeight(title:journal^10.0), product of:
>           10.0 = boost
>           0.5945349 = idf(docFreq=2, maxDocs=2)
>           9.996294E-4 = queryNorm
>         0.29726744 = (MATCH) fieldWeight(title:journal in 1), product of:
>           1.0 = tf(termFreq(title:journal)=1)
>           0.5945349 = idf(docFreq=2, maxDocs=2)
>           0.5 = fieldNorm(field=title, doc=1)
>     0.0017667036 = (MATCH) max of:
>       0.0017667036 = (MATCH) weight(title:of^10.0 in 1), product of:
>         0.005943145 = queryWeight(title:of^10.0), product of:
>           10.0 = boost
>           0.5945349 = idf(docFreq=2, maxDocs=2)
>           9.996294E-4 = queryNorm
>         0.29726744 = (MATCH) fieldWeight(title:of in 1), product of:
>           1.0 = tf(termFreq(title:of)=1)
>           0.5945349 = idf(docFreq=2, maxDocs=2)
>           0.5 = fieldNorm(field=title, doc=1)
>     0.0017667036 = (MATCH) max of:
>       0.0017667036 = (MATCH) weight(title:neurology^10.0 in 1), product of:
>         0.005943145 = queryWeight(title:neurology^10.0), product of:
>           10.0 = boost
>           0.5945349 = idf(docFreq=2, maxDocs=2)
>           9.996294E-4 = queryNorm
>         0.29726744 = (MATCH) fieldWeight(title:neurology in 1), product of:
>           1.0 = tf(termFreq(title:neurology)=1)
>           0.5945349 = idf(docFreq=2, maxDocs=2)
>           0.5 = fieldNorm(field=title, doc=1)
>   0.0031800664 = (MATCH) max of:
>     0.0031800664 = (MATCH) weight(title:"journal of neurology"~2^2.0 in 1), product of:
>       0.0035658872 = queryWeight(title:"journal of neurology"~2^2.0), product of:
>         2.0 = boost
>         1.7836046 = idf(title: journal=2 of=2 neurology=2)
>         9.996294E-4 = queryNorm
>       0.8918023 = fieldWeight(title:"journal of neurology" in 1), product of:
>         1.0 = tf(phraseFreq=1.0)
>         1.7836046 = idf(title: journal=2 of=2 neurology=2)
>         0.5 = fieldNorm(field=title, doc=1)
>   0.99862975 = (MATCH) sum of:
>     0.99862975 = (MATCH) MatchAllDocsQuery, product of:
>       0.99862975 = queryNorm
> </str>
>
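
For anyone following the explain output above, the per-term numbers can be reproduced by hand. This is only a sketch: it assumes the classic DefaultSimilarity formulas, idf = 1 + ln(maxDocs / (docFreq + 1)) and tf = sqrt(termFreq), and it takes queryNorm, boost and fieldNorm straight from the quoted output. The function names are invented for the illustration.

```python
import math


def idf(doc_freq, max_docs):
    # Classic TF-IDF idf as used by DefaultSimilarity (assumed here).
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def tf(term_freq):
    # DefaultSimilarity uses the square root of the raw term frequency.
    return math.sqrt(term_freq)


def term_weight(boost, query_norm, field_norm, term_freq, doc_freq, max_docs):
    """weight = queryWeight * fieldWeight, as laid out in the explain tree."""
    term_idf = idf(doc_freq, max_docs)
    query_weight = boost * term_idf * query_norm
    field_weight = tf(term_freq) * term_idf * field_norm
    return query_weight, field_weight, query_weight * field_weight


if __name__ == "__main__":
    # Values copied from the explain output for title:journal^10.0
    qw, fw, w = term_weight(boost=10.0, query_norm=9.996294e-4,
                            field_norm=0.5, term_freq=1,
                            doc_freq=2, max_docs=2)
    print(qw, fw, w)
```

Because fieldNorm is the only factor that depends on the title length, and it is 0.5 for both documents, every term weight comes out identical and so do the final scores.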