Eric,
Your solution will have other effects, because it also performs
Spanish-language Porter (Snowball) stemming, which you may or may not want
depending on your use case.
I think you can accomplish what you want by using ICUFoldingFilterFactory
https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.ICUFoldingFilterFactory
which should simply perform ICU-based (cf. http://site.icu-project.org/)
character folding (cf. http://www.unicode.org/reports/tr30/tr30-4.html).
In schema.xml I generally have the following in both the index and query
analyzers:
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.ICUFoldingFilterFactory" />
That should take care of many search issues relating to diacritics and
accents. For example, I wanted to have Łódź and Lodz index and search
identically, and this does that.
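To show where those two lines sit, a minimal sketch of a complete field type might look like the following. The name text_folded is a hypothetical placeholder, and the sketch assumes the ICU analysis jars from Solr's analysis-extras contrib are on the classpath, since ICUFoldingFilterFactory is not part of the core distribution.

```xml
<!-- "text_folded" is a hypothetical name; use whatever fits your schema -->
<fieldType name="text_folded" class="solr.TextField"
           positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- folds case, accents, and diacritics (for example, Łódź becomes lodz) -->
    <filter class="solr.ICUFoldingFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <!-- same chain as the index analyzer, so both sides fold identically -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
  </analyzer>
</fieldType>
```

Because ICU folding also folds case, a separate LowerCaseFilterFactory should not be needed in this chain.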
If you are using Tomcat, you might also want to set the URIEncoding
attribute. See https://wiki.apache.org/solr/SolrTomcat and the line on that
page including <Connector ... URIEncoding="UTF-8"/>
For example, it might look like:
<Connector port="8080" protocol="HTTP/1.1" connectionTimeout="20000"
redirectPort="8443" URIEncoding="UTF-8" />
By the way, I also wanted to have ö, ä, and ü index and query the same as
oe, ae, and ue because those are very common variants in German terms
rendered in English texts. The only way I could figure out how to
accomplish that was to use a charFilter by creating a file named
mapping-GermanUmlauts.txt
containing
"ae" => "a"
"oe" => "o"
"ue" => "u"
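and then I added this to the analyzer that has the filter
class=solr.ICUFoldingFilterFactory. Solr applies char filters to the raw
text before the tokenizer runs, and the folding filter already reduces ä,
ö, and ü to a, o, and u, so both spellings end up as the same term: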
<charFilter class="solr.MappingCharFilterFactory"
mapping="mapping-GermanUmlauts.txt"/>
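Put together, the analyzer might look like the sketch below, with the charFilter listed first as is conventional. This is only an illustration under a few assumptions: the mapping file is assumed to live in the core's conf directory next to schema.xml, and a single analyzer element without a type attribute is used so that the same chain serves both index and query.

```xml
<!-- Sketch only; assumes mapping-GermanUmlauts.txt is in the conf directory -->
<analyzer>
  <!-- rewrites ae, oe, ue to a, o, u in the raw text before tokenizing -->
  <charFilter class="solr.MappingCharFilterFactory"
              mapping="mapping-GermanUmlauts.txt"/>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <!-- folds ä, ö, ü (and case) so they meet the mapped forms above -->
  <filter class="solr.ICUFoldingFilterFactory"/>
</analyzer>
```

With this chain, Müller and Mueller should both be indexed and queried as muller. The mapping touches every occurrence of those letter pairs, so a word such as "blue" is rewritten to "blu" as well. Since the same rewrite happens at index and query time, matching stays consistent.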
I hope this is helpful.
---------- Forwarded message ----------
From: Eric Lease Morgan <[log in to unmask]>
Date: Mon, Feb 16, 2015 at 4:58 PM
Subject: Re: [CODE4LIB] indexing word documents using solr [diacritics,
resolved (i think) ]
To: [log in to unmask]
I know the documents I’m indexing are written in Spanish, and by adding the
following filters to my field definition, I believe I have resolved my
problem:
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.SnowballPorterFilterFactory" language="Spanish" />
In other words, my searchable content is defined thus:
<field name="text" type="text_general" indexed="true" stored="true"
multiValued="false" />
And “text_general” is defined to include the filters in both the index and
query sections:
<fieldType name="text_general" class="solr.TextField"
           positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory" />
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt" />
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="Spanish" />
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory" />
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt" />
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true" />
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.SnowballPorterFilterFactory" language="Spanish" />
  </analyzer>
</fieldType>