LISTSERV 16.5 - CODE4LIB Archives

Eric,

Your solution performs Spanish-language Porter (Snowball) stemming, which
will have other effects that you may or may not want, depending on your
use case.

I think you can accomplish what you want by using ICUFoldingFilterFactory:
https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.ICUFoldingFilterFactory

It should simply perform ICU-based (cf. http://site.icu-project.org/)
character folding (cf. http://www.unicode.org/reports/tr30/tr30-4.html).

In schema.xml I generally have the following in both the index and query
analyzers:

    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.ICUFoldingFilterFactory" />

That should take care of many search issues relating to diacritics and
accents. For example, I wanted to have Łódź and Lodz index and search
identically, and this does that.
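
Spelled out as a complete field type, the two filter lines above might sit
in a definition like the sketch below. The name text_icu is made up for
illustration, and the sketch assumes the ICU analysis jars from Solr's
analysis-extras contrib are on the classpath, since
ICUFoldingFilterFactory is not part of the core analyzers. ICU folding
also folds case, so no separate LowerCaseFilterFactory appears here.

    <!-- "text_icu" is a hypothetical name; adjust it to your schema -->
    <fieldType name="text_icu" class="solr.TextField"
               positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <!-- folds accents, diacritics, and case at index time -->
        <filter class="solr.ICUFoldingFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <!-- same folding at query time, so both sides agree -->
        <filter class="solr.ICUFoldingFilterFactory"/>
      </analyzer>
    </fieldType>

Documents indexed before a change like this would need to be reindexed
for the folded terms to be searchable.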

If you are using Tomcat, you might also want to set up the URIEncoding. See
https://wiki.apache.org/solr/SolrTomcat and the line on that page
including <Connector ... URIEncoding="UTF-8"/>

For example, the line might look like this:
<Connector port="8080" protocol="HTTP/1.1" connectionTimeout="20000"
redirectPort="8443" URIEncoding="UTF-8" />

By the way, I also wanted to have ö, ä, and ü index and query the same as
oe, ae, and ue, because those are very common variants of German terms
rendered in English texts. The only way I could figure out to accomplish
that was a charFilter. ICU folding already reduces ä, ö, and ü to a, o,
and u, so mapping the two-letter forms to the same letters makes both
spellings match. I created a file named mapping-GermanUmlauts.txt
containing
"ae" => "a"
"oe" => "o"
"ue" => "u"
and then I added this after the filter class="solr.ICUFoldingFilterFactory"
line:
<charFilter class="solr.MappingCharFilterFactory"
mapping="mapping-GermanUmlauts.txt"/>
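
Put together, the German variant of the analyzer might look like the
sketch below. Char filters work on the raw character stream before the
tokenizer sees it, so the sketch lists the charFilter first, which is the
conventional placement. The field type name text_icu_de is hypothetical,
and mapping-GermanUmlauts.txt is assumed to sit in the conf directory next
to schema.xml. The sketch uses a single analyzer element with no type
attribute, which Solr applies at both index and query time.

    <!-- "text_icu_de" is a hypothetical name for illustration -->
    <fieldType name="text_icu_de" class="solr.TextField"
               positionIncrementGap="100">
      <analyzer>
        <!-- rewrites ae, oe, ue to a, o, u before tokenizing -->
        <charFilter class="solr.MappingCharFilterFactory"
                    mapping="mapping-GermanUmlauts.txt"/>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <!-- folds ä, ö, ü to a, o, u, so both spellings meet -->
        <filter class="solr.ICUFoldingFilterFactory"/>
      </analyzer>
    </fieldType>

One side effect to keep in mind: the mapping rewrites every ae, oe, and
ue, so an English word such as blue is indexed as blu. Searches still
match because the query text is rewritten the same way. The mapping is
also case-sensitive as far as I know, so capitalized forms such as Oe
would need their own lines in the file.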

I hope this is helpful.

---------- Forwarded message ----------
From: Eric Lease Morgan
Date: Mon, Feb 16, 2015 at 4:58 PM
Subject: Re: [CODE4LIB] indexing word documents using solr [diacritics,
resolved (i think) ]


I know the documents I’m indexing are written in Spanish, and by adding the
following filters to my field definition, I believe I have resolved my
problem:

  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.SnowballPorterFilterFactory" language="Spanish" />

In other words, my searchable content is defined thus:

  <field name="text" type="text_general" indexed="true" stored="true"
multiValued="false" />

And “text_general” is defined to include the filters in both the index and
query sections:

  <fieldType name="text_general" class="solr.TextField"
positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.StandardTokenizerFactory" />
      <filter class="solr.StopFilterFactory" ignoreCase="true"
words="stopwords.txt" />
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.SnowballPorterFilterFactory" language="Spanish" />
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.StandardTokenizerFactory" />
      <filter class="solr.StopFilterFactory" ignoreCase="true"
words="stopwords.txt" />
      <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
ignoreCase="true" expand="true" />
      <filter class="solr.LowerCaseFilterFactory" />
      <filter class="solr.SnowballPorterFilterFactory" language="Spanish" />
    </analyzer>
  </fieldType>