How do I retain diacritics in a Solr index, and how to I search for words containing them?
I have extracted the plain text out of set of Word documents. I have then used a Perl interface (WebService::Solr) to add the plain text to a Solr index using a field type called text_general:
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory" />
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
<filter class="solr.LowerCaseFilterFactory" />
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory" />
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true" />
<filter class="solr.LowerCaseFilterFactory" />
</analyzer>
</fieldType>
It seems as if I am unable to search for words like ejecución because the diacritic gets in the way. What am I doing wrong?
—
Eric
|