LISTSERV 16.5 - CODE4LIB Archives

How do I retain diacritics in a Solr index, and how do I search for words containing them?

I have extracted the plain text from a set of Word documents. I have then used a Perl interface (WebService::Solr) to add that text to a Solr index, using a field type called text_general:

    <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory" />
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
        <filter class="solr.LowerCaseFilterFactory" />
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory" />
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true" />
        <filter class="solr.LowerCaseFilterFactory" />
      </analyzer>
    </fieldType>

It seems as if I am unable to search for words like ejecución because the diacritic gets in the way. What am I doing wrong?
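
For comparison, here is a variation on the field type above that may be worth testing. It is only a sketch, not a confirmed fix. It assumes the text reaches Solr as valid UTF-8 and that the goal is for "ejecucion" and "ejecución" to find the same documents. The name text_general_folded is made up for illustration. solr.ASCIIFoldingFilterFactory is a stock Solr filter, and its preserveOriginal attribute keeps the accented token in the index alongside the folded one:

```xml
<!-- Hypothetical variant of text_general; the name is illustrative only. -->
<fieldType name="text_general_folded" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory" />
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
    <filter class="solr.LowerCaseFilterFactory" />
    <!-- Index both the accented token and its ASCII-folded form. -->
    <filter class="solr.ASCIIFoldingFilterFactory" preserveOriginal="true" />
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory" />
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true" />
    <filter class="solr.LowerCaseFilterFactory" />
    <!-- Fold query terms so accented and unaccented input match the same token. -->
    <filter class="solr.ASCIIFoldingFilterFactory" />
  </analyzer>
</fieldType>
```

A change to the index-time analyzer only affects documents added afterwards, so the existing documents would need to be reindexed before the two configurations could be compared.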

— 
Eric