LISTSERV 16.5 - CODE4LIB Archives

Ah, the wonderful world of character encoding...

To quote the Solr wiki:
There are no known bugs with Solr's character handling, but there have been some reported issues with the way different application servers (and different versions of the same application server) treat incoming and outgoing multibyte characters. In particular, people have reported better success with Tomcat than with Jetty... (https://wiki.apache.org/solr/FAQ#Why_don.27t_International_Characters_Work.3F )

I'd probably start by enabling UTF-8 in Tomcat/Jetty and seeing whether that resolves the issue.
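
For Tomcat, that usually means setting the URIEncoding attribute on the HTTP connector in conf/server.xml. A rough sketch, assuming a stock connector definition (the port, protocol, timeout, and redirectPort values below are placeholders, not taken from your setup):

    <!-- conf/server.xml (values other than URIEncoding are assumed defaults) -->
    <Connector port="8080" protocol="HTTP/1.1"
               connectionTimeout="20000"
               redirectPort="8443"
               URIEncoding="UTF-8" />

Tomcat has to be restarted before a change to server.xml takes effect. Jetty is configured differently, so check its own documentation for the equivalent setting.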

If not, I'd check the original files to see what their character encoding is, and then check each application that handles the documents to make sure it's using that encoding. It might be that the originals aren't in UTF-8, or, if they are, that somewhere along the way the parser, the Perl interface, or some other unknown culprit is changing the encoding.

Regards,
Karl Holten
Systems Integration Specialist
SWITCH Inc
414-382-6711

-----Original Message-----
From: Code for Libraries [mailto:[log in to unmask]] On Behalf Of Eric Lease Morgan
Sent: Thursday, February 12, 2015 2:38 PM
To: [log in to unmask]
Subject: Re: [CODE4LIB] indexing word documents using solr [diacritics]

How do I retain diacritics in a Solr index, and how do I search for words containing them?

I have extracted the plain text out of a set of Word documents. I have then used a Perl interface (WebService::Solr) to add the plain text to a Solr index using a field type called text_general:

    <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory" />
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
        <filter class="solr.LowerCaseFilterFactory" />
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory" />
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true" />
        <filter class="solr.LowerCaseFilterFactory" />
      </analyzer>
    </fieldType>

It seems as if I am unable to search for words like ejecución because the diacritic gets in the way. What am I doing wrong?

— 
Eric