Hi Eric,

If you're pretty sure you indexed the characters properly and are getting
garbage no matter what you do, my first thought is that this is a
localization issue. Can you cat/grep/sed/vi/whatever these characters in a
terminal window?

If not, that is at least part of your problem. Running

locale-gen en_US.UTF-8

may help. <rant>Why this hasn't been a default for many years is
ridiculous, but I digress</rant> If you can type/read these characters in a
terminal window, the problem is downstream. In that case, I'd verify the
charset in the Content-Type headers and keep working your way down until
you hit the index.
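
A quick end-to-end check might look like the sketch below. The core name
"hui", the port, and the field name are assumptions taken from the schema
in your message; the two sample characters are just stand-ins for whatever
you indexed.

```shell
# Confirm the terminal/locale round-trips UTF-8: these two characters
# should come back as the bytes e4 b8 ad e6 96 87, not '?' or mojibake.
printf '中文' | od -An -tx1

# Then query Solr directly, letting curl URL-encode the raw UTF-8 bytes,
# so you bypass any client library's encoding layer.
# (Assumes a local Solr with a core named "hui"; ignore the error if
# Solr isn't running on this box.)
curl -s 'http://localhost:8983/solr/hui/select' \
     --data-urlencode 'q=fulltext:中文' \
     --data-urlencode 'wt=json' || true
```

If the hex dump is wrong, the locale is the problem; if the hex dump is
right but the curl query returns zero hits, the problem is in how the
query string reaches Solr.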

kyle

On Tue, Aug 14, 2018 at 7:50 AM Eric Lease Morgan <[log in to unmask]> wrote:

> How do I go about indexing & searching Chinese text using Solr?
>
> I have a pile o' simplified Chinese text encoded in UTF-8. Taking hints
> from some Solr documentation [1], I have configured my index thusly:
>
>   <schema name="hui" version="1.6">
>     <uniqueKey>key</uniqueKey>
>
>     <!-- local field types -->
>     <fieldType name="string" class="solr.StrField" sortMissingLast="true"
> docValues="true"/>
>     <fieldType name="long" class="solr.TrieLongField"
> positionIncrementGap="0" docValues="true" precisionStep="0"/>
>
>     <!-- chinese indexing configuration happens here -->
>     <fieldType name="text_general" class="solr.TextField">
>       <analyzer>
>         <tokenizer class="solr.HMMChineseTokenizerFactory"/>
>         <filter class="solr.CJKWidthFilterFactory"/>
>         <filter class="solr.StopFilterFactory"
>           words="org/apache/lucene/analysis/cn/smart/stopwords.txt"/>
>         <filter class="solr.PorterStemFilterFactory"/>
>         <filter class="solr.LowerCaseFilterFactory"/>
>       </analyzer>
>     </fieldType>
>
>     <!-- good to have fields -->
>     <field name="_root_"    type="string"       docValues="false"
>  indexed="true" stored="false"/>
>     <field name="_text_"    type="text_general" multiValued="true"
> indexed="true" stored="false"/>
>     <field name="_version_" type="long"         indexed="true"
> stored="false"/>
>
>     <!-- my fields -->
>     <field name="fulltext" type="text_general"  multiValued="false"
> indexed="true"   stored="true" />
>     <field name="key"      type="text_general"  multiValued="false"
> indexed="true"   stored="true"  />
>
>     <!-- cool field -->
>     <copyField source="*" dest="_text_"/>
>
>   </schema>
>
> I then index my content and, per the Solr Admin interface, my index
> includes 130 documents. But I have problems searching:
>
>   * Using the Solr Admin interface, I can search for everything (*:*), and
> all my results are returned, but the Chinese characters are all mangled.
>
>   * Using a Lynx (terminal) interface, I can search for everything (*:*),
> and all my results are returned, but the Chinese characters are all mangled.
>
>   * Using a Perl interface of my own design, I can search for everything
> (*:*); all my results are returned, and the characters are NOT mangled.
>
>   * Using the same Perl interface, I try to enter a query using Chinese
> characters, but I always get zero results.
>
>   * Using the same Perl interface, I can search for the word "body" (an
> HTML element I didn't delete), and I get the expected results.
>
>   * Using the same Perl interface, I can enter a query using the mangled
> characters, and I get the sorts of results I expect.
>
> I believe I have indexed my documents "correctly", but I can't seem to
> query the index in the expected manner. What might I be doing wrong?
>
> [1] Solr documentation -
> https://lucene.apache.org/solr/guide/6_6/language-analysis.html#LanguageAnalysis-TraditionalChinese
>
> --
> Eric Lease Morgan
> Digital Initiatives Librarian, Navari Family Center for Digital Scholarship
> Hesburgh Libraries
>
> University of Notre Dame
> 250E Hesburgh Library
> Notre Dame, IN 46556
> o: 574-631-8604
> w: cds.library.nd.edu
>