How do I go about indexing & searching Chinese text using Solr?
I have a pile o' simplified Chinese text encoded in UTF-8. Taking hints from some Solr documentation [1], I have configured my index thusly:
<schema name="hui" version="1.6">
  <uniqueKey>key</uniqueKey>

  <!-- local field types -->
  <fieldType name="string" class="solr.StrField" sortMissingLast="true" docValues="true"/>
  <fieldType name="long" class="solr.TrieLongField" positionIncrementGap="0" docValues="true" precisionStep="0"/>

  <!-- chinese indexing configuration happens here -->
  <fieldType name="text_general" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.HMMChineseTokenizerFactory"/>
      <filter class="solr.CJKWidthFilterFactory"/>
      <filter class="solr.StopFilterFactory"
              words="org/apache/lucene/analysis/cn/smart/stopwords.txt"/>
      <filter class="solr.PorterStemFilterFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

  <!-- good to have fields -->
  <field name="_root_" type="string" docValues="false" indexed="true" stored="false"/>
  <field name="_text_" type="text_general" multiValued="true" indexed="true" stored="false"/>
  <field name="_version_" type="long" indexed="true" stored="false"/>

  <!-- my fields -->
  <field name="fulltext" type="text_general" multiValued="false" indexed="true" stored="true" />
  <field name="key" type="text_general" multiValued="false" indexed="true" stored="true" />

  <!-- cool field -->
  <copyField source="*" dest="_text_"/>
</schema>
I then index my content, and as per the Solr Admin interface, my index includes 130 documents. I then have problems searching:
* Using the Solr Admin interface, I can search for everything (*:*), and all my results are returned, but the Chinese characters are all mangled.
* Using a Lynx (terminal) interface, I can search for everything (*:*), and all my results are returned, but the Chinese characters are all mangled.
* Using a Perl interface of my own design, I can search for everything (*:*), and get results and the characters are NOT mangled.
* Using the same Perl interface, I try to enter a query using Chinese characters, but I always get zero results.
* Using the same Perl interface, I can search for the word "body" (an HTML element I didn't delete), and I get the expected results.
* Using the same Perl interface, I can enter a query using the mangled characters, and I get the sorts of results I expect.
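To make the "mangled characters" symptom concrete, here is a minimal Python sketch of one way such text can arise: UTF-8 bytes being read back as Latin-1. Whether this is what actually happens anywhere in my pipeline is only an assumption; the helper names mangle and unmangle and the sample query are hypothetical and are not part of my Perl code or of Solr.

```python
from urllib.parse import quote, urlencode


def mangle(text: str) -> str:
    """Encode text as UTF-8, then misread those bytes as Latin-1 (assumed failure mode)."""
    return text.encode("utf-8").decode("latin-1")


def unmangle(text: str) -> str:
    """Reverse mangle(): recover the original text from its Latin-1 misreading."""
    return text.encode("latin-1").decode("utf-8")


query = "中文"            # hypothetical two-character query
mangled = mangle(query)   # six Latin-1 characters, one per UTF-8 byte

# The clean and the mangled query percent-encode to different byte
# sequences, so they can only match differently-encoded index terms.
clean_param = quote(query)      # "%E4%B8%AD%E6%96%87"
mangled_param = quote(mangled)  # each Latin-1 character re-encoded as UTF-8

# A full query string for the "fulltext" field from the schema above.
query_string = urlencode({"q": "fulltext:" + query})

if __name__ == "__main__":
    print(len(query), len(mangled))
    print(clean_param)
    print(mangled_param)
    print(query_string)
```

If the terms in the index looked like the output of mangle(), that would be consistent with mangled queries matching while clean Chinese queries return nothing, but I have not confirmed that this is the cause.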
I believe I have indexed my documents "correctly", but I can't seem to query the index in the expected manner. What might I be doing wrong?
[1] Solr documentation - https://lucene.apache.org/solr/guide/6_6/language-analysis.html#LanguageAnalysis-TraditionalChinese
--
Eric Lease Morgan
Digital Initiatives Librarian, Navari Family Center for Digital Scholarship
Hesburgh Libraries
University of Notre Dame
250E Hesburgh Library
Notre Dame, IN 46556
o: 574-631-8604
w: cds.library.nd.edu