Eric -

How did you index the files? I presume (based on the "body" mention below) that these are HTML files? Can you send along a file (direct to me is fine), along with how you indexed it, and I'll take a look.

	Erik


> On Aug 14, 2018, at 10:49 AM, Eric Lease Morgan wrote:
> 
> How do I go about indexing & searching Chinese text using Solr?
> 
> I have a pile o' simplified Chinese text encoded in UTF-8. Taking hints from some Solr documentation [1], I have configured my index thusly:
> 
>  <schema name="hui" version="1.6">
>    <uniqueKey>key</uniqueKey>
> 
>    <!-- local field types -->
>    <fieldType name="string" class="solr.StrField" sortMissingLast="true" docValues="true"/>
>    <fieldType name="long" class="solr.TrieLongField" positionIncrementGap="0" docValues="true" precisionStep="0"/>
> 
>    <!-- chinese indexing configuration happens here -->
>    <fieldType name="text_general" class="solr.TextField">
>      <analyzer>
>        <tokenizer class="solr.HMMChineseTokenizerFactory"/>
>        <filter class="solr.CJKWidthFilterFactory"/>
>        <filter class="solr.StopFilterFactory"
>          words="org/apache/lucene/analysis/cn/smart/stopwords.txt"/>
>        <filter class="solr.PorterStemFilterFactory"/>
>        <filter class="solr.LowerCaseFilterFactory"/>
>      </analyzer>
>    </fieldType>
> 
>    <!-- good to have fields -->
>    <field name="_root_"    type="string"       docValues="false"   indexed="true" stored="false"/>
>    <field name="_text_"    type="text_general" multiValued="true"  indexed="true" stored="false"/>
>    <field name="_version_" type="long"         indexed="true"      stored="false"/>
> 
>    <!-- my fields -->
>    <field name="fulltext" type="text_general"  multiValued="false"  indexed="true"   stored="true" />
>    <field name="key"      type="text_general"  multiValued="false"  indexed="true"   stored="true"  />
> 
>    <!-- cool field -->
>    <copyField source="*" dest="_text_"/>
> 
>  </schema>
> 
> I then index my content, and as per the Solr Admin interface, my index includes 130 documents. I then have problems searching:
> 
>  * Using the Solr Admin interface, I can search for everything (*:*), and all my results are returned, but the Chinese characters are all mangled.
> 
>  * Using a Lynx (terminal) interface, I can search for everything (*:*), and all my results are returned, but the Chinese characters are all mangled.
> 
>  * Using a Perl interface of my own design, I can search for everything (*:*), and get results and the characters are NOT mangled.
> 
>  * Using the same Perl interface, I try to enter a query using Chinese characters, but I always get zero results.
> 
>  * Using the same Perl interface, I can search for the word "body" (an HTML element I didn't delete), and I get the expected results.
> 
>  * Using the same Perl interface, I can enter a query using the mangled characters, and I get the sorts of results I expect.
> 
> I believe I have indexed my documents "correctly", but I can't seem to query the index in the expected manner. What might I be doing wrong?
> 
> [1] Solr documentation - https://lucene.apache.org/solr/guide/6_6/language-analysis.html#LanguageAnalysis-TraditionalChinese
> 
> -- 
> Eric Lease Morgan
> Digital Initiatives Librarian, Navari Family Center for Digital Scholarship
> Hesburgh Libraries
> 
> University of Notre Dame
> 250E Hesburgh Library
> Notre Dame, IN 46556
> o: 574-631-8604
> w: cds.library.nd.edu
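
The symptoms listed above (mangled characters that are themselves searchable, while real Chinese queries return nothing) would be consistent with UTF-8 bytes having been read as Latin-1 somewhere along the way. That is only a guess and is not confirmed anywhere in this thread. The Python sketch below shows what such mangling looks like and how a Chinese query term is percent-encoded as UTF-8 in a Solr select URL. The helper names, the host, and the core name "hui" (borrowed from the schema name) are assumptions made for illustration.

```python
from urllib.parse import urlencode


def mangle(text):
    """Simulate UTF-8 bytes being misread as Latin-1 (hypothetical failure mode)."""
    return text.encode("utf-8").decode("latin-1")


def unmangle(text):
    """Reverse the misreading above, recovering the original string."""
    return text.encode("latin-1").decode("utf-8")


def build_query_url(base, field, term):
    """Build a Solr select URL with the query percent-encoded as UTF-8."""
    params = {"q": f"{field}:{term}", "wt": "json"}
    return base + "?" + urlencode(params)


if __name__ == "__main__":
    original = "中文"
    garbled = mangle(original)
    # The garbled form has one character per UTF-8 byte of the original.
    print(repr(garbled), len(garbled))
    print(unmangle(garbled) == original)
    # Assumed local Solr instance and core name; adjust to the real setup.
    print(build_query_url("http://localhost:8983/solr/hui/select", "fulltext", original))
```

If the stored values look like the garbled form when fetched as raw JSON, the encoding was probably lost before or during indexing; if they look correct there, the display or query side is the more likely place to look.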