Hi Eric,

If you're pretty sure you indexed the characters properly and are getting garbage no matter what you do, my first thought is that this is a localization issue. Can you cat/grep/sed/vi/whatever these characters in a terminal window? If not, that is at least part of your problem. Running locale-gen en_US.UTF-8 may help. <rant>Why this hasn't been a default for many years is ridiculous, but I digress.</rant>

If you can type and read these characters in a terminal window, the problem is downstream. In that case, I'd verify the Accept-Encoding headers and keep working your way down until you hit the index.

kyle

On Tue, Aug 14, 2018 at 7:50 AM Eric Lease Morgan <[log in to unmask]> wrote:

> How do I go about indexing & searching Chinese text using Solr?
>
> I have a pile o' simplified Chinese text encoded in UTF-8. Taking hints
> from some Solr documentation [1], I have configured my index thusly:
>
>   <schema name="hui" version="1.6">
>     <uniqueKey>key</uniqueKey>
>
>     <!-- local field types -->
>     <fieldType name="string" class="solr.StrField" sortMissingLast="true"
>                docValues="true"/>
>     <fieldType name="long" class="solr.TrieLongField"
>                positionIncrementGap="0" docValues="true" precisionStep="0"/>
>
>     <!-- Chinese indexing configuration happens here -->
>     <fieldType name="text_general" class="solr.TextField">
>       <analyzer>
>         <tokenizer class="solr.HMMChineseTokenizerFactory"/>
>         <filter class="solr.CJKWidthFilterFactory"/>
>         <filter class="solr.StopFilterFactory"
>                 words="org/apache/lucene/analysis/cn/smart/stopwords.txt"/>
>         <filter class="solr.PorterStemFilterFactory"/>
>         <filter class="solr.LowerCaseFilterFactory"/>
>       </analyzer>
>     </fieldType>
>
>     <!-- good to have fields -->
>     <field name="_root_" type="string" docValues="false"
>            indexed="true" stored="false"/>
>     <field name="_text_" type="text_general" multiValued="true"
>            indexed="true" stored="false"/>
>     <field name="_version_" type="long" indexed="true" stored="false"/>
>
>     <!-- my fields -->
>     <field name="fulltext" type="text_general" multiValued="false"
>            indexed="true" stored="true"/>
>     <field name="key" type="text_general" multiValued="false"
>            indexed="true" stored="true"/>
>
>     <!-- cool field -->
>     <copyField source="*" dest="_text_"/>
>
>   </schema>
>
> I then index my content, and as per the Solr Admin interface, my index
> includes 130 documents. I then have problems searching:
>
> * Using the Solr Admin interface, I can search for everything (*:*), and
>   all my results are returned, but the Chinese characters are all mangled.
>
> * Using a Lynx (terminal) interface, I can search for everything (*:*),
>   and all my results are returned, but the Chinese characters are all
>   mangled.
>
> * Using a Perl interface of my own design, I can search for everything
>   (*:*), and I get results whose characters are NOT mangled.
>
> * Using the same Perl interface, I try to enter a query using Chinese
>   characters, but I always get zero results.
>
> * Using the same Perl interface, I can search for the word "body" (an
>   HTML element I didn't delete), and I get the expected results.
>
> * Using the same Perl interface, I can enter a query using the mangled
>   characters, and I get the sorts of results I expect.
>
> I believe I have indexed my documents "correctly", but I can't seem to
> query the index in the expected manner. What might I be doing wrong?
>
> [1] Solr documentation -
> https://lucene.apache.org/solr/guide/6_6/language-analysis.html#LanguageAnalysis-TraditionalChinese
>
> --
> Eric Lease Morgan
> Digital Initiatives Librarian, Navari Family Center for Digital Scholarship
> Hesburgh Libraries
>
> University of Notre Dame
> 250E Hesburgh Library
> Notre Dame, IN 46556
> o: 574-631-8604
> w: cds.library.nd.edu
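
P.S. The locale check above can be sketched as a quick terminal test. This is a minimal sketch; the en_US.UTF-8 locale name and the Debian-style locale-gen command are assumptions about the environment, so adjust for your distribution:

```shell
# Show the active locale; LANG/LC_CTYPE should end in ".UTF-8"
# for multibyte characters to survive the terminal.
locale

# If they do not, generate and activate a UTF-8 locale
# (Debian/Ubuntu-style; uncomment to run):
#   sudo locale-gen en_US.UTF-8
#   export LANG=en_US.UTF-8

# With a sane terminal and locale, this echoes the characters back
# intact rather than as mojibake:
echo '简体中文'
```

If the echoed string comes back garbled, the problem is upstream of Solr entirely.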
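
P.P.S. On the "mangled query matches, real query doesn't" symptom: a common culprit in hand-rolled clients is sending the query as raw or doubly encoded bytes instead of percent-encoding the UTF-8 form in the request URL. Here is a minimal sketch of building a correctly encoded select URL; it is in Python rather than Perl, and the core name "hui", the field "fulltext", the host/port, and the query term are assumptions taken from or invented around the schema above:

```python
from urllib.parse import urlencode

# Hypothetical Chinese query term; replace with a term from your corpus.
term = "中文"

# urlencode() percent-encodes the term's UTF-8 bytes, which is what
# Solr expects on the wire.
params = urlencode({"q": "fulltext:" + term, "wt": "json"})
url = "http://localhost:8983/solr/hui/select?" + params
print(url)
```

In Perl, the analogous move is to make sure the query string is character data (decoded UTF-8) and then escape it with something like uri_escape_utf8 from URI::Escape before it is interpolated into the URL.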