Chinese Indexing in Solr

Karan Jeet Singh | Lead Solutions Architect

July 03, 2019

Some of our SearchStax clients index websites that use multiple languages. We were recently asked how to enable Solr indexing of Mandarin on a cloud platform. (This post describes indexing Traditional Chinese characters. It is also possible to use Simplified Chinese by following a similar series of steps. Contact us at support@searchstax.com for an example.)

Solr does not parse Chinese text by default, but it ships with the appropriate tokenizers. The default configuration of the ICU Tokenizer is suitable for Traditional Chinese text. It follows the Word Break rules from the Unicode Text Segmentation algorithm for non-Chinese text, and uses a dictionary to segment Chinese words. To use this tokenizer, you must add extra .jar files to Solr’s classpath (as described below).

Step 1: Obtain Configuration Files

To add Traditional Chinese indexing to your Solr project, you need to modify your project configuration files. If you need to download the files from an existing project, see "How can I view my Zookeeper Configurations?"

Step 2: Add the Required Library

Update the solrconfig.xml file by adding the following lines after the existing lib declarations.

				
<!-- Traditional Chinese library -->
<lib dir="${solr.install.dir:../../../..}/contrib/analysis-extras/lucene-libs" regex="lucene-analyzers-icu-\d.*\.jar" />
<lib dir="${solr.install.dir:../../../..}/contrib/analysis-extras/lib" regex="icu4j-\d.*\.jar" />
<!-- Traditional Chinese library - END -->

This library comes with Solr, so you don’t have to alter your deployment in any way to make it work.

Step 3: Update the Schema

A. Create a new field type in the managed-schema file that uses the ICU Tokenizer.

<fieldType name="text_mandarin" class="solr.TextField">
    <analyzer>
        <tokenizer class="solr.ICUTokenizerFactory"/>
        <filter class="solr.CJKWidthFilterFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
</fieldType>

B. Create a field that uses this field type.
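
As a minimal sketch, a field declaration in the managed-schema file might look like the following. The field name "content_zh" and the "*_zh" suffix are hypothetical and used only for illustration; substitute names that fit your own schema, and adjust the indexed and stored attributes to your needs.

```xml
<!-- Hypothetical field name: replace "content_zh" with a name from your own schema. -->
<field name="content_zh" type="text_mandarin" indexed="true" stored="true"/>

<!-- Optional, also hypothetical: route every field ending in "_zh" to the same field type. -->
<dynamicField name="*_zh" type="text_mandarin" indexed="true" stored="true"/>
```

Either declaration sends the field's text through the ICU Tokenizer and the filters defined in the text_mandarin field type, at both index time and query time.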

Step 4: Upload Configuration and Reload Collection

Upload the altered configuration to your SearchStax cloud server and reload your collection. See "How do I update the Solr Schema?" for step-by-step instructions.