July 03, 2019

Chinese Indexing in Solr

Karan Jeet Singh | Lead Solutions Architect

Some of our SearchStax clients index websites that use multiple languages. We were recently asked how to enable Solr indexing of Mandarin on a cloud platform. (This post describes indexing Traditional Chinese characters. It is also possible to use Simplified Chinese by following a similar series of steps. Contact us at support@searchstax.com for an example.)

Solr does not parse Chinese text by default, but it ships with the appropriate tokenizers. The default configuration of the ICU Tokenizer is suitable for Traditional Chinese text. For non-Chinese text it follows the Word Break rules from the Unicode Text Segmentation algorithm, and for Chinese text it uses a dictionary to segment words. To use this tokenizer, you must add extra .jar files to Solr’s classpath (as described below).

Step 1: Obtain Configuration Files

To add Traditional Chinese indexing to your Solr project, you need to modify your project configuration files. If you need to download the files from an existing project, see How can I view my Zookeeper Configurations?

Step 2: Add the Required Library

Update the solrconfig.xml file by adding the following lines after the existing lib declarations.

<!-- Traditional Chinese library -->
<lib dir="${solr.install.dir:../../../..}/contrib/analysis-extras/lucene-libs" regex="lucene-analyzers-icu-\d.*\.jar" />
<lib dir="${solr.install.dir:../../../..}/contrib/analysis-extras/lib" regex="icu4j-\d.*\.jar" />
<!-- Traditional Chinese library - END -->

This library comes with Solr, so you don’t have to alter your deployment in any way to make it work.

Step 3: Update the Schema

A. Create a new field type in the managed-schema file that uses the ICU Tokenizer.

<fieldType name="text_mandarin" class="solr.TextField">
    <analyzer>
        <tokenizer class="solr.ICUTokenizerFactory"/>
        <filter class="solr.CJKWidthFilterFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
</fieldType>

B. Create a field that uses this field type.

<field name="text_man" type="text_mandarin" multiValued="true" indexed="true" stored="true"/>
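
Documents rarely arrive with a text_man field already filled in, so one common way to populate it is a copyField rule in the same managed-schema file. The fragment below is a minimal sketch and not part of the setup described here: "content" is a hypothetical source field name, so substitute whichever field actually holds your Chinese text.

```xml
<!-- Assumption: "content" is a hypothetical source field already defined in your schema -->
<copyField source="content" dest="text_man"/>
```

Because text_man is declared multiValued, more than one source field can be copied into it by adding further copyField rules.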

Step 4: Upload Configuration and Reload Collection

Upload the altered configuration to your SearchStax cloud server and reload your collection. See How do I update the Solr Schema? for step-by-step instructions.
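
After the reload, one way to confirm that the new field works is to index a small test document and search for it. The following is a sketch in Solr's XML update format, assuming the schema's unique key field is named "id". The document ID and the sample sentence are made up for illustration.

```xml
<add>
  <!-- Hypothetical test document: the id value and the Traditional Chinese sentence are illustrative only -->
  <doc>
    <field name="id">zh-test-1</field>
    <field name="text_man">我購買了道具和服裝。</field>
  </doc>
</add>
```

Post this to the collection's /update handler with a commit, then run a query such as text_man:道具 to check that the document is returned.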

Karan Jeet Singh | Lead Solutions Architect

Karan is a scrappy Solutions guy focused on solving onboarding challenges and easing the product adoption challenges for customers.
