Overview

To search text efficiently and effectively, Solr (mostly Lucene, actually) splits the text into tokens, both during indexing and at query (search) time. Those tokens can also be pre- and post-filtered for additional flexibility. This allows for things like case-insensitive search, matching misspelt product names, synonyms, and so on.

To achieve all this flexibility, Solr comes with quite a variety of methods to manipulate the text. Understanding which filters and tokenizers are available and what they actually do is a major stumbling block for new Solr users. This page provides a comprehensive overview of all the classes that can be used in Solr, together with links to their Javadoc pages.

Most of the analyzers, tokenizers and filters are located in lucene-analyzers-common-4.9.0.jar ( example/solr-webapp/webapp/WEB-INF/lib/ ), so any entry without a location indicated can be found in that jar.

Note: all of this applies only to fields whose fieldType class is solr.TextField. If the fieldType class is solr.StrField, the value is not analyzed (similar to using a plain KeywordTokenizerFactory).
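
As a rough illustration of the difference, the sketch below puts a non-analyzed type next to an analyzed one built on KeywordTokenizerFactory. The type name string_ci and the choice of LowerCaseFilterFactory are assumptions made for the example, not something this page prescribes.

<!-- Not analyzed: the whole value is indexed verbatim as a single term -->
<fieldType name="string" class="solr.StrField" sortMissingLast="true"/>

<!-- Analyzed: KeywordTokenizerFactory also emits the whole value as one token,
     but, unlike StrField, filters can be added (here: case-insensitive exact match) -->
<fieldType name="string_ci" class="solr.TextField" sortMissingLast="true" omitNorms="true">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>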

Non-chainable analyzers

The analyzers below are standalone: text goes in and a sequence of tokens comes out. The same analyzer is used during indexing and during search. Many of these come from Lucene itself. Only analyzers that can be used by Solr are listed here. Lucene has some other analyzers that cannot be used directly because they have non-standard initialization requirements.

<fieldType name="text_greek" class="solr.TextField">
  <analyzer class="org.apache.lucene.analysis.el.GreekAnalyzer"/>
</fieldType>

Analyzer in lucene-core-4.9.0.jar ( example/solr-webapp/webapp/WEB-INF/lib/ )

AnalyzerWrapper in lucene-core-4.9.0.jar ( example/solr-webapp/webapp/WEB-INF/lib/ )

ShingleAnalyzerWrapper

ChineseAnalyzer

DutchAnalyzer

KeywordAnalyzer

MorfologikAnalyzer in lucene-analyzers-morfologik-4.9.0.jar ( contrib/analysis-extras/lucene-libs/ )

SimpleAnalyzer

SmartChineseAnalyzer in lucene-analyzers-smartcn-4.9.0.jar ( contrib/analysis-extras/lucene-libs/ )

StopwordAnalyzerBase

ArabicAnalyzer

ArmenianAnalyzer

BasqueAnalyzer

BrazilianAnalyzer

BulgarianAnalyzer

CatalanAnalyzer

CJKAnalyzer

ClassicAnalyzer

CzechAnalyzer

DanishAnalyzer

EnglishAnalyzer

FinnishAnalyzer

FrenchAnalyzer

GalicianAnalyzer

GermanAnalyzer

GreekAnalyzer

HindiAnalyzer

HungarianAnalyzer

IndonesianAnalyzer

IrishAnalyzer

ItalianAnalyzer

JapaneseAnalyzer in lucene-analyzers-kuromoji-4.9.0.jar ( example/solr-webapp/webapp/WEB-INF/lib/ )

LatvianAnalyzer

NorwegianAnalyzer

PersianAnalyzer

PolishAnalyzer in lucene-analyzers-stempel-4.9.0.jar ( contrib/analysis-extras/lucene-libs/ )

PortugueseAnalyzer

RomanianAnalyzer

RussianAnalyzer

SoraniAnalyzer

SpanishAnalyzer

StandardAnalyzer

StopAnalyzer

SwedishAnalyzer

ThaiAnalyzer

TurkishAnalyzer

UAX29URLEmailAnalyzer

WhitespaceAnalyzer
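
Analyzers that live in a jar under contrib/ are generally not on the classpath until that jar is loaded. The following is a minimal sketch of one way to do that for SmartChineseAnalyzer. The dir value and the type name text_zh are assumptions: the relative path depends on where your core's instance directory sits, so adjust it to your installation.

<!-- solrconfig.xml: load the extra analysis jar.
     The dir path is an assumption; it is resolved relative to the core's instance directory. -->
<lib dir="../../../contrib/analysis-extras/lucene-libs" regex="lucene-analyzers-smartcn-.*\.jar"/>

<!-- schema.xml: reference the analyzer by its full class name -->
<fieldType name="text_zh" class="solr.TextField">
  <analyzer class="org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer"/>
</fieldType>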

Chainable tokenizers and filters

A more flexible approach than a single all-encompassing analyzer is to chain and configure a tokenizer and some filters together to fit particular requirements. Solr allows up to three types of components in the chain:

Character filters
These are optional and operate on the original text (before it is split into tokens). They can change the text in any way imaginable by adding, removing or transforming characters. There can be none, one, or many of these filters, and they operate in the sequence defined.
Tokenizer
There can only be one of these and its presence is compulsory. The tokenizer takes the text stream and splits out a sequence of tokens with their positions. Strictly speaking, the output is a graph, but most of the time we can think of it as a sequence.
Token filters
These filters are also optional and work similarly to character filters, but on individual tokens. They can change tokens, remove them or add additional ones. They output tokens, so naturally they can also be chained.
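
To make the three component types concrete, here is a minimal sketch of a chain that uses one of each. The type name text_html and the stopwords.txt file name are assumptions for illustration; the factories themselves are taken from the lists on this page.

<fieldType name="text_html" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- 1. Character filter: strips HTML markup from the raw text -->
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <!-- 2. Tokenizer: splits the cleaned text into tokens (exactly one per chain) -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- 3. Token filters: applied to the tokens in the order listed -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
  </analyzer>
</fieldType>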

Character filters

CharFilterFactory

HTMLStripCharFilterFactory

ICUNormalizer2CharFilterFactory (multi) in lucene-analyzers-icu-4.9.0.jar ( contrib/analysis-extras/lucene-libs/ )

JapaneseIterationMarkCharFilterFactory (multi) in lucene-analyzers-kuromoji-4.9.0.jar ( example/solr-webapp/webapp/WEB-INF/lib/ )

LegacyHTMLStripCharFilterFactory

MappingCharFilterFactory (multi)

PatternReplaceCharFilterFactory

PersianCharFilterFactory (multi)
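
Since several character filters can be stacked, a short sketch may help show how the order plays out. The type name, the mapping file name and the regular expression below are assumptions chosen for the example.

<fieldType name="text_charfiltered" class="solr.TextField">
  <analyzer>
    <!-- Runs first: replaces characters according to the rules in a mapping file -->
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
    <!-- Runs second, on the output of the first: collapses each run of digits into '#' -->
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="[0-9]+" replacement="#"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>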

Tokenizers

TokenizerFactory

ArabicLetterTokenizerFactory

ChineseTokenizerFactory

CJKTokenizerFactory

ClassicTokenizerFactory

EdgeNGramTokenizerFactory

HMMChineseTokenizerFactory in lucene-analyzers-smartcn-4.9.0.jar ( contrib/analysis-extras/lucene-libs/ )

ICUTokenizerFactory in lucene-analyzers-icu-4.9.0.jar ( contrib/analysis-extras/lucene-libs/ )

JapaneseTokenizerFactory in lucene-analyzers-kuromoji-4.9.0.jar ( example/solr-webapp/webapp/WEB-INF/lib/ )

KeywordTokenizerFactory

LetterTokenizerFactory

LowerCaseTokenizerFactory (multi)

NGramTokenizerFactory

PathHierarchyTokenizerFactory

PatternTokenizerFactory

RussianLetterTokenizerFactory

SmartChineseSentenceTokenizerFactory in lucene-analyzers-smartcn-4.9.0.jar ( contrib/analysis-extras/lucene-libs/ )

StandardTokenizerFactory

ThaiTokenizerFactory

UAX29URLEmailTokenizerFactory

UIMAAnnotationsTokenizerFactory in lucene-analyzers-uima-4.9.0.jar ( contrib/uima/lucene-libs/ )

UIMATypeAwareAnnotationsTokenizerFactory in lucene-analyzers-uima-4.9.0.jar ( contrib/uima/lucene-libs/ )

WhitespaceTokenizerFactory

WikipediaTokenizerFactory
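
The choice of tokenizer determines what counts as a token in the first place. As one hedged example, the sketch below uses PathHierarchyTokenizerFactory at index time, so that a value such as /usr/local/lib is indexed as /usr, /usr/local and /usr/local/lib, while the query side keeps the whole input as a single token. The type name descendent_path is only a label for the example.

<fieldType name="descendent_path" class="solr.TextField">
  <analyzer type="index">
    <!-- Emits one token per ancestor path, split on the delimiter -->
    <tokenizer class="solr.PathHierarchyTokenizerFactory" delimiter="/"/>
  </analyzer>
  <analyzer type="query">
    <!-- Keeps the queried path as one token -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
  </analyzer>
</fieldType>

With this setup, a query for /usr/local would be expected to match documents stored under /usr/local and anything below it.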

Token filters

TokenFilterFactory

ApostropheFilterFactory

ArabicNormalizationFilterFactory (multi)

ArabicStemFilterFactory

ASCIIFoldingFilterFactory (multi)

BaseManagedTokenFilterFactory

ManagedStopFilterFactory

ManagedSynonymFilterFactory

BeiderMorseFilterFactory in lucene-analyzers-phonetic-4.9.0.jar ( example/solr-webapp/webapp/WEB-INF/lib/ )

BrazilianStemFilterFactory

BulgarianStemFilterFactory

CapitalizationFilterFactory

ChineseFilterFactory

CJKBigramFilterFactory

CJKWidthFilterFactory (multi)

ClassicFilterFactory

CodepointCountFilterFactory

CollationKeyFilterFactory (multi)

CommonGramsFilterFactory

CommonGramsQueryFilterFactory

CzechStemFilterFactory

DelimitedPayloadTokenFilterFactory

DictionaryCompoundWordTokenFilterFactory

DoubleMetaphoneFilterFactory in lucene-analyzers-phonetic-4.9.0.jar ( example/solr-webapp/webapp/WEB-INF/lib/ )

EdgeNGramFilterFactory

ElisionFilterFactory (multi)

EnglishMinimalStemFilterFactory

EnglishPossessiveFilterFactory

FinnishLightStemFilterFactory

FrenchLightStemFilterFactory

FrenchMinimalStemFilterFactory

GalicianMinimalStemFilterFactory

GalicianStemFilterFactory

GermanLightStemFilterFactory

GermanMinimalStemFilterFactory

GermanNormalizationFilterFactory (multi)

GermanStemFilterFactory

GreekLowerCaseFilterFactory (multi)

GreekStemFilterFactory

HindiNormalizationFilterFactory (multi)

HindiStemFilterFactory

HungarianLightStemFilterFactory

HunspellStemFilterFactory

HyphenatedWordsFilterFactory

HyphenationCompoundWordTokenFilterFactory

ICUCollationKeyFilterFactory (multi) in lucene-analyzers-icu-4.9.0.jar ( contrib/analysis-extras/lucene-libs/ )

ICUFoldingFilterFactory (multi) in lucene-analyzers-icu-4.9.0.jar ( contrib/analysis-extras/lucene-libs/ )

ICUNormalizer2FilterFactory (multi) in lucene-analyzers-icu-4.9.0.jar ( contrib/analysis-extras/lucene-libs/ )

ICUTransformFilterFactory (multi) in lucene-analyzers-icu-4.9.0.jar ( contrib/analysis-extras/lucene-libs/ )

IndicNormalizationFilterFactory (multi)

IndonesianStemFilterFactory

IrishLowerCaseFilterFactory (multi)

ItalianLightStemFilterFactory

JapaneseBaseFormFilterFactory in lucene-analyzers-kuromoji-4.9.0.jar ( example/solr-webapp/webapp/WEB-INF/lib/ )

JapaneseKatakanaStemFilterFactory in lucene-analyzers-kuromoji-4.9.0.jar ( example/solr-webapp/webapp/WEB-INF/lib/ )

JapanesePartOfSpeechStopFilterFactory in lucene-analyzers-kuromoji-4.9.0.jar ( example/solr-webapp/webapp/WEB-INF/lib/ )

JapaneseReadingFormFilterFactory in lucene-analyzers-kuromoji-4.9.0.jar ( example/solr-webapp/webapp/WEB-INF/lib/ )

KeepWordFilterFactory

KeywordMarkerFilterFactory

KeywordRepeatFilterFactory

KStemFilterFactory

LatvianStemFilterFactory

LengthFilterFactory

LimitTokenCountFilterFactory

LimitTokenPositionFilterFactory

LowerCaseFilterFactory (multi)

MorfologikFilterFactory in lucene-analyzers-morfologik-4.9.0.jar ( contrib/analysis-extras/lucene-libs/ )

NGramFilterFactory

NorwegianLightStemFilterFactory

NorwegianMinimalStemFilterFactory

NumericPayloadTokenFilterFactory

PatternCaptureGroupFilterFactory

PatternReplaceFilterFactory

PersianNormalizationFilterFactory (multi)

PhoneticFilterFactory in lucene-analyzers-phonetic-4.9.0.jar ( example/solr-webapp/webapp/WEB-INF/lib/ )

PorterStemFilterFactory

PortugueseLightStemFilterFactory

PortugueseMinimalStemFilterFactory

PortugueseStemFilterFactory

PositionFilterFactory

RemoveDuplicatesTokenFilterFactory

ReversedWildcardFilterFactory

ReverseStringFilterFactory

RussianLightStemFilterFactory

ScandinavianFoldingFilterFactory

ScandinavianNormalizationFilterFactory

ShingleFilterFactory

SmartChineseWordTokenFilterFactory in lucene-analyzers-smartcn-4.9.0.jar ( contrib/analysis-extras/lucene-libs/ )

SnowballPorterFilterFactory

SoraniNormalizationFilterFactory (multi)

SoraniStemFilterFactory

SpanishLightStemFilterFactory

StandardFilterFactory

StemmerOverrideFilterFactory

StempelPolishStemFilterFactory in lucene-analyzers-stempel-4.9.0.jar ( contrib/analysis-extras/lucene-libs/ )

StopFilterFactory

SwedishLightStemFilterFactory

SynonymFilterFactory

ThaiWordFilterFactory

TokenOffsetPayloadTokenFilterFactory

TrimFilterFactory

TruncateTokenFilterFactory

TurkishLowerCaseFilterFactory (multi)

TypeAsPayloadTokenFilterFactory

TypeTokenFilterFactory

UpperCaseFilterFactory (multi)

WordDelimiterFilterFactory
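
Token filters are usually combined, and their order matters because each one sees the output of the previous one. The following is a small sketch of an English stemming chain, with the type name text_en_simple made up for the example.

<fieldType name="text_en_simple" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- Lowercase first, so later filters see normalized tokens -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Strip trailing 's from tokens before stemming -->
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <!-- Reduce words to their stems, e.g. "running" and "runs" both become "run" -->
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>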

Analyzer chain types

In Solr, the text is analyzed twice: once when it gets indexed and once when it gets queried (searched).

It's possible to define the same chain for both of these phases:

<fieldType name="text_fa" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <charFilter class="solr.PersianCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ArabicNormalizationFilterFactory"/>
    <filter class="solr.PersianNormalizationFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_fa.txt"/>
  </analyzer>
</fieldType>

Alternatively, the index and query chains can be different:

<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

Finally, there is a third, usually hidden, chain type, which is used for multiterm analysis (queries like term* and [term1 TO term2]). It is hidden because it is usually constructed automatically from the explicitly defined chain, using only the components that are multiterm-aware. Those are marked with (multi) in the lists above. The primary use case is to ensure that case-insensitive matches work as expected even when wildcards are used. You can read a more complete explanation in the Solr Wiki.

To define it explicitly, add an <analyzer type="multiterm"> section next to the index and query sections in the analyzer chain definition.
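
As a sketch of what an explicit multiterm section might look like (the type name text_ci is made up for illustration), the definition below keeps wildcard and range terms whole and only lowercases them:

<fieldType name="text_ci" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <!-- Applied to wildcard, prefix and range terms instead of the query chain -->
  <analyzer type="multiterm">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

With a definition along these lines, a query such as Term* would be lowercased to term* before matching, which is usually what a case-insensitive field needs.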

Short Names

Note that most analyzer, tokenizer and filter factories can be referenced by a short name that uses the solr. prefix, such as:
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
Only non-core components require the full class name, including the package name.
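
For comparison, the sketch below shows the short and the fully qualified form of the same tokenizer, followed by a custom filter. The class com.example.analysis.MyCustomFilterFactory is hypothetical and stands in for any component of your own that Solr cannot resolve by short name.

<!-- Short name -->
<tokenizer class="solr.WhitespaceTokenizerFactory"/>

<!-- Same tokenizer, referenced by its full class name -->
<tokenizer class="org.apache.lucene.analysis.core.WhitespaceTokenizerFactory"/>

<!-- Hypothetical custom component: the full class name is required -->
<filter class="com.example.analysis.MyCustomFilterFactory"/>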


Previous versions of this document

You can also find archived versions of this document for version 4.8.0 and version 4.7.0.
