Overview

To search text efficiently and effectively, Solr (mostly Lucene, actually) splits the text into tokens, both during indexing and during querying (search). The text can be filtered before it is tokenized, and the resulting tokens can be filtered afterwards, for additional flexibility. This allows for things like case-insensitive search, matching misspelt product names, synonyms, and so on.

To achieve all this flexibility, Solr comes with quite a variety of components for manipulating text. Understanding which filters and tokenizers are available and what they actually do is a major stumbling block for new Solr users. This page provides a comprehensive overview of all the classes that can be used in Solr, together with links to their Javadoc pages.

Most of the analyzers, tokenizers and filters are located in lucene-analyzers-common-4.10.1.jar ( example/solr-webapp/webapp/WEB-INF/lib/ ), so any entry without a location indicated can be found in that jar.
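
Components that live under contrib/ are not on the classpath by default, so their jars have to be loaded before they can be referenced from a fieldType. A minimal sketch of how this might look in solrconfig.xml follows. The relative paths are an assumption based on the layout of the example distribution and need to be adjusted to your own installation. The second directive is included on the assumption that some of these components (the ICU ones, for instance) also need third-party jars from the neighbouring lib/ directory.

```xml
<!-- Assumed paths, relative to the core's instance directory; adjust as needed -->
<lib dir="../../../contrib/analysis-extras/lucene-libs" regex=".*\.jar"/>
<!-- Third-party dependencies of the analysis-extras components (assumption) -->
<lib dir="../../../contrib/analysis-extras/lib" regex=".*\.jar"/>
```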

Note: all of this applies only to fields whose fieldType has the class solr.TextField. If your fieldType's class is solr.StrField, the value does not get analyzed (similar to using a plain KeywordTokenizerFactory).
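
To illustrate the difference, here is a small sketch of the two kinds of definition side by side. The fieldType names (string_exact, text_exact) are made up for this example.

```xml
<!-- Not analyzed at all: the whole value is indexed as-is -->
<fieldType name="string_exact" class="solr.StrField"/>

<!-- Analyzed, but the tokenizer emits the whole value as a single token.
     Unlike StrField, filters could be added after the tokenizer. -->
<fieldType name="text_exact" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
  </analyzer>
</fieldType>
```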

Non-chainable analyzers

The analyzers listed below are standalone: text goes in and a sequence of tokens comes out. The same analyzer is used during indexing and during search. Many of these come from Lucene itself. Only analyzers that can be used by Solr are listed here; Lucene has some other analyzers that cannot be used directly because they have non-standard initialization requirements.

<fieldType name="text_greek" class="solr.TextField">
  <analyzer class="org.apache.lucene.analysis.el.GreekAnalyzer"/>
</fieldType>

Analyzer in lucene-core-4.10.1.jar ( example/solr-webapp/webapp/WEB-INF/lib/ )
An Analyzer builds TokenStreams, which analyze text.

AnalyzerWrapper in lucene-core-4.10.1.jar ( example/solr-webapp/webapp/WEB-INF/lib/ )
Extension to Analyzer suitable for Analyzers which wrap other Analyzers.

ShingleAnalyzerWrapper
A ShingleAnalyzerWrapper wraps a ShingleFilter around another Analyzer.

ChineseAnalyzer
An Analyzer that tokenizes text with ChineseTokenizer and filters with ChineseFilter

DutchAnalyzer
Analyzer for Dutch language.

KeywordAnalyzer
"Tokenizes" the entire stream as a single token.

MorfologikAnalyzer in lucene-analyzers-morfologik-4.10.1.jar ( contrib/analysis-extras/lucene-libs/ )
org.apache.lucene.analysis.Analyzer using Morfologik library.

SimpleAnalyzer
An Analyzer that filters LetterTokenizer with LowerCaseFilter

SmartChineseAnalyzer in lucene-analyzers-smartcn-4.10.1.jar ( contrib/analysis-extras/lucene-libs/ )
SmartChineseAnalyzer is an analyzer for Chinese or mixed Chinese-English text.

StopwordAnalyzerBase
Base class for Analyzers that need to make use of stopword sets.

ArabicAnalyzer
Analyzer for Arabic.

ArmenianAnalyzer
Analyzer for Armenian.

BasqueAnalyzer
Analyzer for Basque.

BrazilianAnalyzer
Analyzer for Brazilian Portuguese language.

BulgarianAnalyzer
Analyzer for Bulgarian.

CatalanAnalyzer
Analyzer for Catalan.

CJKAnalyzer
An Analyzer that tokenizes text with StandardTokenizer, normalizes content with CJKWidthFilter, folds case with LowerCaseFilter, forms bigrams of CJK with CJKBigramFilter, and filters stopwords with StopFilter

ClassicAnalyzer
Filters ClassicTokenizer with ClassicFilter, LowerCaseFilter and StopFilter, using a list of English stop words.

CzechAnalyzer
Analyzer for Czech language.

DanishAnalyzer
Analyzer for Danish.

EnglishAnalyzer
Analyzer for English.

FinnishAnalyzer
Analyzer for Finnish.

FrenchAnalyzer
Analyzer for French language.

GalicianAnalyzer
Analyzer for Galician.

GermanAnalyzer
Analyzer for German language.

GreekAnalyzer
Analyzer for the Greek language.

HindiAnalyzer
Analyzer for Hindi.

HungarianAnalyzer
Analyzer for Hungarian.

IndonesianAnalyzer
Analyzer for Indonesian (Bahasa)

IrishAnalyzer
Analyzer for Irish.

ItalianAnalyzer
Analyzer for Italian.

JapaneseAnalyzer in lucene-analyzers-kuromoji-4.10.1.jar ( example/solr-webapp/webapp/WEB-INF/lib/ )
Analyzer for Japanese that uses morphological analysis.

LatvianAnalyzer
Analyzer for Latvian.

NorwegianAnalyzer
Analyzer for Norwegian.

PersianAnalyzer
Analyzer for Persian.

PolishAnalyzer in lucene-analyzers-stempel-4.10.1.jar ( contrib/analysis-extras/lucene-libs/ )
Analyzer for Polish.

PortugueseAnalyzer
Analyzer for Portuguese.

RomanianAnalyzer
Analyzer for Romanian.

RussianAnalyzer
Analyzer for Russian language.

SoraniAnalyzer
Analyzer for Sorani Kurdish.

SpanishAnalyzer
Analyzer for Spanish.

StandardAnalyzer
Filters StandardTokenizer with StandardFilter, LowerCaseFilter and StopFilter, using a list of English stop words.

StopAnalyzer
Filters LetterTokenizer with LowerCaseFilter and StopFilter.

SwedishAnalyzer
Analyzer for Swedish.

ThaiAnalyzer
Analyzer for Thai language.

TurkishAnalyzer
Analyzer for Turkish.

UAX29URLEmailAnalyzer
Filters org.apache.lucene.analysis.standard.UAX29URLEmailTokenizer with org.apache.lucene.analysis.standard.StandardFilter, org.apache.lucene.analysis.core.LowerCaseFilter and org.apache.lucene.analysis.core.StopFilter, using a list of English stop words.

WhitespaceAnalyzer
An Analyzer that uses WhitespaceTokenizer.

Chainable tokenizers and filters

A more flexible approach than a single all-encompassing analyzer is to chain and configure a tokenizer and some filters together to fit particular requirements. Solr allows up to three types of components in the chain:

Character filters
These are optional and operate on the original text (before it is split into tokens). They can change the text in any way imaginable by adding, removing or transforming characters. There can be none, one, or many of these filters, and they run in the sequence in which they are defined.
Tokenizer
There can only be one of these and its presence is compulsory. The tokenizer takes the text stream and splits it into a sequence of tokens with their positions. Strictly speaking, the output is a graph, but most of the time we can think of it as a sequence.
Token filters
These filters are also optional and work similarly to character filters, but on individual tokens. They can change tokens, remove them, or add new ones. Since they both consume and output tokens, they can naturally be chained.
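
The three component types described above can be sketched in a single chain. The fieldType name text_html and the stopwords.txt file are assumptions made for this illustration; the factories themselves are all listed further down this page.

```xml
<fieldType name="text_html" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Character filter (optional, zero or more): works on the raw text -->
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <!-- Tokenizer (compulsory, exactly one): splits the text into tokens -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- Token filters (optional, zero or more): applied in the order listed -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
  </analyzer>
</fieldType>
```

The order of the elements matters: character filters come first, then the tokenizer, then the token filters.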

Character filters

CharFilterFactory
Abstract parent class for analysis factories that create CharFilter instances.

HTMLStripCharFilterFactory
Factory for HTMLStripCharFilter.

ICUNormalizer2CharFilterFactory (multi) in lucene-analyzers-icu-4.10.1.jar ( contrib/analysis-extras/lucene-libs/ )
Factory for ICUNormalizer2CharFilter

JapaneseIterationMarkCharFilterFactory (multi) in lucene-analyzers-kuromoji-4.10.1.jar ( example/solr-webapp/webapp/WEB-INF/lib/ )
Factory for org.apache.lucene.analysis.ja.JapaneseIterationMarkCharFilter.

LegacyHTMLStripCharFilterFactory in solr-core-4.10.1.jar ( dist/ )
Factory for LegacyHTMLStripCharFilter.

MappingCharFilterFactory (multi)
Factory for MappingCharFilter.

PatternReplaceCharFilterFactory
Factory for PatternReplaceCharFilter.

PersianCharFilterFactory (multi)
Factory for PersianCharFilter.

Tokenizers

TokenizerFactory
Abstract parent class for analysis factories that create Tokenizer instances.

ArabicLetterTokenizerFactory
Factory for ArabicLetterTokenizer

ChineseTokenizerFactory
Factory for ChineseTokenizer

CJKTokenizerFactory
Factory for CJKTokenizer.

ClassicTokenizerFactory
Factory for ClassicTokenizer.

EdgeNGramTokenizerFactory
Creates new instances of EdgeNGramTokenizer.

HMMChineseTokenizerFactory in lucene-analyzers-smartcn-4.10.1.jar ( contrib/analysis-extras/lucene-libs/ )
Factory for HMMChineseTokenizer

ICUTokenizerFactory in lucene-analyzers-icu-4.10.1.jar ( contrib/analysis-extras/lucene-libs/ )
Factory for ICUTokenizer.

JapaneseTokenizerFactory in lucene-analyzers-kuromoji-4.10.1.jar ( example/solr-webapp/webapp/WEB-INF/lib/ )
Factory for org.apache.lucene.analysis.ja.JapaneseTokenizer.

KeywordTokenizerFactory
Factory for KeywordTokenizer.

LetterTokenizerFactory
Factory for LetterTokenizer.

LowerCaseTokenizerFactory (multi)
Factory for LowerCaseTokenizer.

NGramTokenizerFactory
Factory for NGramTokenizer.

PathHierarchyTokenizerFactory
Factory for PathHierarchyTokenizer.

PatternTokenizerFactory
Factory for PatternTokenizer.

RussianLetterTokenizerFactory

SmartChineseSentenceTokenizerFactory in lucene-analyzers-smartcn-4.10.1.jar ( contrib/analysis-extras/lucene-libs/ )
Factory for the SmartChineseAnalyzer SentenceTokenizer

StandardTokenizerFactory
Factory for StandardTokenizer.

ThaiTokenizerFactory
Factory for ThaiTokenizer.

UAX29URLEmailTokenizerFactory
Factory for UAX29URLEmailTokenizer.

UIMAAnnotationsTokenizerFactory in lucene-analyzers-uima-4.10.1.jar ( contrib/uima/lucene-libs/ )
org.apache.lucene.analysis.util.TokenizerFactory for UIMAAnnotationsTokenizer

UIMATypeAwareAnnotationsTokenizerFactory in lucene-analyzers-uima-4.10.1.jar ( contrib/uima/lucene-libs/ )
org.apache.lucene.analysis.util.TokenizerFactory for UIMATypeAwareAnnotationsTokenizer

WhitespaceTokenizerFactory
Factory for WhitespaceTokenizer.

WikipediaTokenizerFactory
Factory for WikipediaTokenizer.

Token filters

TokenFilterFactory
Abstract parent class for analysis factories that create org.apache.lucene.analysis.TokenFilter instances.

ApostropheFilterFactory
Factory for ApostropheFilter.

ArabicNormalizationFilterFactory (multi)
Factory for ArabicNormalizationFilter.

ArabicStemFilterFactory
Factory for ArabicStemFilter.

ASCIIFoldingFilterFactory (multi)
Factory for ASCIIFoldingFilter.

BaseManagedTokenFilterFactory in solr-core-4.10.1.jar ( dist/ )
Abstract base class for implementing TokenFilterFactory objects that are managed by the REST API.

ManagedStopFilterFactory in solr-core-4.10.1.jar ( dist/ )
TokenFilterFactory that uses the ManagedWordSetResource implementation for managing stop words using the REST API.

ManagedSynonymFilterFactory in solr-core-4.10.1.jar ( dist/ )
TokenFilterFactory and ManagedResource implementation for doing CRUD on synonyms using the REST API.

BeiderMorseFilterFactory in lucene-analyzers-phonetic-4.10.1.jar ( example/solr-webapp/webapp/WEB-INF/lib/ )
Factory for BeiderMorseFilter.

BrazilianStemFilterFactory
Factory for BrazilianStemFilter.

BulgarianStemFilterFactory
Factory for BulgarianStemFilter.

CapitalizationFilterFactory
Factory for CapitalizationFilter.

ChineseFilterFactory
Factory for ChineseFilter

CJKBigramFilterFactory
Factory for CJKBigramFilter.

CJKWidthFilterFactory (multi)
Factory for CJKWidthFilter.

ClassicFilterFactory
Factory for ClassicFilter.

CodepointCountFilterFactory
Factory for CodepointCountFilter.

CollationKeyFilterFactory (multi)
Factory for CollationKeyFilter.

CommonGramsFilterFactory
Constructs a CommonGramsFilter.

CommonGramsQueryFilterFactory
Construct CommonGramsQueryFilter.

CzechStemFilterFactory
Factory for CzechStemFilter.

DelimitedPayloadTokenFilterFactory
Factory for DelimitedPayloadTokenFilter.

DictionaryCompoundWordTokenFilterFactory
Factory for DictionaryCompoundWordTokenFilter.

DoubleMetaphoneFilterFactory in lucene-analyzers-phonetic-4.10.1.jar ( example/solr-webapp/webapp/WEB-INF/lib/ )
Factory for DoubleMetaphoneFilter.

EdgeNGramFilterFactory
Creates new instances of EdgeNGramTokenFilter.

ElisionFilterFactory (multi)
Factory for ElisionFilter.

EnglishMinimalStemFilterFactory
Factory for EnglishMinimalStemFilter.

EnglishPossessiveFilterFactory
Factory for EnglishPossessiveFilter.

FinnishLightStemFilterFactory
Factory for FinnishLightStemFilter.

FrenchLightStemFilterFactory
Factory for FrenchLightStemFilter.

FrenchMinimalStemFilterFactory
Factory for FrenchMinimalStemFilter.

GalicianMinimalStemFilterFactory
Factory for GalicianMinimalStemFilter.

GalicianStemFilterFactory
Factory for GalicianStemFilter.

GermanLightStemFilterFactory
Factory for GermanLightStemFilter.

GermanMinimalStemFilterFactory
Factory for GermanMinimalStemFilter.

GermanNormalizationFilterFactory (multi)
Factory for GermanNormalizationFilter.

GermanStemFilterFactory
Factory for GermanStemFilter.

GreekLowerCaseFilterFactory (multi)
Factory for GreekLowerCaseFilter.

GreekStemFilterFactory
Factory for GreekStemFilter.

HindiNormalizationFilterFactory (multi)
Factory for HindiNormalizationFilter.

HindiStemFilterFactory
Factory for HindiStemFilter.

HungarianLightStemFilterFactory
Factory for HungarianLightStemFilter.

HunspellStemFilterFactory
TokenFilterFactory that creates instances of HunspellStemFilter.

HyphenatedWordsFilterFactory
Factory for HyphenatedWordsFilter.

HyphenationCompoundWordTokenFilterFactory
Factory for HyphenationCompoundWordTokenFilter.

ICUCollationKeyFilterFactory (multi) in lucene-analyzers-icu-4.10.1.jar ( contrib/analysis-extras/lucene-libs/ )
Factory for ICUCollationKeyFilter.

ICUFoldingFilterFactory (multi) in lucene-analyzers-icu-4.10.1.jar ( contrib/analysis-extras/lucene-libs/ )
Factory for ICUFoldingFilter.

ICUNormalizer2FilterFactory (multi) in lucene-analyzers-icu-4.10.1.jar ( contrib/analysis-extras/lucene-libs/ )
Factory for ICUNormalizer2Filter

ICUTransformFilterFactory (multi) in lucene-analyzers-icu-4.10.1.jar ( contrib/analysis-extras/lucene-libs/ )
Factory for ICUTransformFilter.

IndicNormalizationFilterFactory (multi)
Factory for IndicNormalizationFilter.

IndonesianStemFilterFactory
Factory for IndonesianStemFilter.

IrishLowerCaseFilterFactory (multi)
Factory for IrishLowerCaseFilter.

ItalianLightStemFilterFactory
Factory for ItalianLightStemFilter.

JapaneseBaseFormFilterFactory in lucene-analyzers-kuromoji-4.10.1.jar ( example/solr-webapp/webapp/WEB-INF/lib/ )
Factory for org.apache.lucene.analysis.ja.JapaneseBaseFormFilter.

JapaneseKatakanaStemFilterFactory in lucene-analyzers-kuromoji-4.10.1.jar ( example/solr-webapp/webapp/WEB-INF/lib/ )
Factory for JapaneseKatakanaStemFilter.

JapanesePartOfSpeechStopFilterFactory in lucene-analyzers-kuromoji-4.10.1.jar ( example/solr-webapp/webapp/WEB-INF/lib/ )
Factory for org.apache.lucene.analysis.ja.JapanesePartOfSpeechStopFilter.

JapaneseReadingFormFilterFactory in lucene-analyzers-kuromoji-4.10.1.jar ( example/solr-webapp/webapp/WEB-INF/lib/ )
Factory for org.apache.lucene.analysis.ja.JapaneseReadingFormFilter.

KeepWordFilterFactory
Factory for KeepWordFilter.

KeywordMarkerFilterFactory
Factory for KeywordMarkerFilter.

KeywordRepeatFilterFactory
Factory for KeywordRepeatFilter.

KStemFilterFactory
Factory for KStemFilter.

LatvianStemFilterFactory
Factory for LatvianStemFilter.

LengthFilterFactory
Factory for LengthFilter.

LimitTokenCountFilterFactory
Factory for LimitTokenCountFilter.

LimitTokenPositionFilterFactory
Factory for LimitTokenPositionFilter.

LowerCaseFilterFactory (multi)
Factory for LowerCaseFilter.

MorfologikFilterFactory in lucene-analyzers-morfologik-4.10.1.jar ( contrib/analysis-extras/lucene-libs/ )
Filter factory for MorfologikFilter.

NGramFilterFactory
Factory for NGramTokenFilter.

NorwegianLightStemFilterFactory
Factory for NorwegianLightStemFilter.

NorwegianMinimalStemFilterFactory
Factory for NorwegianMinimalStemFilter.

NumericPayloadTokenFilterFactory
Factory for NumericPayloadTokenFilter.

PatternCaptureGroupFilterFactory
Factory for PatternCaptureGroupTokenFilter.

PatternReplaceFilterFactory
Factory for PatternReplaceFilter.

PersianNormalizationFilterFactory (multi)
Factory for PersianNormalizationFilter.

PhoneticFilterFactory in lucene-analyzers-phonetic-4.10.1.jar ( example/solr-webapp/webapp/WEB-INF/lib/ )
Factory for PhoneticFilter.

PorterStemFilterFactory
Factory for PorterStemFilter.

PortugueseLightStemFilterFactory
Factory for PortugueseLightStemFilter.

PortugueseMinimalStemFilterFactory
Factory for PortugueseMinimalStemFilter.

PortugueseStemFilterFactory
Factory for PortugueseStemFilter.

PositionFilterFactory
Factory for PositionFilter.

RemoveDuplicatesTokenFilterFactory
Factory for RemoveDuplicatesTokenFilter.

ReversedWildcardFilterFactory in solr-core-4.10.1.jar ( dist/ )
Factory for ReversedWildcardFilter-s.

ReverseStringFilterFactory
Factory for ReverseStringFilter.

RussianLightStemFilterFactory
Factory for RussianLightStemFilter.

ScandinavianFoldingFilterFactory
Factory for ScandinavianFoldingFilter.

ScandinavianNormalizationFilterFactory
Factory for org.apache.lucene.analysis.miscellaneous.ScandinavianNormalizationFilter.

ShingleFilterFactory
Factory for ShingleFilter.

SmartChineseWordTokenFilterFactory in lucene-analyzers-smartcn-4.10.1.jar ( contrib/analysis-extras/lucene-libs/ )
Factory for the SmartChineseAnalyzer WordTokenFilter

SnowballPorterFilterFactory
Factory for SnowballFilter, with configurable language

SoraniNormalizationFilterFactory (multi)
Factory for SoraniNormalizationFilter.

SoraniStemFilterFactory
Factory for SoraniStemFilter.

SpanishLightStemFilterFactory
Factory for SpanishLightStemFilter.

StandardFilterFactory
Factory for StandardFilter.

StemmerOverrideFilterFactory
Factory for StemmerOverrideFilter.

StempelPolishStemFilterFactory in lucene-analyzers-stempel-4.10.1.jar ( contrib/analysis-extras/lucene-libs/ )
Factory for StempelFilter using a Polish stemming table.

StopFilterFactory
Factory for StopFilter.

SwedishLightStemFilterFactory
Factory for SwedishLightStemFilter.

SynonymFilterFactory
Factory for SynonymFilter.

ThaiWordFilterFactory
Factory for ThaiWordFilter.

TokenOffsetPayloadTokenFilterFactory
Factory for TokenOffsetPayloadTokenFilter.

TrimFilterFactory
Factory for TrimFilter.

TruncateTokenFilterFactory
Factory for org.apache.lucene.analysis.miscellaneous.TruncateTokenFilter.

TurkishLowerCaseFilterFactory (multi)
Factory for TurkishLowerCaseFilter.

TypeAsPayloadTokenFilterFactory
Factory for TypeAsPayloadTokenFilter.

TypeTokenFilterFactory
Factory class for TypeTokenFilter.

UpperCaseFilterFactory (multi)
Factory for UpperCaseFilter.

WordDelimiterFilterFactory
Factory for WordDelimiterFilter.

Analyzer chain types

In Solr, text is analyzed twice: once when it gets indexed and once when it gets queried (searched).

It's possible to define the same chain for both of these phases:

<fieldType name="text_fa" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <charFilter class="solr.PersianCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ArabicNormalizationFilterFactory"/>
    <filter class="solr.PersianNormalizationFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_fa.txt"/>
  </analyzer>
</fieldType>

Alternatively, the index and query chains can be different:

<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

Finally, there is a third, usually hidden, chain type, which is used for multiterm analysis (queries like term* and [term1 TO term2]). It is hidden because it is usually constructed automatically from the explicitly defined chain, using only those components that are multiterm-aware. They are marked with (multi) in the lists above. The primary use case is to ensure that case-insensitive matches work as expected even when wildcards are used. You can read a more complete explanation in the Solr Wiki.

To define it explicitly, add an <analyzer type="multiterm"> section next to the index and query sections in the analyzer chain definition.
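
As a minimal sketch, an explicit multiterm chain might look like the following. The fieldType name text_wild is made up for this example, and the choice of KeywordTokenizerFactory is an assumption based on the fact that a wildcard or range term has to stay a single token.

```xml
<fieldType name="text_wild" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <!-- Applied to wildcard, prefix and range terms only -->
  <analyzer type="multiterm">
    <!-- Keeps the term as a single token -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <!-- Multiterm-aware, so a query like Term* still matches lowercased index terms -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```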

Short Names

Note that most Analyzer, Tokenizer and Filter factories can be referenced by a short name using the solr. prefix, such as:
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
Only non-core components require the full class name, including the package name.
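
For comparison, the two forms below should refer to the same class. The package name shown is the one this factory has in the 4.10.1 Lucene jars; treat it as an assumption for other versions and check the Javadoc.

```xml
<!-- Short name -->
<tokenizer class="solr.WhitespaceTokenizerFactory"/>

<!-- Full class name, including the package -->
<tokenizer class="org.apache.lucene.analysis.core.WhitespaceTokenizerFactory"/>
```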


Previous versions of this document

You can also find archive versions of this document for version 4.9.0, version 4.8.0, and version 4.7.0
