Overview

To search text efficiently and effectively, Solr (mostly Lucene, actually) splits the text into tokens both during indexing and during query (search). Those tokens can also be pre- and post-filtered for additional flexibility. This allows for things like case-insensitive search, matching misspelt product names, synonyms, and so on.

To achieve all this flexibility, Solr comes with quite a variety of methods to manipulate the text. Understanding what filters and tokenizers are available and what they actually do is a major stumbling block for new Solr users. This page provides a comprehensive overview of all the classes that can be used in Solr, together with links to their Javadoc pages.

Most of the analyzers, tokenizers and filters are located in lucene-analyzers-common-6.2.0.jar ( server/solr-webapp/webapp/WEB-INF/lib/ ), so any entry without a location indicated can be found in that jar.

Note: all of this applies only to text fields whose fieldType class is solr.TextField. If your fieldType's class is solr.StrField, the field does not get analyzed (similar to using a plain KeywordTokenizerFactory).
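
As a rough illustration of that difference, the sketch below declares one unanalyzed type and one analyzed type that behaves similarly. The fieldType names string_example and text_exact_example are made up for this example; the class names are the ones mentioned in the note.

<fieldType name="string_example" class="solr.StrField"/>

<fieldType name="text_exact_example" class="solr.TextField">
  <analyzer>
    <!-- Emits the whole input as a single token, much like StrField -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
  </analyzer>
</fieldType>

One reason to prefer the TextField variant is that token filters (for example, lowercasing) can still be added to its analyzer, whereas a StrField is not analyzed at all.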

Non-chainable analyzers

The analyzers listed below are standalone: they take in text and produce a sequence of tokens. The same analyzer is used during indexing and during search. Many of these come from Lucene itself. Only analyzers that can be used by Solr are listed here. Lucene has some other analyzers that cannot be used directly because they have non-standard initialization requirements.

<fieldType name="text_greek" class="solr.TextField">
  <analyzer class="org.apache.lucene.analysis.el.GreekAnalyzer"/>
</fieldType>

Analyzer in lucene-core-6.2.0.jar ( server/solr-webapp/webapp/WEB-INF/lib/ )
An Analyzer builds TokenStreams, which analyze text.

AnalyzerWrapper in lucene-core-6.2.0.jar ( server/solr-webapp/webapp/WEB-INF/lib/ )
Extension to Analyzer suitable for Analyzers which wrap other Analyzers.

ShingleAnalyzerWrapper
A ShingleAnalyzerWrapper wraps a ShingleFilter around another Analyzer.

DutchAnalyzer
Analyzer for Dutch language.

KeywordAnalyzer
"Tokenizes" the entire stream as a single token.

MorfologikAnalyzer in lucene-analyzers-morfologik-6.2.0.jar ( contrib/analysis-extras/lucene-libs/ )
org.apache.lucene.analysis.Analyzer using Morfologik library.

SimpleAnalyzer
An Analyzer that filters LetterTokenizer with LowerCaseFilter

SmartChineseAnalyzer in lucene-analyzers-smartcn-6.2.0.jar ( contrib/analysis-extras/lucene-libs/ )
SmartChineseAnalyzer is an analyzer for Chinese or mixed Chinese-English text.

StopwordAnalyzerBase in lucene-core-6.2.0.jar ( server/solr-webapp/webapp/WEB-INF/lib/ )
Base class for Analyzers that need to make use of stopword sets.

ArabicAnalyzer
Analyzer for Arabic.

ArmenianAnalyzer
Analyzer for Armenian.

BasqueAnalyzer
Analyzer for Basque.

BrazilianAnalyzer
Analyzer for Brazilian Portuguese language.

BulgarianAnalyzer
Analyzer for Bulgarian.

CatalanAnalyzer
Analyzer for Catalan.

CJKAnalyzer
An Analyzer that tokenizes text with StandardTokenizer, normalizes content with CJKWidthFilter, folds case with LowerCaseFilter, forms bigrams of CJK with CJKBigramFilter, and filters stopwords with StopFilter

ClassicAnalyzer
Filters ClassicTokenizer with ClassicFilter, LowerCaseFilter and StopFilter, using a list of English stop words.

CzechAnalyzer
Analyzer for Czech language.

DanishAnalyzer
Analyzer for Danish.

EnglishAnalyzer
Analyzer for English.

FinnishAnalyzer
Analyzer for Finnish.

FrenchAnalyzer
Analyzer for French language.

GalicianAnalyzer
Analyzer for Galician.

GermanAnalyzer
Analyzer for German language.

GreekAnalyzer
Analyzer for the Greek language.

HindiAnalyzer
Analyzer for Hindi.

HungarianAnalyzer
Analyzer for Hungarian.

IndonesianAnalyzer
Analyzer for Indonesian (Bahasa)

IrishAnalyzer
Analyzer for Irish.

ItalianAnalyzer
Analyzer for Italian.

JapaneseAnalyzer in lucene-analyzers-kuromoji-6.2.0.jar ( server/solr-webapp/webapp/WEB-INF/lib/ )
Analyzer for Japanese that uses morphological analysis.

LatvianAnalyzer
Analyzer for Latvian.

LithuanianAnalyzer
Analyzer for Lithuanian.

NorwegianAnalyzer
Analyzer for Norwegian.

PersianAnalyzer
Analyzer for Persian.

PolishAnalyzer in lucene-analyzers-stempel-6.2.0.jar ( contrib/analysis-extras/lucene-libs/ )
Analyzer for Polish.

PortugueseAnalyzer
Analyzer for Portuguese.

RomanianAnalyzer
Analyzer for Romanian.

RussianAnalyzer
Analyzer for Russian language.

SoraniAnalyzer
Analyzer for Sorani Kurdish.

SpanishAnalyzer
Analyzer for Spanish.

StandardAnalyzer in lucene-core-6.2.0.jar ( server/solr-webapp/webapp/WEB-INF/lib/ )
Filters StandardTokenizer with StandardFilter, LowerCaseFilter and StopFilter, using a list of English stop words.

StopAnalyzer
Filters LetterTokenizer with LowerCaseFilter and StopFilter.

SwedishAnalyzer
Analyzer for Swedish.

ThaiAnalyzer
Analyzer for Thai language.

TurkishAnalyzer
Analyzer for Turkish.

UAX29URLEmailAnalyzer
Filters org.apache.lucene.analysis.standard.UAX29URLEmailTokenizer with org.apache.lucene.analysis.standard.StandardFilter, org.apache.lucene.analysis.LowerCaseFilter and org.apache.lucene.analysis.StopFilter, using a list of English stop words.

UkrainianMorfologikAnalyzer in lucene-analyzers-morfologik-6.2.0.jar ( contrib/analysis-extras/lucene-libs/ )
A dictionary-based Analyzer for Ukrainian.

UnicodeWhitespaceAnalyzer
An Analyzer that uses UnicodeWhitespaceTokenizer.

WhitespaceAnalyzer
An Analyzer that uses WhitespaceTokenizer.

Chainable tokenizers and filters

A more flexible approach than a single all-encompassing analyzer is to chain and configure a tokenizer and some filters together to fit particular customer requirements. Solr allows up to three types of components in the chain:

Character filters
These are optional and operate on the original text (before it is split into tokens). They can change the text in any way imaginable by adding, removing or transforming characters. There can be none, one, or many of these filters, and they operate in the sequence defined.
Tokenizer
There can only be one of these and its presence is compulsory. The tokenizer takes the text stream and splits out a sequence of tokens with their positions. In reality the output is a graph, but most of the time we can think of it as a sequence.
Token filters
These filters are also optional and work similarly to character filters, but on individual tokens. They can change tokens, remove them or add additional ones. They output tokens, so naturally they can also be chained.
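
A minimal sketch of a chain that uses all three component types, listed in the order they run. The fieldType name text_html_example is hypothetical; the factories are all taken from the lists on this page.

<fieldType name="text_html_example" class="solr.TextField">
  <analyzer>
    <!-- Character filter: strips HTML constructs from the raw text -->
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <!-- Tokenizer: exactly one, splits the text into tokens -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- Token filters: applied to the tokens in the order listed -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
  </analyzer>
</fieldType>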

Character filters

CharFilterFactory
Abstract parent class for analysis factories that create CharFilter instances.

HTMLStripCharFilterFactory (Sample mentions: solr-1 )
A CharFilter that wraps another Reader and attempts to strip out HTML constructs.

ICUNormalizer2CharFilterFactory (multi) in lucene-analyzers-icu-6.2.0.jar ( contrib/analysis-extras/lucene-libs/ )
Normalize token text with ICU's Normalizer2.

JapaneseIterationMarkCharFilterFactory (multi) in lucene-analyzers-kuromoji-6.2.0.jar ( server/solr-webapp/webapp/WEB-INF/lib/ )
Normalizes Japanese horizontal iteration marks (odoriji) to their expanded form.

MappingCharFilterFactory (multi)
Simplistic CharFilter that applies the mappings contained in a NormalizeCharMap to the character stream, and correcting the resulting changes to the offsets.

PatternReplaceCharFilterFactory (Sample mentions: indexing-book-1 solr-in-action-book-1 )
CharFilter that uses a regular expression for the target of replace string.

PersianCharFilterFactory (multi) (Sample mentions: solr-1 )
CharFilter that replaces instances of Zero-width non-joiner with an ordinary space.
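
As an illustration of how a character filter is configured, the sketch below uses PatternReplaceCharFilterFactory to drop hyphens that sit between two digits before the text reaches the tokenizer, so that an input such as 555-1234 should be tokenized as 5551234. The fieldType name and the regular expression are assumptions chosen for this example.

<fieldType name="text_digits_example" class="solr.TextField">
  <analyzer>
    <!-- Keep the digit before the hyphen ($1) and drop the hyphen itself -->
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="(\d)-(?=\d)" replacement="$1"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
  </analyzer>
</fieldType>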

Tokenizers

TokenizerFactory
Abstract parent class for analysis factories that create Tokenizer instances.

ClassicTokenizerFactory
A grammar-based tokenizer constructed with JFlex

EdgeNGramTokenizerFactory
Creates new instances of EdgeNGramTokenizer.

HMMChineseTokenizerFactory in lucene-analyzers-smartcn-6.2.0.jar ( contrib/analysis-extras/lucene-libs/ )
Tokenizer for Chinese or mixed Chinese-English text.

ICUTokenizerFactory in lucene-analyzers-icu-6.2.0.jar ( contrib/analysis-extras/lucene-libs/ ) (Sample mentions: typo3-1 )
Breaks text into words according to UAX #29: Unicode Text Segmentation (http://www.unicode.org/reports/tr29/)

JapaneseTokenizerFactory in lucene-analyzers-kuromoji-6.2.0.jar ( server/solr-webapp/webapp/WEB-INF/lib/ ) (Sample mentions: solr-1 )
Tokenizer for Japanese that uses morphological analysis.

KeywordTokenizerFactory (Sample mentions: solr-1 )
Emits the entire input as a single token.

LetterTokenizerFactory
A LetterTokenizer is a tokenizer that divides text at non-letters.

LowerCaseTokenizerFactory (multi)
LowerCaseTokenizer performs the function of LetterTokenizer and LowerCaseFilter together.

NGramTokenizerFactory
Tokenizes the input into n-grams of the given size(s).

PathHierarchyTokenizerFactory (Sample mentions: solr-1 blacklight-1 )
Tokenizer for path-like hierarchies.

PatternTokenizerFactory (Sample mentions: solr-in-action-book-1 )
This tokenizer uses regex pattern matching to construct distinct tokens for the input stream.

StandardTokenizerFactory (Sample mentions: solr-1 )
A grammar-based tokenizer constructed with JFlex.

ThaiTokenizerFactory (Sample mentions: solr-1 )
Tokenizer that uses BreakIterator to tokenize Thai text.

UAX29URLEmailTokenizerFactory (Sample mentions: solr-1 )
This class implements Word Break rules from the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29. URLs and email addresses are also tokenized according to the relevant RFCs.

UIMAAnnotationsTokenizerFactory in lucene-analyzers-uima-6.2.0.jar ( contrib/uima/lucene-libs/ )
org.apache.lucene.analysis.util.TokenizerFactory for UIMAAnnotationsTokenizer

UIMATypeAwareAnnotationsTokenizerFactory in lucene-analyzers-uima-6.2.0.jar ( contrib/uima/lucene-libs/ )
org.apache.lucene.analysis.util.TokenizerFactory for UIMATypeAwareAnnotationsTokenizer

WhitespaceTokenizerFactory (Sample mentions: solr-1 )
A tokenizer that divides text at whitespace characters as defined by Character#isWhitespace(int).

WikipediaTokenizerFactory
Extension of StandardTokenizer that is aware of Wikipedia syntax.
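
To show how the choice of tokenizer changes what gets indexed, here is a sketch using PathHierarchyTokenizerFactory. With the delimiter set to /, an indexed value such as /books/fiction/scifi is expected to produce the tokens /books, /books/fiction and /books/fiction/scifi, so a query for a parent path also matches its descendants. The fieldType name is hypothetical, and the use of separate index and query analyzers is explained in the "Analyzer chain types" section.

<fieldType name="path_example" class="solr.TextField">
  <analyzer type="index">
    <!-- Emits one token per path prefix -->
    <tokenizer class="solr.PathHierarchyTokenizerFactory" delimiter="/"/>
  </analyzer>
  <analyzer type="query">
    <!-- The query path is kept whole and matched against the indexed prefixes -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
  </analyzer>
</fieldType>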

Token filters

TokenFilterFactory
Abstract parent class for analysis factories that create org.apache.lucene.analysis.TokenFilter instances.

ApostropheFilterFactory (Sample mentions: solr-1 )
Strips all characters after an apostrophe (including the apostrophe itself).

ArabicNormalizationFilterFactory (multi) (Sample mentions: solr-1 )
A TokenFilter that applies ArabicNormalizer to normalize the orthography.

ArabicStemFilterFactory (Sample mentions: solr-1 )
A TokenFilter that applies ArabicStemmer to stem Arabic words.

ASCIIFoldingFilterFactory (multi) (Sample mentions: solr-in-action-book-1 )
This class converts alphabetic, numeric, and symbolic Unicode characters which are not in the first 127 ASCII characters (the "Basic Latin" Unicode block) into their ASCII equivalents, if one exists.

BaseManagedTokenFilterFactory in solr-core-6.2.0.jar ( dist/ )
Abstract base class for implementing TokenFilterFactory objects that are managed by the REST API.

ManagedStopFilterFactory in solr-core-6.2.0.jar ( dist/ ) (Sample mentions: solr-1 typo3-1 typo3-2 typo3-3 typo3-4 typo3-5 typo3-6 typo3-7 typo3-8 typo3-9 typo3-10 typo3-11 typo3-12 typo3-13 typo3-14 typo3-15 typo3-16 typo3-17 typo3-18 typo3-19 typo3-20 typo3-21 typo3-22 typo3-23 typo3-24 typo3-25 typo3-26 typo3-27 typo3-28 typo3-29 typo3-30 typo3-31 )
TokenFilterFactory that uses the ManagedWordSetResource implementation for managing stop words using the REST API.

ManagedSynonymFilterFactory in solr-core-6.2.0.jar ( dist/ ) (Sample mentions: solr-1 typo3-1 typo3-2 typo3-3 typo3-4 typo3-5 typo3-6 typo3-7 typo3-8 typo3-9 typo3-10 typo3-11 typo3-12 typo3-13 typo3-14 typo3-15 typo3-16 typo3-17 typo3-18 typo3-19 typo3-20 typo3-21 typo3-22 typo3-23 typo3-24 typo3-25 typo3-26 typo3-27 typo3-28 typo3-29 typo3-30 typo3-31 typo3-32 typo3-33 typo3-34 typo3-35 typo3-36 typo3-37 )
TokenFilterFactory and ManagedResource implementation for doing CRUD on synonyms using the REST API.

BeiderMorseFilterFactory in lucene-analyzers-phonetic-6.2.0.jar ( server/solr-webapp/webapp/WEB-INF/lib/ )
TokenFilter for Beider-Morse phonetic encoding.

BrazilianStemFilterFactory (Sample mentions: typo3-1 )
A TokenFilter that applies BrazilianStemmer.

BulgarianStemFilterFactory (Sample mentions: solr-1 )
A TokenFilter that applies BulgarianStemmer to stem Bulgarian words.

CapitalizationFilterFactory
A filter to apply normal capitalization rules to Tokens.

CJKBigramFilterFactory (Sample mentions: solr-1 )
Forms bigrams of CJK terms that are generated from StandardTokenizer or ICUTokenizer.

CJKWidthFilterFactory (multi) (Sample mentions: solr-1 )
A TokenFilter that normalizes CJK width differences:

  • Folds fullwidth ASCII variants into the equivalent basic latin
  • Folds halfwidth Katakana variants into the equivalent kana

ClassicFilterFactory
Normalizes tokens extracted with ClassicTokenizer.

CodepointCountFilterFactory
Removes words that are too long or too short from the stream.

CommonGramsFilterFactory
Constructs a CommonGramsFilter.

CommonGramsQueryFilterFactory
Constructs a CommonGramsQueryFilter.

CzechStemFilterFactory (Sample mentions: solr-1 )
A TokenFilter that applies CzechStemmer to stem Czech words.

DaitchMokotoffSoundexFilterFactory in lucene-analyzers-phonetic-6.2.0.jar ( server/solr-webapp/webapp/WEB-INF/lib/ )
Create tokens for phonetic matches based on Daitch–Mokotoff Soundex.

DateRecognizerFilterFactory
Filters all tokens that cannot be parsed to a date, using the provided DateFormat.

DecimalDigitFilterFactory (multi)
Folds all Unicode digits in [:General_Category=Decimal_Number:] to Basic Latin digits (0-9).

DelimitedPayloadTokenFilterFactory (Sample mentions: solr-1 )
Characters before the delimiter are the "token", those after are the payload.

DictionaryCompoundWordTokenFilterFactory (Sample mentions: typo3-1 )
A org.apache.lucene.analysis.TokenFilter that decomposes compound words found in many Germanic languages.

DoubleMetaphoneFilterFactory in lucene-analyzers-phonetic-6.2.0.jar ( server/solr-webapp/webapp/WEB-INF/lib/ ) (Sample mentions: solr-1 )
Filter for DoubleMetaphone (supporting secondary codes)

EdgeNGramFilterFactory (Sample mentions: solr-in-action-book-1 )
Creates new instances of EdgeNGramTokenFilter.

ElisionFilterFactory (multi) (Sample mentions: solr-1 solr-2 solr-3 solr-4 typo3-1 )
Removes elisions from a TokenStream.

EnglishMinimalStemFilterFactory (Sample mentions: solr-1 )
A TokenFilter that applies EnglishMinimalStemmer to stem English words.

EnglishPossessiveFilterFactory (Sample mentions: solr-1 )
TokenFilter that removes possessives (trailing 's) from words.

FingerprintFilterFactory
Filter outputs a single token which is a concatenation of the sorted and de-duplicated set of input tokens.

FinnishLightStemFilterFactory
A TokenFilter that applies FinnishLightStemmer to stem Finnish words.

FrenchLightStemFilterFactory (Sample mentions: solr-1 )
A TokenFilter that applies FrenchLightStemmer to stem French words.

FrenchMinimalStemFilterFactory
A TokenFilter that applies FrenchMinimalStemmer to stem French words.

GalicianMinimalStemFilterFactory
A TokenFilter that applies GalicianMinimalStemmer to stem Galician words.

GalicianStemFilterFactory (Sample mentions: solr-1 )
A TokenFilter that applies GalicianStemmer to stem Galician words.

GermanLightStemFilterFactory (Sample mentions: solr-1 )
A TokenFilter that applies GermanLightStemmer to stem German words.

GermanMinimalStemFilterFactory
A TokenFilter that applies GermanMinimalStemmer to stem German words.

GermanNormalizationFilterFactory (multi) (Sample mentions: solr-1 )
Normalizes German characters according to the heuristics of the German2 snowball algorithm.

GermanStemFilterFactory
A TokenFilter that stems German words.

GreekLowerCaseFilterFactory (multi) (Sample mentions: solr-1 )
Normalizes token text to lower case, removes some Greek diacritics, and standardizes final sigma to sigma.

GreekStemFilterFactory (Sample mentions: solr-1 )
A TokenFilter that applies GreekStemmer to stem Greek words.

HindiNormalizationFilterFactory (multi) (Sample mentions: solr-1 )
A TokenFilter that applies HindiNormalizer to normalize the orthography.

HindiStemFilterFactory (Sample mentions: solr-1 )
A TokenFilter that applies HindiStemmer to stem Hindi words.

HungarianLightStemFilterFactory
A TokenFilter that applies HungarianLightStemmer to stem Hungarian words.

HunspellStemFilterFactory
TokenFilterFactory that creates instances of HunspellStemFilter.

HyphenatedWordsFilterFactory
When the plain text is extracted from documents, we will often have many words hyphenated and broken into two lines.

HyphenationCompoundWordTokenFilterFactory
A org.apache.lucene.analysis.TokenFilter that decomposes compound words found in many Germanic languages.

ICUFoldingFilterFactory (multi) in lucene-analyzers-icu-6.2.0.jar ( contrib/analysis-extras/lucene-libs/ ) (Sample mentions: indexing-book-1 )
A TokenFilter that applies search term folding to Unicode text, applying foldings from UTR#30 Character Foldings.

ICUNormalizer2FilterFactory (multi) in lucene-analyzers-icu-6.2.0.jar ( contrib/analysis-extras/lucene-libs/ )
Normalize token text with ICU's com.ibm.icu.text.Normalizer2

ICUTransformFilterFactory (multi) in lucene-analyzers-icu-6.2.0.jar ( contrib/analysis-extras/lucene-libs/ )
A TokenFilter that transforms text with ICU.

IndicNormalizationFilterFactory (multi) (Sample mentions: solr-1 )
A TokenFilter that applies IndicNormalizer to normalize text in Indian Languages.

IndonesianStemFilterFactory (Sample mentions: solr-1 )
A TokenFilter that applies IndonesianStemmer to stem Indonesian words.

IrishLowerCaseFilterFactory (multi) (Sample mentions: solr-1 )
Normalises token text to lower case, handling t-prothesis and n-eclipsis (i.e., that 'nAthair' should become 'n-athair')

ItalianLightStemFilterFactory (Sample mentions: solr-1 )
A TokenFilter that applies ItalianLightStemmer to stem Italian words.

JapaneseBaseFormFilterFactory in lucene-analyzers-kuromoji-6.2.0.jar ( server/solr-webapp/webapp/WEB-INF/lib/ ) (Sample mentions: solr-1 )
Replaces term text with the BaseFormAttribute.

JapaneseKatakanaStemFilterFactory in lucene-analyzers-kuromoji-6.2.0.jar ( server/solr-webapp/webapp/WEB-INF/lib/ ) (Sample mentions: solr-1 )
A TokenFilter that normalizes common katakana spelling variations ending in a long sound character by removing this character (U+30FC).

JapaneseNumberFilterFactory in lucene-analyzers-kuromoji-6.2.0.jar ( server/solr-webapp/webapp/WEB-INF/lib/ )
A TokenFilter that normalizes Japanese numbers (kansūji) to regular Arabic decimal numbers in half-width characters.

JapanesePartOfSpeechStopFilterFactory in lucene-analyzers-kuromoji-6.2.0.jar ( server/solr-webapp/webapp/WEB-INF/lib/ ) (Sample mentions: solr-1 )
Removes tokens that match a set of part-of-speech tags.

JapaneseReadingFormFilterFactory in lucene-analyzers-kuromoji-6.2.0.jar ( server/solr-webapp/webapp/WEB-INF/lib/ )
A org.apache.lucene.analysis.TokenFilter that replaces the term attribute with the reading of a token in either katakana or romaji form.

KeepWordFilterFactory
A TokenFilter that only keeps tokens with text contained in the required words.

KeywordMarkerFilterFactory (Sample mentions: solr-1 typo3-1 typo3-2 typo3-3 typo3-4 typo3-5 typo3-6 typo3-7 typo3-8 typo3-9 )
Marks terms as keywords via the KeywordAttribute.

KeywordRepeatFilterFactory
This TokenFilter emits each incoming token twice, once as keyword and once as non-keyword; in other words, once with KeywordAttribute#setKeyword(boolean) set to true and once set to false.

KStemFilterFactory (Sample mentions: solr-in-action-book-1 )
A high-performance kstem filter for English.

LatvianStemFilterFactory (Sample mentions: solr-1 )
A TokenFilter that applies LatvianStemmer to stem Latvian words.

LengthFilterFactory (Sample mentions: solr-1 solr-2 )
Removes words that are too long or too short from the stream.

LimitTokenCountFilterFactory
This TokenFilter limits the number of tokens while indexing.

LimitTokenOffsetFilterFactory
Lets all tokens pass through until it sees one with a start offset <= a configured limit, which won't pass and ends the stream.

LimitTokenPositionFilterFactory
This TokenFilter limits its emitted tokens to those with positions that are not greater than the configured limit.

LowerCaseFilterFactory (multi) (Sample mentions: solr-1 )
Normalizes token text to lower case.

MinHashFilterFactory
TokenFilterFactory for MinHashFilter.

MorfologikFilterFactory in lucene-analyzers-morfologik-6.2.0.jar ( contrib/analysis-extras/lucene-libs/ )
Filter factory for MorfologikFilter.

NGramFilterFactory
Tokenizes the input into n-grams of the given size(s).

NorwegianLightStemFilterFactory
A TokenFilter that applies NorwegianLightStemmer to stem Norwegian words.

NorwegianMinimalStemFilterFactory
A TokenFilter that applies NorwegianMinimalStemmer to stem Norwegian words.

NumericPayloadTokenFilterFactory
Assigns a payload to a token based on the org.apache.lucene.analysis.Token#type()

PatternCaptureGroupFilterFactory
CaptureGroup uses Java regexes to emit multiple tokens - one for each capture group in one or more patterns.

PatternReplaceFilterFactory (Sample mentions: solr-1 solr-2 solr-3 )
A TokenFilter which applies a Pattern to each token in the stream, replacing match occurrences with the specified replacement string.

PersianNormalizationFilterFactory (multi) (Sample mentions: solr-1 )
A TokenFilter that applies PersianNormalizer to normalize the orthography.

PhoneticFilterFactory in lucene-analyzers-phonetic-6.2.0.jar ( server/solr-webapp/webapp/WEB-INF/lib/ )
Create tokens for phonetic matches.

PorterStemFilterFactory (Sample mentions: solr-1 )
Transforms the token stream as per the Porter stemming algorithm.

PortugueseLightStemFilterFactory (Sample mentions: solr-1 )
A TokenFilter that applies PortugueseLightStemmer to stem Portuguese words.

PortugueseMinimalStemFilterFactory
A TokenFilter that applies PortugueseMinimalStemmer to stem Portuguese words.

PortugueseStemFilterFactory
A TokenFilter that applies PortugueseStemmer to stem Portuguese words.

RemoveDuplicatesTokenFilterFactory (Sample mentions: solr-1 )
A TokenFilter which filters out Tokens at the same position and Term text as the previous token in the stream.

ReversedWildcardFilterFactory in solr-core-6.2.0.jar ( dist/ ) (Sample mentions: solr-1 )
This class produces a special form of reversed tokens, suitable for better handling of leading wildcards.

ReverseStringFilterFactory
Reverse token string, for example "country" => "yrtnuoc".

RussianLightStemFilterFactory (Sample mentions: indexing-book-1 )
A TokenFilter that applies RussianLightStemmer to stem Russian words.

ScandinavianFoldingFilterFactory (multi)
This filter folds Scandinavian characters åÅäæÄÆ->a and öÖøØ->o.

ScandinavianNormalizationFilterFactory (multi)
This filter normalizes the use of the interchangeable Scandinavian characters æÆäÄöÖøØ and folded variants (aa, ao, ae, oe and oo) by transforming them to åÅæÆøØ.

SerbianNormalizationFilterFactory (multi)
Normalizes Serbian Cyrillic and Latin characters to "bald" Latin.

ShingleFilterFactory (Sample mentions: solr-1 )
A ShingleFilter constructs shingles (token n-grams) from a token stream.

SnowballPorterFilterFactory (Sample mentions: solr-1 solr-2 solr-3 solr-4 solr-5 solr-6 solr-7 solr-8 solr-9 solr-10 solr-11 solr-12 solr-13 typo3-1 typo3-2 typo3-3 typo3-4 typo3-5 typo3-6 typo3-7 typo3-8 typo3-9 typo3-10 typo3-11 typo3-12 typo3-13 typo3-14 typo3-15 typo3-16 typo3-17 typo3-18 typo3-19 blacklight-1 )
A filter that stems words using a Snowball-generated stemmer.

SoraniNormalizationFilterFactory (multi) (Sample mentions: solr-1 )
A TokenFilter that applies SoraniNormalizer to normalize the orthography.

SoraniStemFilterFactory (Sample mentions: solr-1 )
A TokenFilter that applies SoraniStemmer to stem Sorani words.

SpanishLightStemFilterFactory (Sample mentions: solr-1 )
A TokenFilter that applies SpanishLightStemmer to stem Spanish words.

StandardFilterFactory (Sample mentions: typo3-1 )
Normalizes tokens extracted with StandardTokenizer.

StemmerOverrideFilterFactory (Sample mentions: solr-1 )
Provides the ability to override any KeywordAttribute aware stemmer with custom dictionary-based stemming.

StempelPolishStemFilterFactory in lucene-analyzers-stempel-6.2.0.jar ( contrib/analysis-extras/lucene-libs/ ) (Sample mentions: typo3-1 )
Transforms the token stream as per the stemming algorithm.

StopFilterFactory (Sample mentions: solr-1 solr-2 solr-3 solr-4 solr-5 solr-6 solr-7 solr-8 solr-9 solr-10 solr-11 solr-12 solr-13 solr-14 solr-15 solr-16 solr-17 solr-18 solr-19 solr-20 solr-21 solr-22 solr-23 solr-24 solr-25 solr-26 solr-27 solr-28 solr-29 solr-30 solr-31 solr-32 solr-33 indexing-book-1 indexing-book-2 blacklight-1 )
Removes stop words from a token stream.

SuggestStopFilterFactory in lucene-suggest-6.2.0.jar ( server/solr-webapp/webapp/WEB-INF/lib/ )
Like StopFilter except it will not remove the last token if that token was not followed by some token separator.

SwedishLightStemFilterFactory
A TokenFilter that applies SwedishLightStemmer to stem Swedish words.

SynonymFilterFactory (Sample mentions: solr-1 solr-2 )
Matches single or multi word synonyms in a token stream.

TokenOffsetPayloadTokenFilterFactory
Adds the OffsetAttribute#startOffset() and OffsetAttribute#endOffset() to the token as a payload. The first 4 bytes are the start offset.

TrimFilterFactory (Sample mentions: solr-1 )
Trims leading and trailing whitespace from Tokens in the stream.

TruncateTokenFilterFactory
A token filter for truncating the terms into a specific length.

TurkishLowerCaseFilterFactory (multi) (Sample mentions: solr-1 )
Normalizes Turkish token text to lower case.

TypeAsPayloadTokenFilterFactory
Makes the org.apache.lucene.analysis.Token#type() a payload.

TypeTokenFilterFactory (Sample mentions: solr-1 indexing-book-1 )
Factory class for TypeTokenFilter.

UpperCaseFilterFactory (multi)
Normalizes token text to UPPER CASE.

WordDelimiterFilterFactory (Sample mentions: solr-1 solr-2 solr-3 typo3-1 typo3-2 typo3-3 typo3-4 typo3-5 typo3-6 typo3-7 typo3-8 typo3-9 typo3-10 typo3-11 typo3-12 typo3-13 typo3-14 typo3-15 typo3-16 typo3-17 typo3-18 typo3-19 typo3-20 typo3-21 typo3-22 typo3-23 typo3-24 typo3-25 typo3-26 typo3-27 typo3-28 typo3-29 typo3-30 typo3-31 typo3-32 typo3-33 typo3-34 typo3-35 typo3-36 typo3-37 typo3-38 typo3-39 typo3-40 typo3-41 typo3-42 typo3-43 typo3-44 typo3-45 typo3-46 typo3-47 typo3-48 typo3-49 typo3-50 typo3-51 typo3-52 typo3-53 typo3-54 typo3-55 typo3-56 typo3-57 typo3-58 typo3-59 typo3-60 typo3-61 typo3-62 typo3-63 typo3-64 typo3-65 typo3-66 typo3-67 typo3-68 typo3-69 typo3-70 typo3-71 typo3-72 typo3-73 typo3-74 typo3-75 typo3-76 typo3-77 typo3-78 typo3-79 typo3-80 typo3-81 typo3-82 typo3-83 typo3-84 typo3-85 typo3-86 typo3-87 typo3-88 typo3-89 typo3-90 solr-in-action-book-1 solr-in-action-book-2 )
Splits words into subwords and performs optional transformations on subword groups.
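
Because token filters run in the order they are listed, their position in the chain matters. The sketch below illustrates this with KeywordMarkerFilterFactory, which has to come before the stemmer so that protected terms are marked as keywords before stemming is attempted. The fieldType name is hypothetical, and protwords.txt is assumed to be a file in the configuration directory with one protected word per line.

<fieldType name="text_stem_example" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Marks protected terms via KeywordAttribute; must precede the stemmer -->
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <!-- Stems every token that was not marked as a keyword -->
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>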

Analyzer chain types

In Solr, the text is analyzed twice: once when it gets indexed and once when it gets queried (searched).

It's possible to define the same chain for both of these phases:

<fieldType name="text_fa" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
	<charFilter class="solr.PersianCharFilterFactory"/>
	<tokenizer class="solr.StandardTokenizerFactory"/>
	<filter class="solr.LowerCaseFilterFactory"/>
	<filter class="solr.ArabicNormalizationFilterFactory"/>
	<filter class="solr.PersianNormalizationFilterFactory"/>
	<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_fa.txt" />
  </analyzer>
</fieldType>

Alternatively, the index and query chains can be different:

<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
	<tokenizer class="solr.StandardTokenizerFactory"/>
	<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
	<filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
	<tokenizer class="solr.StandardTokenizerFactory"/>
	<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
	<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
	<filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

Finally, there is a third - usually hidden - chain type, which is used for multiterm analysis (queries like term* and [term1 TO term2]). It is hidden because it is usually constructed automatically from the explicitly defined chain, using only the components that are multiterm-aware. Those components are marked with (multi) in the lists above. The primary use case is to ensure that case-insensitive matches work as expected even when wildcards are used. You can read a more complete explanation in the Solr Wiki.

To define it explicitly, add an <analyzer type="multiterm"> section next to the index and query sections in the analyzer chain definition.
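
A minimal sketch of what an explicit multiterm section could look like. The fieldType name is hypothetical, and the choice of KeywordTokenizerFactory rests on the assumption that a wildcard or range term should be kept as a single token; only components marked (multi) are used in the multiterm chain.

<fieldType name="text_multiterm_example" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="multiterm">
    <!-- Keep the partial term whole, then lowercase it so Term* can match term... -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>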

Short Names

Note that most of the analyzer, tokenizer and filter factories can be referenced by a short name, such as:
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
Only non-core components require the full class name, including the package name.
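
For comparison, the fragment below puts the two forms side by side inside an analyzer. The class com.example.analysis.MyCustomFilterFactory is purely hypothetical and stands in for any custom component that has to be referenced by its full name.

<analyzer>
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <!-- Short name with the solr. prefix -->
  <filter class="solr.LowerCaseFilterFactory"/>
  <!-- Full class name, including the package (hypothetical custom class) -->
  <filter class="com.example.analysis.MyCustomFilterFactory"/>
</analyzer>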


Previous versions of this document

You can also find archive versions of this document for version 6.0.0, version 5.5.0, version 5.0.0, version 4.10.1, and version 4.7.0
