org.apache.lucene.analysis.icu.segmentation

Class DefaultICUTokenizerConfig



  • public class DefaultICUTokenizerConfig
    extends ICUTokenizerConfig
    Default ICUTokenizerConfig that is generally applicable to many languages.

    Tokenizes Unicode text according to UAX#29 (BreakIterator.getWordInstance(ULocale.ROOT)), with the following tailoring:

    • Thai, Lao, Myanmar, Khmer, and CJK text is broken into words with a dictionary.
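
    The UAX#29-style word breaking described above can be sketched with the JDK alone. This is a minimal illustration, not this class's implementation: it uses java.text.BreakIterator as a stand-in for the ICU BreakIterator named in the description, and the class name WordBreakSketch and its words method are hypothetical. The JDK iterator does not apply the same dictionary tailoring for Thai, Lao, Myanmar, Khmer, or CJK text.

    ```java
    import java.text.BreakIterator;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Locale;

    // Hypothetical helper: JDK stand-in for the ICU word break iterator.
    class WordBreakSketch {

        // Returns the spans between word boundaries that contain at least
        // one letter or digit, dropping whitespace and punctuation spans.
        static List<String> words(String text) {
            BreakIterator it = BreakIterator.getWordInstance(Locale.ROOT);
            it.setText(text);
            List<String> out = new ArrayList<>();
            int start = it.first();
            for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
                String span = text.substring(start, end);
                if (span.codePoints().anyMatch(Character::isLetterOrDigit)) {
                    out.add(span);
                }
            }
            return out;
        }

        public static void main(String[] args) {
            System.out.println(words("Hello, world 42"));
        }
    }
    ```

    The boundary iterator reports every boundary, including those around spaces and punctuation, so the sketch filters the spans itself; a real tokenizer would instead consult the iterator's rule status.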

    • Field Detail

      • WORD_IDEO

        public static final String WORD_IDEO
        Token type for words containing ideographic characters
      • WORD_HIRAGANA

        public static final String WORD_HIRAGANA
        Token type for words containing Japanese hiragana
      • WORD_KATAKANA

        public static final String WORD_KATAKANA
        Token type for words containing Japanese katakana
      • WORD_HANGUL

        public static final String WORD_HANGUL
        Token type for words containing Korean hangul
      • WORD_LETTER

        public static final String WORD_LETTER
        Token type for words that contain letters
      • WORD_NUMBER

        public static final String WORD_NUMBER
        Token type for words that appear to be numbers
    • Constructor Detail

      • DefaultICUTokenizerConfig

        public DefaultICUTokenizerConfig(boolean cjkAsWords,
                                         boolean myanmarAsWords)
        Creates a new config. The object itself is lightweight, but the break iterators are initialized the first time the class is referenced.
        Parameters:
        cjkAsWords - true if CJK text should undergo dictionary-based segmentation; otherwise the text is segmented according to UAX#29 defaults. If true, all Han, Hiragana, and Katakana words are tagged as IDEOGRAPHIC.
        myanmarAsWords - true if Myanmar text should undergo dictionary-based segmentation; otherwise it is tokenized as syllables.
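
        The effect of cjkAsWords on token types can be modeled with a small JDK-only sketch. It is an assumption-laden illustration, not the library's logic: the class TokenTypeSketch, its Type enum, and the typeFor method are hypothetical, the enum labels are not the values of the WORD_* constants, and the sketch judges a token by its first code point only.

        ```java
        import java.lang.Character.UnicodeScript;

        // Hypothetical model of how cjkAsWords changes the reported token type.
        class TokenTypeSketch {

            // Illustrative labels only; not the actual WORD_* constant values.
            enum Type { IDEOGRAPHIC, HIRAGANA, KATAKANA, HANGUL, LETTER, NUMBER, OTHER }

            // Classifies a token by the script of its first code point.
            // With cjkAsWords set, Han, Hiragana, and Katakana all map to
            // IDEOGRAPHIC, as the parameter description states.
            static Type typeFor(String token, boolean cjkAsWords) {
                if (token.isEmpty()) {
                    return Type.OTHER;
                }
                int cp = token.codePointAt(0);
                if (Character.isDigit(cp)) {
                    return Type.NUMBER;
                }
                if (!Character.isLetter(cp)) {
                    return Type.OTHER;
                }
                switch (UnicodeScript.of(cp)) {
                    case HAN:
                        return Type.IDEOGRAPHIC;
                    case HIRAGANA:
                        return cjkAsWords ? Type.IDEOGRAPHIC : Type.HIRAGANA;
                    case KATAKANA:
                        return cjkAsWords ? Type.IDEOGRAPHIC : Type.KATAKANA;
                    case HANGUL:
                        return Type.HANGUL;
                    default:
                        return Type.LETTER;
                }
            }

            public static void main(String[] args) {
                System.out.println(typeFor("ひらがな", false));
                System.out.println(typeFor("ひらがな", true));
            }
        }
        ```

        Code that filters on token type should therefore expect the hiragana and katakana types only when cjkAsWords is false.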