org.apache.lucene.analysis.standard

Class UAX29URLEmailTokenizer

  • All Implemented Interfaces:
    Closeable, AutoCloseable


    public final class UAX29URLEmailTokenizer
    extends Tokenizer
    This class implements the Word Break rules from the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29. URLs and email addresses are also tokenized according to the relevant RFCs.
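
    The description above can be illustrated with a minimal usage sketch. It assumes a Lucene release in which this class lives in the package named above and has a no-argument constructor (roughly 5.x through 8.x), with the analysis JAR on the classpath. The class name UAX29URLEmailTokenizerExample and the helper tokenize are hypothetical names introduced only for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.standard.UAX29URLEmailTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;

// Hypothetical example class; not part of Lucene.
public class UAX29URLEmailTokenizerExample {

    /** Tokenizes the text and returns one "type: term" string per token. */
    public static List<String> tokenize(String text) throws IOException {
        List<String> result = new ArrayList<>();
        // Assumes the no-argument constructor (Lucene 5.x and later).
        try (UAX29URLEmailTokenizer tokenizer = new UAX29URLEmailTokenizer()) {
            tokenizer.setReader(new StringReader(text));
            CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
            TypeAttribute type = tokenizer.addAttribute(TypeAttribute.class);

            // TokenStream contract: reset(), then incrementToken() until false, then end().
            tokenizer.reset();
            while (tokenizer.incrementToken()) {
                result.add(type.type() + ": " + term.toString());
            }
            tokenizer.end();
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        String text = "Mail bob@example.com or visit http://lucene.apache.org today";
        for (String line : tokenize(text)) {
            System.out.println(line);
        }
    }
}
```

    With this input, the email address and the URL should each come back as a single token of type <EMAIL> and <URL> respectively, while the remaining words are reported as <ALPHANUM>. The tokenizer does not lowercase or otherwise normalize the text; that is normally left to token filters further down the analysis chain.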

    Tokens produced are of the following types:

    • <ALPHANUM>: A sequence of alphabetic and numeric characters
    • <NUM>: A number
    • <URL>: A URL
    • <EMAIL>: An email address
    • <SOUTHEAST_ASIAN>: A sequence of characters from South and Southeast Asian languages, including Thai, Lao, Myanmar, and Khmer
    • <IDEOGRAPHIC>: A single CJKV ideographic character
    • <HIRAGANA>: A single hiragana character