org.apache.lucene.analysis

Class Token

  • All Implemented Interfaces:
    Appendable, CharSequence, Cloneable, CharTermAttribute, FlagsAttribute, OffsetAttribute, PayloadAttribute, PositionIncrementAttribute, PositionLengthAttribute, TermToBytesRefAttribute, TypeAttribute, Attribute

    Deprecated. 
    This class is outdated and no longer used since Lucene 2.9. Nuke it finally!

    @Deprecated
    public class Token
    extends PackedTokenAttributeImpl
    implements FlagsAttribute, PayloadAttribute
    A Token is an occurrence of a term from the text of a field. It consists of a term's text, the start and end offset of the term in the text of the field, and a type string.

    The start and end offsets permit applications to re-associate a token with its source text, e.g., to display highlighted query terms in a document browser, or to show matching text fragments in a KWIC display, etc.
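
    The re-association described above can be sketched without any Lucene dependency. The example below is an illustration only: SimpleToken and highlight are hypothetical names, not part of the Lucene API. It assumes the end offset is exclusive (one past the last character) and that the matching tokens arrive sorted by offset and do not overlap.

```java
import java.util.Arrays;
import java.util.List;

public class OffsetHighlightSketch {

    /** Hypothetical stand-in for a token: term text plus character offsets into the source. */
    static final class SimpleToken {
        final String term;
        final int startOffset; // index of the first character in the source text
        final int endOffset;   // assumed exclusive: one past the last character

        SimpleToken(String term, int startOffset, int endOffset) {
            this.term = term;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
        }
    }

    /**
     * Wraps each matching span of the source text in square brackets.
     * The source slice is used rather than the term text, because an
     * analyzer may have changed the term (for example by lowercasing it).
     */
    static String highlight(String source, List<SimpleToken> matches) {
        StringBuilder out = new StringBuilder();
        int cursor = 0;
        for (SimpleToken t : matches) {
            out.append(source, cursor, t.startOffset);   // text before the match
            out.append('[')
               .append(source, t.startOffset, t.endOffset) // the matched span itself
               .append(']');
            cursor = t.endOffset;
        }
        out.append(source, cursor, source.length());     // remaining tail
        return out.toString();
    }

    public static void main(String[] args) {
        String source = "The Quick brown fox";
        List<SimpleToken> matches = Arrays.asList(
                new SimpleToken("quick", 4, 9),
                new SimpleToken("fox", 16, 19));
        System.out.println(highlight(source, matches));
    }
}
```

    Because the offsets index into the original field text, the highlighted output keeps the source's capitalization even though the term itself was lowercased.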

    The type is a string, assigned by a lexical analyzer (a.k.a. tokenizer), naming the lexical or syntactic class that the token belongs to. For example, an end-of-sentence marker token might be implemented with type "eos". The default token type is "word".

    A Token can optionally have metadata (a.k.a. payload) in the form of a variable length byte array. Use PostingsEnum.getPayload() to retrieve the payloads from the index.

    NOTE: As of 2.9, Token implements all Attribute interfaces that are part of core Lucene and can be found in the tokenattributes subpackage. Although it is no longer necessary to use Token with the new TokenStream API, it can serve as a convenience class that implements all Attributes, which is especially useful when switching from the old to the new TokenStream API. A few things to note:

    • clear() initializes all of the fields to default values. This behavior differs from Lucene 2.4, but the change should affect no one.
    • Because TokenStreams can be chained, one cannot assume that the Token's current type is correct.
    • The startOffset and endOffset represent the start and end offsets in the source text, so be careful when adjusting them.
    • When caching a reusable token, clone it. When injecting a cached token into a stream that can be reset, clone it again.
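
    The caching advice in the last bullet can be illustrated with a small, Lucene-free sketch. ReusableToken and cacheTerms are hypothetical names, not Lucene classes; the sketch only assumes that a producer overwrites one shared, mutable token instance on every step, which is the situation the bullet warns about.

```java
import java.util.ArrayList;
import java.util.List;

public class TokenCachingSketch {

    /** Hypothetical mutable token that a producer reuses for every term. */
    static final class ReusableToken implements Cloneable {
        String term = "";

        @Override
        public ReusableToken clone() {
            try {
                return (ReusableToken) super.clone();
            } catch (CloneNotSupportedException e) {
                throw new AssertionError(e); // cannot happen: the class is Cloneable
            }
        }
    }

    /**
     * Simulates a producer that overwrites one shared token per term,
     * caches each token, and then reports the terms found in the cache.
     */
    static List<String> cacheTerms(String[] terms, boolean cloneBeforeCaching) {
        ReusableToken reusable = new ReusableToken();
        List<ReusableToken> cache = new ArrayList<>();
        for (String t : terms) {
            reusable.term = t; // the producer mutates the shared instance
            cache.add(cloneBeforeCaching ? reusable.clone() : reusable);
        }
        List<String> seen = new ArrayList<>();
        for (ReusableToken tok : cache) {
            seen.add(tok.term);
        }
        return seen;
    }

    public static void main(String[] args) {
        String[] terms = {"a", "b", "c"};
        System.out.println(cacheTerms(terms, false)); // every entry aliases the same object
        System.out.println(cacheTerms(terms, true));  // each entry is an independent copy
    }
}
```

    Without cloning, every cached entry refers to the same object and reflects only the last term written. The second clone mentioned in the bullet follows the same reasoning: a consumer further down the chain may modify the token it receives, so handing out a copy keeps the cached original intact for the next pass after a reset.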

    Please note: With Lucene 3.1, the toString() method had to be changed to match the CharSequence interface, which Token inherits through CharTermAttribute. The method now returns only the term text, without any additional information.