public class HMMChineseTokenizer extends SegmentingTokenizerBase
This tokenizer uses probabilistic knowledge to find the optimal word segmentation for Simplified Chinese text. The text is first broken into sentences, then each sentence is segmented into words.
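To make "optimal word segmentation" concrete, here is a minimal sketch of probability-driven segmentation for a single sentence. It is not the model this class uses: the lexicon, its log-probabilities, the `UNKNOWN` penalty, and the class name `UnigramSegmenter` are all hypothetical, and the scoring is a simple unigram dynamic program rather than a hidden Markov model.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

public class UnigramSegmenter {
    // Hypothetical toy lexicon: word -> log-probability (illustrative values only).
    static final Map<String, Double> LEXICON = Map.of(
            "我", -3.0, "喜欢", -4.0, "喜", -7.0, "欢", -7.0,
            "北京", -4.0, "北", -6.0, "京", -6.0);
    // Assumed penalty for a single character missing from the lexicon.
    static final double UNKNOWN = -12.0;
    static final int MAX_WORD_LEN = 4;

    // Returns the highest-scoring split of one sentence into words.
    // Works on UTF-16 chars, so supplementary characters are not handled.
    public static List<String> words(String sentence) {
        int n = sentence.length();
        double[] best = new double[n + 1]; // best[i]: top score for the prefix of length i
        int[] back = new int[n + 1];       // back[i]: start index of the last word in that split
        for (int end = 1; end <= n; end++) {
            best[end] = Double.NEGATIVE_INFINITY;
            for (int begin = Math.max(0, end - MAX_WORD_LEN); begin < end; begin++) {
                String candidate = sentence.substring(begin, end);
                Double logProb = LEXICON.get(candidate);
                if (logProb == null) {
                    if (candidate.length() > 1) continue; // unknown multi-char strings are not words
                    logProb = UNKNOWN;                    // unknown single chars stay possible
                }
                double score = best[begin] + logProb;
                if (score > best[end]) {
                    best[end] = score;
                    back[end] = begin;
                }
            }
        }
        // Walk the back-pointers from the end of the sentence to recover the words.
        List<String> out = new ArrayList<>();
        for (int end = n; end > 0; end = back[end]) {
            out.add(sentence.substring(back[end], end));
        }
        Collections.reverse(out);
        return out;
    }
}
```

With this toy lexicon, `UnigramSegmenter.words("我喜欢北京")` yields the words 我, 喜欢, 北京, because the two-character entries score higher than their single-character splits.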
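Nested classes/interfaces inherited from class AttributeSource: AttributeSource.State
Fields inherited from class SegmentingTokenizerBase: buffer, BUFFERMAX, offset
Fields inherited from class TokenStream: DEFAULT_TOKEN_ATTRIBUTE_FACTORY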
Constructor and Description |
---|
HMMChineseTokenizer() Creates a new HMMChineseTokenizer. |
HMMChineseTokenizer(AttributeFactory factory) Creates a new HMMChineseTokenizer, supplying the AttributeFactory. |
Modifier and Type | Method and Description |
---|---|
protected boolean | incrementWord() Returns true if another word is available. |
void | reset() This method is called by a consumer before it begins consumption using TokenStream.incrementToken(). |
protected void | setNextSentence(int sentenceStart, int sentenceEnd) Provides the next input sentence for analysis. |
Methods inherited from class SegmentingTokenizerBase: end, incrementToken, isSafeEnd
Methods inherited from class Tokenizer: close, correctOffset, setReader
Methods inherited from class AttributeSource: addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, endAttributes, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, removeAllAttributes, restoreState, toString
public HMMChineseTokenizer()
public HMMChineseTokenizer(AttributeFactory factory)
protected void setNextSentence(int sentenceStart, int sentenceEnd)
Description copied from class: SegmentingTokenizerBase
Provides the next input sentence for analysis.
Specified by: setNextSentence in class SegmentingTokenizerBase
protected boolean incrementWord()
Description copied from class: SegmentingTokenizerBase
Returns true if another word is available.
Specified by: incrementWord in class SegmentingTokenizerBase
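The two protected hooks work as a pair: the base class hands over one sentence range at a time, then keeps asking for words until none remain. The following sketch models that contract in plain Java. `SegmentingBase`, `WhitespaceWords`, `tokenize()`, and `currentWord()` are hypothetical names for illustration, not Lucene classes or methods, and whitespace splitting stands in for real word segmentation.

```java
import java.util.ArrayList;
import java.util.List;

public class SentenceHookSketch {
    // Hypothetical base class modelling the two hooks; not Lucene's SegmentingTokenizerBase.
    public abstract static class SegmentingBase {
        protected final char[] buffer;

        protected SegmentingBase(String text) {
            this.buffer = text.toCharArray();
        }

        // Called once per sentence with offsets into buffer (end is exclusive).
        protected abstract void setNextSentence(int sentenceStart, int sentenceEnd);

        // Called repeatedly; returns true while the current sentence has another word.
        protected abstract boolean incrementWord();

        // Assumed accessor for the word found by the last successful incrementWord().
        protected abstract String currentWord();

        // Driver: finds sentence boundaries, then delegates word finding to the hooks.
        public List<String> tokenize() {
            List<String> out = new ArrayList<>();
            int start = 0;
            for (int i = 0; i <= buffer.length; i++) {
                boolean atEnd = (i == buffer.length);
                if (atEnd || buffer[i] == '.' || buffer[i] == '。') {
                    if (i > start) {
                        setNextSentence(start, i);
                        while (incrementWord()) {
                            out.add(currentWord());
                        }
                    }
                    start = i + 1;
                }
            }
            return out;
        }
    }

    // Stand-in subclass: splits each sentence on whitespace.
    public static class WhitespaceWords extends SegmentingBase {
        private int pos;
        private int end;
        private String word;

        public WhitespaceWords(String text) {
            super(text);
        }

        @Override
        protected void setNextSentence(int sentenceStart, int sentenceEnd) {
            pos = sentenceStart;
            end = sentenceEnd;
        }

        @Override
        protected boolean incrementWord() {
            while (pos < end && Character.isWhitespace(buffer[pos])) pos++;
            if (pos >= end) return false;
            int begin = pos;
            while (pos < end && !Character.isWhitespace(buffer[pos])) pos++;
            word = new String(buffer, begin, pos - begin);
            return true;
        }

        @Override
        protected String currentWord() {
            return word;
        }
    }
}
```

In this model a subclass only tracks its position within the current sentence; sentence detection and the outer loop stay in the base class.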
public void reset() throws IOException
Description copied from class: TokenStream
This method is called by a consumer before it begins consumption using TokenStream.incrementToken().
Resets this stream to a clean state. Stateful implementations must implement this method so that they can be reused, just as if they had been created fresh.
If you override this method, always call super.reset(); otherwise some internal state will not be correctly reset (e.g., Tokenizer will throw IllegalStateException on further usage).
Overrides: reset in class SegmentingTokenizerBase
Throws: IOException
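The reset contract is easiest to see in a small model. The sketch below is not Lucene code: `BaseStream`, `WordStream`, `setWords`, `term`, and `consume` are hypothetical names, and the `ready` flag is a simplified stand-in for whatever internal state the real base classes track. It shows an override that calls `super.reset()` first and then clears its own state, so the same instance can be consumed more than once.

```java
import java.util.List;

public class ResetContractSketch {
    // Hypothetical stand-in for a base tokenizer; not Lucene's Tokenizer.
    public static class BaseStream {
        private boolean ready = false;

        public void reset() {
            ready = true; // base-class state that subclasses cannot set themselves
        }

        public void close() {
            ready = false;
        }

        protected void ensureReady() {
            if (!ready) {
                throw new IllegalStateException("reset() was not called before consumption");
            }
        }
    }

    // Stateful subclass that can be reused after each reset().
    public static class WordStream extends BaseStream {
        private List<String> words = List.of();
        private int pos = 0;
        private String current;

        public void setWords(List<String> words) {
            this.words = words;
        }

        @Override
        public void reset() {
            super.reset(); // skipping this would leave the base class "not ready"
            pos = 0;       // then clear this class's own state
            current = null;
        }

        public boolean incrementToken() {
            ensureReady();
            if (pos >= words.size()) return false;
            current = words.get(pos++);
            return true;
        }

        public String term() {
            return current;
        }
    }

    // Consumer workflow in this model: reset, iterate, close.
    public static String consume(WordStream stream) {
        StringBuilder sb = new StringBuilder();
        stream.reset();
        while (stream.incrementToken()) {
            if (sb.length() > 0) sb.append('|');
            sb.append(stream.term());
        }
        stream.close();
        return sb.toString();
    }
}
```

In this model, calling `incrementToken()` on a stream that was never reset, or that was closed and not reset again, throws `IllegalStateException`, which mirrors the failure mode the description warns about.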