Package com.sun.labs.minion.document.tokenizer

Provides two implementations of tokenization for character streams.


Interface Summary
JCCTokenizerConstants  
 

Class Summary
HandyTokenizer A helper class to tokenize strings and return an iterator for the results.
JCCTokenizer  
JCCTokenizerTokenManager  
RateTest Test indexing rate.
SimpleCharStream An implementation of interface CharStream, where the stream is assumed to contain only ASCII characters (without Unicode processing).
Test  
Token Describes the input token stream.
Tokenizer  
UniversalTokenizer A class for tokenizing text in any language and mixed language material.
 

Exception Summary
ParseException This exception is thrown when parse errors are encountered.
 

Error Summary
TokenMgrError  
 

Package com.sun.labs.minion.document.tokenizer Description

Provides two implementations of tokenization for character streams.

The tokenization package contains two distinct tokenizers. Both are invoked as part of the indexing pipeline to break strings of text into tokens. They are also used at query time, so that the terms entered into a query are broken into the same tokens they would have produced at indexing time. The UniversalTokenizer is a hand-written tokenizer that can switch seamlessly between tokenizing whitespace-separated languages (e.g., English) and CJK-style bigram text within a single document. The JCCTokenizer is a reimplementation of the UniversalTokenizer whose tokenizer code is generated by JavaCC. There are still some cases that the UniversalTokenizer handles more accurately than the JCCTokenizer.
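
To illustrate what switching between whitespace-separated and CJK-style bigram tokenization can look like, here is a minimal, self-contained sketch. It is not the UniversalTokenizer or JCCTokenizer code and does not use this package's API: the class MixedTokenizerSketch and its methods are hypothetical, and the choices it makes (treating Han, Hiragana, and Katakana as bigram scripts, emitting overlapping bigrams, emitting a lone CJK character as its own token) are assumptions made only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration only; not part of com.sun.labs.minion. */
final class MixedTokenizerSketch {

    /** Assumption: only these scripts are treated as CJK-style bigram text. */
    static boolean isCjk(int cp) {
        Character.UnicodeScript s = Character.UnicodeScript.of(cp);
        return s == Character.UnicodeScript.HAN
                || s == Character.UnicodeScript.HIRAGANA
                || s == Character.UnicodeScript.KATAKANA;
    }

    /** Splits text into words for non-CJK runs and overlapping bigrams for CJK runs. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder word = new StringBuilder();   // current non-CJK word
        List<Integer> run = new ArrayList<>();      // current run of CJK code points
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (isCjk(cp)) {
                flushWord(word, tokens);            // a CJK character ends any open word
                run.add(cp);
            } else {
                flushRun(run, tokens);              // a non-CJK character ends any CJK run
                if (Character.isLetterOrDigit(cp)) {
                    word.appendCodePoint(cp);
                } else {
                    flushWord(word, tokens);        // whitespace or punctuation ends the word
                }
            }
        }
        flushWord(word, tokens);
        flushRun(run, tokens);
        return tokens;
    }

    private static void flushWord(StringBuilder word, List<String> out) {
        if (word.length() > 0) {
            out.add(word.toString());
            word.setLength(0);
        }
    }

    private static void flushRun(List<Integer> run, List<String> out) {
        if (run.size() == 1) {
            // A single CJK character has no neighbour to pair with; emit it alone.
            out.add(new String(Character.toChars(run.get(0))));
        }
        for (int i = 0; i + 1 < run.size(); i++) {
            out.add(new StringBuilder()
                    .appendCodePoint(run.get(i))
                    .appendCodePoint(run.get(i + 1))
                    .toString());
        }
        run.clear();
    }
}
```

Under these assumptions, calling MixedTokenizerSketch.tokenize on a string that mixes English words with a three-character Han run yields the English words as whole tokens and the Han run as two overlapping bigrams. Running the same routine over query terms is what keeps query-time tokens aligned with index-time tokens, as described above.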