| Interface Summary | |
|---|---|
| JCCTokenizerConstants | |
| Class Summary | |
|---|---|
| HandyTokenizer | A helper class to tokenize strings and return an iterator for the results. |
| JCCTokenizer | |
| JCCTokenizerTokenManager | |
| RateTest | Test indexing rate. |
| SimpleCharStream | An implementation of interface CharStream, where the stream is assumed to contain only ASCII characters (without unicode processing). |
| Test | |
| Token | Describes the input token stream. |
| Tokenizer | |
| UniversalTokenizer | A class for tokenizing text in any language and mixed language material. |
| Exception Summary | |
|---|---|
| ParseException | This exception is thrown when parse errors are encountered. |
| Error Summary | |
|---|---|
| TokenMgrError | |
Provides two implementations of tokenization for character streams.
The tokenization package contains two distinct tokenizers. They are invoked
as part of the indexing pipeline to break strings of text into distinct
tokens. They are also used to parse the terms entered in queries, so that
query terms are broken into the same tokens that were produced at indexing
time. The UniversalTokenizer is a hand-written tokenizer that can switch
seamlessly, within a single document, between tokenizing
whitespace-separated languages (e.g. English) and producing CJK-style
bigrams. The JCCTokenizer is a reimplementation of the UniversalTokenizer
that uses JavaCC to generate the tokenizer. There are still some cases that
the UniversalTokenizer handles more accurately than the JCCTokenizer.
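
The idea of switching between whitespace-separated words and CJK-style bigrams can be illustrated with a minimal sketch. This is not the code of UniversalTokenizer or JCCTokenizer: the class name `MixedTokenizerSketch`, its methods, and the choice of Unicode scripts treated as CJK are all assumptions made for illustration, and the real tokenizers may differ in how they treat punctuation, case, and single-character runs.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of this package's API.
class MixedTokenizerSketch {

    // Assumption: these four scripts are the ones treated as "CJK-style".
    static boolean isCjk(int cp) {
        Character.UnicodeScript s = Character.UnicodeScript.of(cp);
        return s == Character.UnicodeScript.HAN
            || s == Character.UnicodeScript.HIRAGANA
            || s == Character.UnicodeScript.KATAKANA
            || s == Character.UnicodeScript.HANGUL;
    }

    // Splits text into whole words for non-CJK letters and digits, and into
    // overlapping two-character tokens (bigrams) for runs of CJK characters.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder word = new StringBuilder();   // current non-CJK word
        List<String> cjkRun = new ArrayList<>();    // current run of CJK characters
        int i = 0;
        while (i < text.length()) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (isCjk(cp)) {
                flushWord(word, tokens);
                cjkRun.add(new String(Character.toChars(cp)));
            } else if (Character.isLetterOrDigit(cp)) {
                flushCjk(cjkRun, tokens);
                word.appendCodePoint(cp);
            } else {
                // Whitespace and punctuation end whatever is in progress.
                flushWord(word, tokens);
                flushCjk(cjkRun, tokens);
            }
        }
        flushWord(word, tokens);
        flushCjk(cjkRun, tokens);
        return tokens;
    }

    private static void flushWord(StringBuilder word, List<String> tokens) {
        if (word.length() > 0) {
            tokens.add(word.toString());
            word.setLength(0);
        }
    }

    private static void flushCjk(List<String> run, List<String> tokens) {
        // A lone CJK character is emitted by itself (an assumption of this sketch).
        if (run.size() == 1) {
            tokens.add(run.get(0));
        }
        for (int k = 0; k + 1 < run.size(); k++) {
            tokens.add(run.get(k) + run.get(k + 1));
        }
        run.clear();
    }
}
```

Under these assumptions, calling `MixedTokenizerSketch.tokenize` on an English word followed by a run of four Han characters yields the word as one token and then three overlapping bigrams. Running the same routine over both documents and query strings is what keeps query terms aligned with the tokens produced at indexing time.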