java.lang.Object
  com.sun.labs.minion.pipeline.StageAdapter
    com.sun.labs.minion.document.tokenizer.Tokenizer
      com.sun.labs.minion.document.tokenizer.UniversalTokenizer
public class UniversalTokenizer
A class for tokenizing text in any language, including mixed-language material. This particular subclass is meant for special use in synchronous pipelines, where all stages run in the same thread. This leads to a lot of (almost total) code duplication between tokenizers, but there doesn't appear to be a clean way to avoid it without having a conditional statement for each token we want to add.

This tokenizer uses bigram tokenization whenever it encounters characters in the CJK range (Chinese, Japanese, and Korean), and otherwise uses punctuation and whitespace separators to separate words. Within runs of CJK characters, the tokenizer ignores end-of-lines and similar characters rather than treating them as white space, and it generates sequences of overlapping ngrams of length one and two for every character and every character pair in the sequence. At the beginning of such a sequence it generates a null+char transition pair, and at the end it generates a char+null transition pair. For example, a sequence of three Chinese characters, XYZ, would tokenize as 0X, X, XY, Y, YZ, Z, Z0, representing the fact that the sequence starts with X, ends with Z, and contains the other listed characters and pairs in the listed order. In reporting beginning and ending positions for these tokens, the single-character tokens begin at the position of the character and end one position later, while the bigram (pair) tokens have zero length, beginning and ending at the position of their second character (i.e., the point in between the two characters).

This tokenizer also has provisions for running in either a smart or a simple mode, depending on the setting of the static final flag simpleFlag. When simpleFlag is true, any punctuation character causes a word break and, if sendPunct is true, generates a punctuation token. When simpleFlag is false, the tokenizer uses specialized knowledge to tokenize as whole words many common notations that include punctuation marks, such as: "3:00", "http://www.sun.com", "william.woods@sun.com", "A&P", "U.S.", "5/15/02", "1,024", "3.1415", etc.

There are also provisions for tracing the behavior when the static final boolean authorFlag is true. This is useful when making modifications to the tokenization logic, or for understanding the behavior of the tokenizer in detail. When authorFlag is true, the public variable traceFlag can be used to turn tracing on and off.

Note that the tokens generated by the tokenizer use beginning and ending position conventions similar to substrings in a string: they specify the position of the first character in the token and the character position just beyond the last character of the token. This differs from the current convention for the pipeline events that the tokenizer handles, in which the ending position is the character position of the last character of the event. (?? Should this convention be changed?)
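The overlapping unigram and bigram scheme described above can be illustrated with a small standalone sketch. This is not the tokenizer's actual code: the class `CjkBigramSketch`, its `Token` type, and the use of the character `0` as the null boundary marker are all assumptions made for illustration, and the positions assigned to the opening and closing transition pairs follow one reading of the "position of their second character" rule.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of the Minion API.
class CjkBigramSketch {
    // Stand-in for the null boundary marker, written as 0 in the description.
    static final char BOUNDARY = '0';

    /** A token with substring-style positions: start inclusive, end exclusive. */
    static final class Token {
        final String text;
        final int start;
        final int end;

        Token(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return text + "[" + start + "," + end + ")";
        }
    }

    /**
     * Tokenizes one run of CJK characters starting at document position offset.
     * When noUnigrams is true, only the pair tokens are produced.
     */
    static List<Token> tokenize(String run, int offset, boolean noUnigrams) {
        List<Token> out = new ArrayList<>();
        if (run.isEmpty()) {
            return out;
        }
        char prev = BOUNDARY;
        for (int i = 0; i < run.length(); i++) {
            char c = run.charAt(i);
            int pos = offset + i;
            // Pair token: zero length, placed at the position of its second character.
            out.add(new Token("" + prev + c, pos, pos));
            if (!noUnigrams) {
                // Single-character token: begins at the character, ends one position later.
                out.add(new Token(String.valueOf(c), pos, pos + 1));
            }
            prev = c;
        }
        // Closing char+null pair, placed just past the last character (an assumption).
        int endPos = offset + run.length();
        out.add(new Token("" + prev + BOUNDARY, endPos, endPos));
        return out;
    }

    public static void main(String[] args) {
        for (Token t : tokenize("XYZ", 0, false)) {
            System.out.println(t);
        }
    }
}
```

For the run XYZ the sketch yields the token texts 0X, X, XY, Y, YZ, Z, Z0 in that order, matching the example in the description. Passing `true` for `noUnigrams` mimics what the noUnigramsFlag field is documented to do, leaving only the pair tokens.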
Field Summary | |
---|---|
protected boolean | continueAsianFlag |
static boolean | excludeKanaFlag |
protected static java.lang.String | logTag |
protected int | ngramLength |
java.lang.String | noBreakCharacters - Characters that should not cause breaks when simpleFlag is true, even though they may be punctuation characters. |
boolean | noUnigramsFlag - Blocks generation of unigram characters in between character bigrams in runs of Asian characters. |
protected char[] | nullString |
protected boolean | redundant |
protected boolean | resumeAsianFlag |
protected int | tokenLength |
protected java.lang.String | tokenString |
boolean | traceFlag |
protected java.lang.String | transitionString |
Fields inherited from class com.sun.labs.minion.document.tokenizer.Tokenizer |
---|
dataSaved, indexed, makeTokens, maxTokLen, PROP_SEND_PUNCT, PROP_SEND_WHITE, saveData, savedData, savedLen, sendPunct, sendWhite, trimSpaces, wordNum |
Fields inherited from class com.sun.labs.minion.pipeline.StageAdapter |
---|
downstream, name |
Constructor Summary | |
---|---|
UniversalTokenizer() | |
UniversalTokenizer(Stage s) | Create a tokenizer that will send its output to the given Stage. |
UniversalTokenizer(Stage s, boolean sp) | Create a tokenizer that will send its output to the given Stage and generate tokens for punctuation if boolean sp (for "sendPunct") is true. |
UniversalTokenizer(Stage s, boolean sp, boolean nuf) | Create a tokenizer that will send its output to the given Stage and generate tokens for punctuation if boolean sp (for "sendPunct") is true. |
UniversalTokenizer(Stage s, boolean sp, boolean nuf, boolean sendWhite) | Create a tokenizer that will send its output to the given Stage and generate tokens for punctuation if boolean sp (for "sendPunct") is true. |
Method Summary | |
---|---|
protected void | addChar() - Add a character to the buffer that we're building for a token. |
protected boolean | checkInitialPunc(char c) - Determine whether char is acceptable as an initial char of a token. |
protected boolean | checkTrailingPunc(char c) - Determine whether char is acceptable as a final char of a token. |
void | flush() - Finish any final token left in the buffer. |
Tokenizer | getTokenizer(Stage s, boolean sp) - A factory method to get a tokenizer. |
protected void | handleChar() - Handle a character to add to the token buffer. |
void | handleLongChar(char c, int b, int l) - Handles a character that takes up more than one character in a file. |
static boolean | isAsian(char c) - A quick check for an Asian character or a character in a language that may not separate words with whitespace (includes Arabic, CJK, and Thai). |
protected boolean | isBreakingEvent(int type, int subType) - Will the given event break a token? |
static boolean | isDigit(char c) - A quick check for whether a character is a digit. |
static boolean | isLetterOrDigit(char c) - A quick check for whether a character should be kept in a word or should be removed from the word if it occurs at one of the ends. |
static boolean | isWhitespace(char c) - A quick check for whether a character is whitespace. |
static void | main(java.lang.String[] args) |
protected void | mkToken() - Break our collected text into as many as three pieces. |
void | reset() - Reset state of tokenizer to clean slate. |
static java.lang.String | showCodes(java.lang.String charString) - A function for viewing a string containing nonprintable ASCII and Unicode characters. |
void | text(char[] text, int b, int e) - Handle text passed to us by the markup analyzer. |
protected java.lang.String | tokenSubstring(int left, int right) - Determine token substring to generate for ngram from left to right, where left and right are not yet clamped by the ends of tokenString. |
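The static character checks listed in the method summary can be approximated with the standard `java.lang.Character` API, as in the sketch below. The class `CharClassSketch` is hypothetical, and the set of Unicode scripts used for the Asian check is an assumption based only on the summary's mention of Arabic, CJK, and Thai; the actual character ranges used by UniversalTokenizer are not documented here and may differ.

```java
// Hypothetical approximation; the real tokenizer's character ranges may differ.
class CharClassSketch {
    static boolean isDigit(char c) {
        return Character.isDigit(c);
    }

    static boolean isWhitespace(char c) {
        return Character.isWhitespace(c);
    }

    static boolean isLetterOrDigit(char c) {
        return Character.isLetterOrDigit(c);
    }

    /** Assumed script list: Han, kana, Hangul, Thai, and Arabic. */
    static boolean isAsian(char c) {
        switch (Character.UnicodeScript.of(c)) {
            case HAN:
            case HIRAGANA:
            case KATAKANA:
            case HANGUL:
            case THAI:
            case ARABIC:
                return true;
            default:
                return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isAsian('\u4e2d')); // a Han character
        System.out.println(isAsian('a'));      // a Latin letter
    }
}
```

A script-based check is slower than a hand-written range test, which is presumably why the documented methods are described as "quick" checks; the sketch trades speed for readability.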
Methods inherited from class com.sun.labs.minion.document.tokenizer.Tokenizer |
---|
dump, endDocument, endField, getPos, handleFieldData, newProperties, reset, shutdown, startDocument, startField |
Methods inherited from class com.sun.labs.minion.pipeline.StageAdapter |
---|
defineField, getDownstream, getName, punctuation, savedData, setDownstream, setName, token |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Field Detail |
---|
public boolean noUnigramsFlag
public java.lang.String noBreakCharacters
protected boolean continueAsianFlag
protected boolean resumeAsianFlag
protected java.lang.String transitionString
protected boolean redundant
public static final boolean excludeKanaFlag
public boolean traceFlag
protected int tokenLength
protected int ngramLength
protected char[] nullString
protected java.lang.String tokenString
protected static java.lang.String logTag
Constructor Detail |
---|
public UniversalTokenizer()

public UniversalTokenizer(Stage s)
Create a tokenizer that will send its output to the given Stage.
Parameters:
s - the stage to which the output of the tokenizer will be sent.

public UniversalTokenizer(Stage s, boolean sp)
Create a tokenizer that will send its output to the given Stage and generate tokens for punctuation if boolean sp (for "sendPunct") is true.
Parameters:
s - The output stage to receive the generated tokens.
sp - Flag indicating whether to transmit punctuation.

public UniversalTokenizer(Stage s, boolean sp, boolean nuf)
Create a tokenizer that will send its output to the given Stage and generate tokens for punctuation if boolean sp (for "sendPunct") is true.
Parameters:
s - The output stage to receive the generated tokens.
sp - Flag indicating whether to transmit punctuation.
nuf - Flag indicating not to generate unigrams for Asian chars.

public UniversalTokenizer(Stage s, boolean sp, boolean nuf, boolean sendWhite)
Create a tokenizer that will send its output to the given Stage and generate tokens for punctuation if boolean sp (for "sendPunct") is true.
Parameters:
s - The output stage to receive the generated tokens.
sp - Flag indicating whether to transmit punctuation.
nuf - Flag indicating not to generate unigrams for Asian chars.
sendWhite - Flag that causes generation of punctuation tokens for runs of whitespace characters when sp is true.

Method Detail |
---|
public Tokenizer getTokenizer(Stage s, boolean sp)
getTokenizer in class Tokenizer

public void text(char[] text, int b, int e)
text in interface Stage
text in interface PipelineStage
text in class Tokenizer
Parameters:
text - The buffer containing text to break into tokens.
b - The beginning position in the text buffer.
e - The ending position in the text buffer. The position p is the character position in the file that corresponds to the character at index b in the text buffer. The index e is the index just beyond the last character of the buffer to be processed.

public void handleLongChar(char c, int b, int l)
handleLongChar in class Tokenizer
Parameters:
c - The character
b - The beginning position of the character in the document.
l - The length of the character in the document.

protected void handleChar()

protected void addChar()

public void flush()
flush in class Tokenizer

public void reset()
reset in class Tokenizer

protected boolean isBreakingEvent(int type, int subType)

protected void mkToken()

protected java.lang.String tokenSubstring(int left, int right)
Parameters:
left - The index of the first character of the ngram token.
right - The index of the position after the last character.

protected boolean checkTrailingPunc(char c)
Parameters:
c - The char to be tested.

protected boolean checkInitialPunc(char c)
Parameters:
c - The char to be tested.

public static final boolean isLetterOrDigit(char c)
Parameters:
c - The character to check.

public static final boolean isDigit(char c)
Parameters:
c - The character to check

public static final boolean isWhitespace(char c)
Parameters:
c - The character to check

public static final boolean isAsian(char c)
Parameters:
c - The character to check

public static final java.lang.String showCodes(java.lang.String charString)
Parameters:
charString - the string to convert

public static void main(java.lang.String[] args) throws java.io.IOException
Throws:
java.io.IOException
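The b and e parameters of text(char[], int, int) follow the same half-open convention the class description gives for token positions, while pipeline events are described as using the position of the last character as their end. The sketch below shows how the two conventions relate. The class `SpanConventionSketch` and its method names are hypothetical and are not part of the Minion API.

```java
// Hypothetical helpers illustrating the two end-position conventions.
class SpanConventionSketch {
    /** Extracts the characters in [b, e), where e is just beyond the last character. */
    static String slice(char[] text, int b, int e) {
        return new String(text, b, e - b);
    }

    /** Converts a substring-style (exclusive) end to a last-character (inclusive) end. */
    static int toInclusiveEnd(int exclusiveEnd) {
        return exclusiveEnd - 1;
    }

    /** Converts a last-character (inclusive) end to a substring-style (exclusive) end. */
    static int toExclusiveEnd(int inclusiveEnd) {
        return inclusiveEnd + 1;
    }

    public static void main(String[] args) {
        char[] buffer = "hello world".toCharArray();
        int b = 6;
        int e = 11; // one past the final 'd'
        System.out.println(slice(buffer, b, e));
        System.out.println(toInclusiveEnd(e));
    }
}
```

One consequence worth noting: a zero-length token, such as the bigram tokens described in the class overview, has an exclusive end equal to its start, so its inclusive end would fall one position before its start. The half-open convention represents such tokens without that awkwardness.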