com.sun.labs.minion.document.tokenizer
Class UniversalTokenizer

java.lang.Object
  extended by com.sun.labs.minion.pipeline.StageAdapter
      extended by com.sun.labs.minion.document.tokenizer.Tokenizer
          extended by com.sun.labs.minion.document.tokenizer.UniversalTokenizer
All Implemented Interfaces:
Stage, PipelineStage, com.sun.labs.util.props.Component, com.sun.labs.util.props.Configurable

public class UniversalTokenizer
extends Tokenizer

A class for tokenizing text in any language, including mixed-language material. This particular subclass is meant for special use in synchronous pipelines, where all stages run in the same thread. This leads to a lot of (almost total) code duplication between tokenizers, but there doesn't appear to be a clean way to avoid it without having a conditional statement for each token we want to add.

This tokenizer uses bigram tokenization whenever it encounters characters in the CJK range (Chinese, Japanese, and Korean), and otherwise uses punctuation and whitespace as separators between words. Within runs of CJK characters, the tokenizer ignores end-of-lines and similar characters rather than treating them as white space, and it generates sequences of overlapping ngrams of length one and two for every character and every character pair in the sequence. At the beginning of such a sequence it generates a null+char transition pair, and at the end it generates a char+null transition pair. For example, a sequence of three Chinese characters, XYZ, would tokenize as 0X, X, XY, Y, YZ, Z, Z0, representing the fact that the sequence starts with X, ends with Z, and contains the other listed characters and pairs in the listed order. In reporting beginning and ending positions for these tokens, the single-character tokens begin at the position of the character and end one position later, while the bigram (pair) tokens have zero length, beginning and ending at the position of their second character (i.e., the point in between the two characters).

This tokenizer can run in either a smart or a simple mode, depending on the setting of the static final flag simpleFlag. When simpleFlag is true, any punctuation character causes a word break and, if sendPunct is true, generates a punctuation token. When simpleFlag is false, the tokenizer uses specialized knowledge to tokenize as whole words many common notations that include punctuation marks, such as: "3:00", "http://www.sun.com", "william.woods@sun.com", "A&P", "U.S.", "5/15/02", "1,024", "3.1415", etc.

There are also provisions for tracing the tokenizer's behavior when the static final boolean authorFlag is true. This is useful when modifying the tokenization logic or when trying to understand the behavior of the tokenizer in detail. When authorFlag is true, the public variable traceFlag can be used to turn tracing on and off.

Note that the tokens generated by the tokenizer use beginning and ending position conventions similar to substrings in a string: they specify the position of the first character in the token and the character position just beyond the last character of the token. This differs from the current convention for the pipeline events that the tokenizer handles, in which the ending position is the character position of the last character of the event. (?? Should this convention be changed?)
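
The CJK bigram scheme and its position conventions can be illustrated with a small stand-alone sketch. This is not the class's actual implementation: the names CjkBigramSketch, Tok, and tokenize are hypothetical, '0' stands in for the null marker as in the XYZ example, and placing the final char+null pair at the position just past the last character is an assumption drawn from the "position of their second character" rule.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-alone sketch; not part of the Minion API.
class CjkBigramSketch {
    // Stand-in for the null marker, printed as '0' like in the XYZ example.
    static final char NULL_MARK = '0';

    // A token with substring-style positions: begin inclusive, end exclusive.
    static final class Tok {
        final String text;
        final int begin;
        final int end;

        Tok(String text, int begin, int end) {
            this.text = text;
            this.begin = begin;
            this.end = end;
        }

        @Override
        public String toString() {
            return text + "[" + begin + "," + end + ")";
        }
    }

    // Emits overlapping unigrams and bigrams for a run that starts at 'start'.
    static List<Tok> tokenize(String run, int start) {
        List<Tok> out = new ArrayList<>();
        char prev = NULL_MARK;
        for (int i = 0; i < run.length(); i++) {
            char c = run.charAt(i);
            int pos = start + i;
            // Bigram: zero length, located at the position of its second character.
            out.add(new Tok("" + prev + c, pos, pos));
            // Unigram: begins at the character and ends one position later.
            out.add(new Tok(String.valueOf(c), pos, pos + 1));
            prev = c;
        }
        if (!run.isEmpty()) {
            // Closing char+null pair (assumed to sit just past the last character).
            int end = start + run.length();
            out.add(new Tok("" + prev + NULL_MARK, end, end));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("XYZ", 0));
    }
}
```

For the run XYZ starting at position 0, the sketch yields the seven tokens in the order listed above, with each bigram reported as a zero-length span and each unigram as a span of length one.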


Field Summary
protected  boolean continueAsianFlag
           
static boolean excludeKanaFlag
           
protected static java.lang.String logTag
           
protected  int ngramLength
           
 java.lang.String noBreakCharacters
          Characters that should not cause breaks when simpleFlag is true, even though they may be punctuation characters.
 boolean noUnigramsFlag
          Blocks generation of unigram characters in between character bigrams in runs of Asian characters.
protected  char[] nullString
           
protected  boolean redundant
           
protected  boolean resumeAsianFlag
           
protected  int tokenLength
           
protected  java.lang.String tokenString
           
 boolean traceFlag
           
protected  java.lang.String transitionString
           
 
Fields inherited from class com.sun.labs.minion.document.tokenizer.Tokenizer
dataSaved, indexed, makeTokens, maxTokLen, PROP_SEND_PUNCT, PROP_SEND_WHITE, saveData, savedData, savedLen, sendPunct, sendWhite, trimSpaces, wordNum
 
Fields inherited from class com.sun.labs.minion.pipeline.StageAdapter
downstream, name
 
Constructor Summary
UniversalTokenizer()
           
UniversalTokenizer(Stage s)
          Create a tokenizer that will send its output to the given Stage.
UniversalTokenizer(Stage s, boolean sp)
          Create a tokenizer that will send its output to the given Stage and generate tokens for punctuation if boolean sp (for "sendPunct") is true.
UniversalTokenizer(Stage s, boolean sp, boolean nuf)
          Create a tokenizer that will send its output to the given Stage, generate tokens for punctuation if boolean sp (for "sendPunct") is true, and suppress unigrams for Asian characters if boolean nuf is true.
UniversalTokenizer(Stage s, boolean sp, boolean nuf, boolean sendWhite)
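          Create a tokenizer that will send its output to the given Stage, generate tokens for punctuation if boolean sp (for "sendPunct") is true, suppress unigrams for Asian characters if boolean nuf is true, and generate punctuation tokens for runs of whitespace if both sp and sendWhite are true.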
 
Method Summary
protected  void addChar()
          Add a character to the buffer that we're building for a token.
protected  boolean checkInitialPunc(char c)
          Determine whether char is acceptable as an initial char of a token.
protected  boolean checkTrailingPunc(char c)
          Determine whether char is acceptable as a final char of a token.
 void flush()
          Finish any final token left in the buffer.
 Tokenizer getTokenizer(Stage s, boolean sp)
          A factory method to get a tokenizer.
protected  void handleChar()
          Handle a character to add to the token buffer.
 void handleLongChar(char c, int b, int l)
          Handles a character that takes up more than one character in a file.
static boolean isAsian(char c)
          A quick check for an Asian character or a character in a language that may not separate words with whitespace (includes Arabic, CJK, and Thai).
protected  boolean isBreakingEvent(int type, int subType)
          Will the given event break a token?
static boolean isDigit(char c)
          A quick check for whether a character is a digit.
static boolean isLetterOrDigit(char c)
          A quick check for whether a character should be kept in a word or should be removed from the word if it occurs at one of the ends.
static boolean isWhitespace(char c)
          A quick check for whether a character is whitespace.
static void main(java.lang.String[] args)
           
protected  void mkToken()
          Break our collected text into as many as three pieces.
 void reset()
          Reset state of tokenizer to clean slate.
static java.lang.String showCodes(java.lang.String charString)
          A function for viewing a string containing nonprintable ASCII and Unicode characters.
 void text(char[] text, int b, int e)
          Handle text passed to us by the markup analyzer.
protected  java.lang.String tokenSubstring(int left, int right)
          Determine token substring to generate for ngram from left to right, where left and right are not yet clamped by the ends of tokenString.
 
Methods inherited from class com.sun.labs.minion.document.tokenizer.Tokenizer
dump, endDocument, endField, getPos, handleFieldData, newProperties, reset, shutdown, startDocument, startField
 
Methods inherited from class com.sun.labs.minion.pipeline.StageAdapter
defineField, getDownstream, getName, punctuation, savedData, setDownstream, setName, token
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

noUnigramsFlag

public boolean noUnigramsFlag
Blocks generation of unigram characters in between character bigrams in runs of Asian characters.
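
As a rough illustration of what this flag changes, the sketch below produces the ngram texts for a run of characters with and without unigrams. It is a hypothetical stand-alone helper (NoUnigramsSketch and ngrams are not part of this class), and it assumes the null+char and char+null transition pairs are still generated when unigrams are blocked.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-alone sketch; not part of the Minion API.
class NoUnigramsSketch {
    // Returns the ngram texts for one run; '0' stands in for the null marker.
    static List<String> ngrams(String run, boolean noUnigrams) {
        List<String> out = new ArrayList<>();
        char prev = '0';
        for (char c : run.toCharArray()) {
            out.add("" + prev + c);           // bigram ending at this character
            if (!noUnigrams) {
                out.add(String.valueOf(c));   // unigram, unless the flag blocks it
            }
            prev = c;
        }
        if (!run.isEmpty()) {
            out.add("" + prev + '0');         // closing char+null pair
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(ngrams("XYZ", false));
        System.out.println(ngrams("XYZ", true));
    }
}
```

With the flag set, only the bigrams remain, which reduces the number of tokens generated per run.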


noBreakCharacters

public java.lang.String noBreakCharacters
Characters that should not cause breaks when simpleFlag is true, even though they may be punctuation characters. Note: You can't make whitespace characters noBreakCharacters even if you put them in this list. This variable is public so that it can be changed.
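
A minimal sketch of the simple-mode rule described here, under stated assumptions: SimpleModeSketch and split are hypothetical names, Character.isLetterOrDigit stands in for the class's own character tests, and punctuation tokens are not emitted (as if sendPunct were false).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-alone sketch; not part of the Minion API.
class SimpleModeSketch {
    // Splits text into words; punctuation breaks words unless it is listed
    // in noBreakCharacters. Whitespace always breaks, even if listed.
    static List<String> split(String text, String noBreakCharacters) {
        List<String> words = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (char c : text.toCharArray()) {
            boolean keep = Character.isLetterOrDigit(c)
                    || (!Character.isWhitespace(c) && noBreakCharacters.indexOf(c) >= 0);
            if (keep) {
                current.append(c);
            } else if (current.length() > 0) {
                words.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            words.add(current.toString());
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(split("A&P opens at 3:00", ""));
        System.out.println(split("A&P opens at 3:00", "&:"));
    }
}
```

Listing "&" and ":" keeps "A&P" and "3:00" whole, while adding a space to the list has no effect because whitespace is checked first.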


continueAsianFlag

protected boolean continueAsianFlag

resumeAsianFlag

protected boolean resumeAsianFlag

transitionString

protected java.lang.String transitionString

redundant

protected boolean redundant

excludeKanaFlag

public static final boolean excludeKanaFlag
See Also:
Constant Field Values

traceFlag

public boolean traceFlag

tokenLength

protected int tokenLength

ngramLength

protected int ngramLength

nullString

protected char[] nullString

tokenString

protected java.lang.String tokenString

logTag

protected static java.lang.String logTag

Constructor Detail

UniversalTokenizer

public UniversalTokenizer()

UniversalTokenizer

public UniversalTokenizer(Stage s)
Create a tokenizer that will send its output to the given Stage.

Parameters:
s - the stage to which the output of the tokenizer will be sent.

UniversalTokenizer

public UniversalTokenizer(Stage s,
                          boolean sp)
Create a tokenizer that will send its output to the given Stage and generate tokens for punctuation if boolean sp (for "sendPunct") is true.

Parameters:
s - The output stage to receive the generated tokens.
sp - Flag indicating whether to transmit punctuation.

UniversalTokenizer

public UniversalTokenizer(Stage s,
                          boolean sp,
                          boolean nuf)
Create a tokenizer that will send its output to the given Stage, generate tokens for punctuation if boolean sp (for "sendPunct") is true, and suppress unigrams for Asian characters if boolean nuf is true.

Parameters:
s - The output stage to receive the generated tokens.
sp - Flag indicating whether to transmit punctuation.
nuf - Flag indicating not to generate unigrams for Asian chars.

UniversalTokenizer

public UniversalTokenizer(Stage s,
                          boolean sp,
                          boolean nuf,
                          boolean sendWhite)
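Create a tokenizer that will send its output to the given Stage, generate tokens for punctuation if boolean sp (for "sendPunct") is true, suppress unigrams for Asian characters if boolean nuf is true, and generate punctuation tokens for runs of whitespace if both sp and sendWhite are true.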

Parameters:
s - The output stage to receive the generated tokens.
sp - Flag indicating whether to transmit punctuation.
nuf - Flag indicating not to generate unigrams for Asian chars.
sendWhite - Flag that causes generation of punctuation tokens for runs of whitespace characters when sp is true.

Method Detail

getTokenizer

public Tokenizer getTokenizer(Stage s,
                              boolean sp)
A factory method to get a tokenizer.

Specified by:
getTokenizer in class Tokenizer

text

public void text(char[] text,
                 int b,
                 int e)
Handle text passed to us by the markup analyzer. Specifically, handle the text that occurs from index b to index e in the text buffer, text.

Specified by:
text in interface Stage
Specified by:
text in interface PipelineStage
Specified by:
text in class Tokenizer
Parameters:
text - The buffer containing text to break into tokens.
b - The beginning position in the text buffer.
e - The ending position in the text buffer. The position p is the character position in the file that corresponds to the character at index b in the text buffer. The index e is the index just beyond the last character of the buffer to be processed.
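
The half-open range convention for b and e (b is the first index processed, e is one past the last) can be sketched as follows. This is a hypothetical stand-alone helper, not the method's actual logic: BufferRangeSketch, words, and describe are illustrative names, and only whitespace is treated as a separator.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-alone sketch; not part of the Minion API.
class BufferRangeSketch {
    // Scans text[b..e) and reports each whitespace-delimited word with
    // substring-style positions (begin inclusive, end exclusive).
    static List<String> words(char[] text, int b, int e) {
        List<String> out = new ArrayList<>();
        int start = -1;                       // -1 means "not inside a word"
        for (int i = b; i < e; i++) {
            boolean ws = Character.isWhitespace(text[i]);
            if (!ws && start < 0) {
                start = i;                    // a word begins here
            } else if (ws && start >= 0) {
                out.add(describe(text, start, i));
                start = -1;
            }
        }
        if (start >= 0) {
            out.add(describe(text, start, e)); // word runs up to the end index
        }
        return out;
    }

    static String describe(char[] text, int begin, int end) {
        return new String(text, begin, end - begin) + "@" + begin + "-" + end;
    }

    public static void main(String[] args) {
        char[] buf = "xx hello world yy".toCharArray();
        System.out.println(words(buf, 3, 14));
    }
}
```

Characters outside the range, such as the leading "xx" and trailing "yy" in the usage above, are never examined. Under the inclusive-end convention used by pipeline events, the end of each reported word would be one less than the value shown.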

handleLongChar

public void handleLongChar(char c,
                           int b,
                           int l)
Handles a character that takes up more than one character in a file. For example, a character entity in an HTML file.

Specified by:
handleLongChar in class Tokenizer
Parameters:
c - The character
b - The beginning position of the character in the document.
l - The length of the character in the document.
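
To illustrate why the length is passed separately, the sketch below tracks token positions when one decoded character (for example '&' from the entity "&amp;") spans several characters of the document. The class LongCharSketch and its methods acceptChar, acceptLongChar, and describe are hypothetical stand-ins, not the tokenizer's own methods.

```java
// Hypothetical stand-alone sketch; not part of the Minion API.
class LongCharSketch {
    private final StringBuilder token = new StringBuilder();
    private int begin = -1;   // document position of the first character
    private int end = -1;     // document position just past the last character

    // An ordinary character occupies exactly one position in the document.
    void acceptChar(char c, int pos) {
        acceptLongChar(c, pos, 1);
    }

    // A decoded character that occupies l positions starting at b.
    void acceptLongChar(char c, int b, int l) {
        if (token.length() == 0) {
            begin = b;
        }
        token.append(c);
        end = b + l;          // the token now extends past the whole entity
    }

    String describe() {
        return token + "@" + begin + "-" + end;
    }

    public static void main(String[] args) {
        // Document text "A&amp;P": 'A' at 0, the entity at 1..5, 'P' at 6.
        LongCharSketch s = new LongCharSketch();
        s.acceptChar('A', 0);
        s.acceptLongChar('&', 1, 5);
        s.acceptChar('P', 6);
        System.out.println(s.describe());
    }
}
```

The token text is three characters long, but its reported span covers all seven characters of the source, so positions still line up with the original document.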

handleChar

protected void handleChar()
Handle a character to add to the token buffer.


addChar

protected void addChar()
Add a character to the buffer that we're building for a token.


flush

public void flush()
Finish any final token left in the buffer.

Specified by:
flush in class Tokenizer

reset

public void reset()
Reset state of tokenizer to clean slate.

Overrides:
reset in class Tokenizer

isBreakingEvent

protected boolean isBreakingEvent(int type,
                                  int subType)
Will the given event break a token?


mkToken

protected void mkToken()
Break our collected text into as many as three pieces. The three pieces are the preToken, the token, and the postToken. The preToken includes any initial punctuation that is removed from the token, the token is the word, and the postToken is any final punctuation that is removed from the token.
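
The three-way split can be sketched as below. This is only an approximation under stated assumptions: ThreePieceSketch and split are hypothetical names, and Character.isLetterOrDigit replaces the checkInitialPunc and checkTrailingPunc decisions that the real method may use to keep some punctuation attached.

```java
import java.util.Arrays;

// Hypothetical stand-alone sketch; not part of the Minion API.
class ThreePieceSketch {
    // Returns { preToken, token, postToken } for the collected text.
    static String[] split(String collected) {
        int left = 0;
        int right = collected.length();
        // Skip leading characters that are not letters or digits.
        while (left < right && !Character.isLetterOrDigit(collected.charAt(left))) {
            left++;
        }
        // Skip trailing characters that are not letters or digits.
        while (right > left && !Character.isLetterOrDigit(collected.charAt(right - 1))) {
            right--;
        }
        return new String[] {
            collected.substring(0, left),
            collected.substring(left, right),
            collected.substring(right)
        };
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(split("(hello!)")));
    }
}
```

Any of the three pieces may be empty, for instance when the collected text has no surrounding punctuation or consists of punctuation only.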


tokenSubstring

protected java.lang.String tokenSubstring(int left,
                                          int right)
Determine token substring to generate for ngram from left to right, where left and right are not yet clamped by the ends of tokenString.

Parameters:
left - The index of the first character of the ngram token.
right - The index of the position after the last character.
Returns:
a substring of the token
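
The clamping of left and right to the ends of the token string might look like the following sketch. ClampSketch and clampedSubstring are hypothetical names, and the real method may do more than clamp (for example, adding null markers at the ends), which this sketch does not attempt.

```java
// Hypothetical stand-alone sketch; not part of the Minion API.
class ClampSketch {
    // Returns tokenString[left..right) after clamping both indices
    // into the range 0..tokenString.length().
    static String clampedSubstring(String tokenString, int left, int right) {
        int lo = Math.max(0, left);
        int hi = Math.min(tokenString.length(), right);
        return lo < hi ? tokenString.substring(lo, hi) : "";
    }

    public static void main(String[] args) {
        System.out.println(clampedSubstring("XYZ", -1, 1));
        System.out.println(clampedSubstring("XYZ", 2, 4));
    }
}
```

An ngram window that hangs over either end of the string is shortened to the part that overlaps it, and a window entirely outside the string yields an empty result.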

checkTrailingPunc

protected boolean checkTrailingPunc(char c)
Determine whether char is acceptable as a final char of a token.

Parameters:
c - The char to be tested.

checkInitialPunc

protected boolean checkInitialPunc(char c)
Determine whether char is acceptable as an initial char of a token.

Parameters:
c - The char to be tested.

isLetterOrDigit

public static final boolean isLetterOrDigit(char c)
A quick check for whether a character should be kept in a word or should be removed from the word if it occurs at one of the ends. It approximates Character.isLetterOrDigit but is faster and more correct, since it doesn't count the smart quotes as letters.

Parameters:
c - The character to check.

isDigit

public static final boolean isDigit(char c)
A quick check for whether a character is a digit.

Parameters:
c - The character to check

isWhitespace

public static final boolean isWhitespace(char c)
A quick check for whether a character is whitespace.

Parameters:
c - The character to check

isAsian

public static final boolean isAsian(char c)
A quick check for an Asian character or a character in a language that may not separate words with whitespace (includes Arabic, CJK, and Thai). Uses Unicode Standard Version 2.0.

Parameters:
c - The character to check
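
A quick check of this kind is typically a handful of code point range comparisons. The sketch below is an assumption about the general approach, not the method's actual table: ScriptRangeSketch and mayLackWordSpaces are hypothetical names, and the ranges are the standard Unicode blocks for Arabic, Thai, Hiragana, Katakana, CJK Unified Ideographs, and Hangul Syllables, which may differ from the set the real method tests.

```java
// Hypothetical stand-alone sketch; not part of the Minion API.
class ScriptRangeSketch {
    // True for characters in scripts that may not separate words with whitespace.
    static boolean mayLackWordSpaces(char c) {
        if (c < 0x0600) {
            return false;                       // fast path: Latin, Greek, Cyrillic, ...
        }
        return (c <= 0x06FF)                    // Arabic (U+0600 to U+06FF)
            || (c >= 0x0E00 && c <= 0x0E7F)     // Thai
            || (c >= 0x3040 && c <= 0x30FF)     // Hiragana and Katakana
            || (c >= 0x4E00 && c <= 0x9FFF)     // CJK Unified Ideographs
            || (c >= 0xAC00 && c <= 0xD7AF);    // Hangul Syllables
    }

    public static void main(String[] args) {
        System.out.println(mayLackWordSpaces('a'));
        System.out.println(mayLackWordSpaces((char) 0x4E2D));
    }
}
```

Putting the low-range rejection first keeps the common case of Latin text down to a single comparison.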

showCodes

public static final java.lang.String showCodes(java.lang.String charString)
A function for viewing a string containing nonprintable ASCII and Unicode characters.

Parameters:
charString - the string to convert
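
One way such a viewer could work is to pass printable ASCII through and render everything else as a four-digit hexadecimal escape. The output format of the real method is not documented here, so the format below is purely an assumption, and ShowCodesSketch and show are hypothetical names.

```java
// Hypothetical stand-alone sketch; not part of the Minion API.
class ShowCodesSketch {
    // Keeps printable ASCII as-is and renders every other character as a
    // backslash, the letter u, and four uppercase hex digits.
    static String show(String charString) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < charString.length(); i++) {
            char c = charString.charAt(i);
            if (c >= 0x20 && c < 0x7F) {
                sb.append(c);
            } else {
                sb.append(String.format("\\u%04X", (int) c));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(show("a\tb"));
    }
}
```

This makes tabs, control characters, and non-ASCII characters visible when inspecting tokenizer input or trace output.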

main

public static void main(java.lang.String[] args)
                 throws java.io.IOException
Throws:
java.io.IOException