UniversalTokenizer (Minion Search Engine)

com.sun.labs.minion.document.tokenizer
Class UniversalTokenizer

java.lang.Object
  com.sun.labs.minion.pipeline.StageAdapter
      com.sun.labs.minion.document.tokenizer.Tokenizer
          com.sun.labs.minion.document.tokenizer.UniversalTokenizer

All Implemented Interfaces:: Stage, PipelineStage, com.sun.labs.util.props.Component, com.sun.labs.util.props.Configurable

public class UniversalTokenizer
extends Tokenizer

A class for tokenizing text in any language, including mixed-language material. This particular subclass is meant for special use in synchronous pipelines, where all stages run in the same thread. This leads to a lot of (almost total) code duplication between tokenizers, but there doesn't appear to be a clean way to do it without having a conditional statement for each token we want to add.

This tokenizer uses bigram tokenization whenever it encounters characters in the CJK range (Chinese, Japanese, and Korean), and otherwise uses punctuation and whitespace as separators between words. Within runs of CJK characters, the tokenizer ignores end-of-lines and similar breaks rather than treating them as white space, and it generates sequences of overlapping ngrams of length one and two for every character and every character pair in the sequence. At the beginning of such a sequence it generates a null+char transition pair, and at the end it generates a char+null transition pair. For example, a sequence of three Chinese characters, XYZ, would tokenize as 0X, X, XY, Y, YZ, Z, Z0, representing the fact that the sequence starts with X, ends with Z, and contains the other listed characters and pairs in the listed order.

In reporting beginning and ending positions for these tokens, the single character tokens begin at the position of the character and end one position later, while the bigram (pair) tokens have zero length, beginning and ending at the position of their second character (i.e., the point in between the two characters).

This tokenizer can run in either a smart or a simple mode, depending on the setting of the static final flag simpleFlag. When simpleFlag is true, any punctuation character causes a word break and, if sendPunct is true, generates a punctuation token. When simpleFlag is false, the tokenizer uses specialized knowledge to tokenize as whole words many common notations that include some punctuation marks, such as: "3:00", "http://www.sun.com", "william.woods@sun.com", "A&P", "U.S.", "5/15/02", "1,024", "3.1415", etc.

There are also provisions for tracing the behavior when the static final boolean authorFlag is true. This is useful when making modifications to the tokenization logic, or for understanding the behavior of the tokenizer in detail. When authorFlag is true, the public variable traceFlag can be used to turn tracing on and off.

Note that the tokens generated by the tokenizer use beginning and ending position conventions similar to substrings in a string: they specify the position of the first character in the token and the character position just beyond the last character of the token. This differs from the current convention for the pipeline events that the tokenizer handles, in which the ending position is the character position of the last character of the event. (?? Should this convention be changed?)
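The CJK ngram scheme and its position convention can be illustrated with a small standalone sketch. This is not the class's actual code: the names `CjkNgramSketch`, `Tok`, and `ngrams` are hypothetical, the character '0' stands in for the null marker as in the XYZ example above, and the positions given to the two boundary transition pairs (0X and Z0) are an assumption that simply extends the stated "position of the second character" rule.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration only; not part of the Minion API. */
class CjkNgramSketch {

    /** A token with substring-style positions: start inclusive, end exclusive. */
    static final class Tok {
        final String text;
        final int start;
        final int end;

        Tok(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return text + "[" + start + "," + end + ")";
        }
    }

    /**
     * Emits transition pairs, unigrams, and bigrams for one run of CJK
     * characters whose first character sits at document position pos.
     */
    static List<Tok> ngrams(String run, int pos) {
        List<Tok> out = new ArrayList<>();
        if (run.isEmpty()) {
            return out;
        }
        char prev = '0'; // stand-in for the null marker
        for (int i = 0; i < run.length(); i++) {
            char c = run.charAt(i);
            int p = pos + i;
            // Pair token: zero length, located at its second character.
            out.add(new Tok("" + prev + c, p, p));
            // Unigram token: covers exactly one character.
            out.add(new Tok(String.valueOf(c), p, p + 1));
            prev = c;
        }
        // Closing char+null pair; its position is an assumption of this sketch.
        int end = pos + run.length();
        out.add(new Tok("" + prev + '0', end, end));
        return out;
    }

    public static void main(String[] args) {
        for (Tok t : ngrams("XYZ", 10)) {
            System.out.println(t);
        }
    }
}
```

For the run XYZ starting at position 10, the sketch emits the tokens in the order 0X, X, XY, Y, YZ, Z, Z0, with X spanning 10 to 11 and XY sitting at 11 with zero length.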

Field Summary
`protected boolean`	`continueAsianFlag`
`static boolean`	`excludeKanaFlag`
`protected static java.lang.String`	`logTag`
`protected int`	`ngramLength`
`java.lang.String`	`noBreakCharacters` Characters that should not cause breaks when simpleFlag is true, even though they may be punctuation characters.
`boolean`	`noUnigramsFlag` Blocks generation of unigram characters in between character bigrams in runs of Asian characters.
`protected char[]`	`nullString`
`protected boolean`	`redundant`
`protected boolean`	`resumeAsianFlag`
`protected int`	`tokenLength`
`protected java.lang.String`	`tokenString`
`boolean`	`traceFlag`
`protected java.lang.String`	`transitionString`

Fields inherited from class com.sun.labs.minion.document.tokenizer.Tokenizer
`dataSaved, indexed, makeTokens, maxTokLen, PROP_SEND_PUNCT, PROP_SEND_WHITE, saveData, savedData, savedLen, sendPunct, sendWhite, trimSpaces, wordNum`

Fields inherited from class com.sun.labs.minion.pipeline.StageAdapter
`downstream, name`

Constructor Summary
`UniversalTokenizer()`
`UniversalTokenizer(Stage s)` Create a tokenizer that will send its output to the given `Stage`.
`UniversalTokenizer(Stage s, boolean sp)` Create a tokenizer that will send its output to the given `Stage` and generate tokens for punctuation if boolean sp (for "sendPunct") is true.
`UniversalTokenizer(Stage s, boolean sp, boolean nuf)` Create a tokenizer that will send its output to the given `Stage`, generate tokens for punctuation if boolean sp (for "sendPunct") is true, and not generate unigrams for Asian characters if boolean nuf is true.
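`UniversalTokenizer(Stage s, boolean sp, boolean nuf, boolean sendWhite)` Create a tokenizer that will send its output to the given `Stage`, generate tokens for punctuation if boolean sp (for "sendPunct") is true, not generate unigrams for Asian characters if boolean nuf is true, and generate punctuation tokens for runs of whitespace if sendWhite and sp are both true.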

Method Summary
`protected void`	`addChar()` Add a character to the buffer that we're building for a token.
`protected boolean`	`checkInitialPunc(char c)` Determine whether char is acceptable as an initial char of a token.
`protected boolean`	`checkTrailingPunc(char c)` Determine whether char is acceptable as a final char of a token.
`void`	`flush()` Finish any final token left in the buffer.
`Tokenizer`	`getTokenizer(Stage s, boolean sp)` A factory method to get a tokenizer.
`protected void`	`handleChar()` Handle a character to add to the token buffer.
`void`	`handleLongChar(char c, int b, int l)` Handles a character that takes up more than one character in a file.
`static boolean`	`isAsian(char c)` A quick check for an Asian character, or a character in a language that may not separate words with whitespace (includes Arabic, CJK, and Thai).
`protected boolean`	`isBreakingEvent(int type, int subType)` Will the given event break a token?
`static boolean`	`isDigit(char c)` A quick check for whether a character is a digit.
`static boolean`	`isLetterOrDigit(char c)` A quick check for whether a character should be kept in a word or should be removed from the word if it occurs at one of the ends.
`static boolean`	`isWhitespace(char c)` A quick check for whether a character is whitespace.
`static void`	`main(java.lang.String[] args)`
`protected void`	`mkToken()` Break our collected text into as many as three pieces.
`void`	`reset()` Reset state of tokenizer to clean slate.
`static java.lang.String`	`showCodes(java.lang.String charString)` A function for viewing a string containing nonprintable ASCII and Unicode characters.
`void`	`text(char[] text, int b, int e)` Handle text passed to us by the markup analyzer.
`protected java.lang.String`	`tokenSubstring(int left, int right)` Determine token substring to generate for ngram from left to right, where left and right are not yet clamped by the ends of tokenString.

Methods inherited from class com.sun.labs.minion.document.tokenizer.Tokenizer
`dump, endDocument, endField, getPos, handleFieldData, newProperties, reset, shutdown, startDocument, startField`

Methods inherited from class com.sun.labs.minion.pipeline.StageAdapter
`defineField, getDownstream, getName, punctuation, savedData, setDownstream, setName, token`

Methods inherited from class java.lang.Object
`clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait`

Field Detail

noUnigramsFlag

public boolean noUnigramsFlag

Blocks generation of unigram characters in between character bigrams in runs of Asian characters.

noBreakCharacters

public java.lang.String noBreakCharacters

Characters that should not cause breaks when simpleFlag is true, even though they may be punctuation characters. Note: You can't make whitespace characters noBreakCharacters even if you put them in this list. This variable is public so that it can be changed.
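As a rough illustration of the simple-mode rule described here, the following standalone sketch breaks words at every character that is neither a letter nor a digit, except for non-whitespace characters listed in a no-break string. The class and method names are hypothetical, and the real tokenizer's logic may well differ in its details.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration only; not part of the Minion API. */
class SimpleModeSketch {

    /**
     * Splits text into words. Punctuation breaks a word unless it appears in
     * noBreakCharacters; whitespace always breaks, even if listed there.
     */
    static List<String> split(String text, String noBreakCharacters) {
        List<String> words = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean keep = Character.isLetterOrDigit(c)
                    || (!Character.isWhitespace(c)
                        && noBreakCharacters.indexOf(c) >= 0);
            if (keep) {
                current.append(c);
            } else if (current.length() > 0) {
                words.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            words.add(current.toString());
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(split("A&P open 24/7", ""));
        System.out.println(split("A&P open 24/7", "&"));
    }
}
```

With an empty no-break string, "A&P" splits into "A" and "P"; adding "&" to the string keeps "A&P" together, while a listed space character still separates words.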

continueAsianFlag

protected boolean continueAsianFlag

resumeAsianFlag

protected boolean resumeAsianFlag

transitionString

protected java.lang.String transitionString

redundant

protected boolean redundant

excludeKanaFlag

public static final boolean excludeKanaFlag

See Also:: Constant Field Values

traceFlag

public boolean traceFlag

tokenLength

protected int tokenLength

ngramLength

protected int ngramLength

nullString

protected char[] nullString

tokenString

protected java.lang.String tokenString

logTag

protected static java.lang.String logTag

Constructor Detail

UniversalTokenizer

public UniversalTokenizer()

UniversalTokenizer

public UniversalTokenizer(Stage s)

Create a tokenizer that will send its output to the given Stage.

Parameters:: s - the stage to which the output of the tokenizer will be sent.

UniversalTokenizer

public UniversalTokenizer(Stage s,
                          boolean sp)

Create a tokenizer that will send its output to the given Stage and generate tokens for punctuation if boolean sp (for "sendPunct") is true.

Parameters:: s - The output stage to receive the generated tokens.; sp - Flag indicating whether to transmit punctuation.

UniversalTokenizer

public UniversalTokenizer(Stage s,
                          boolean sp,
                          boolean nuf)

Create a tokenizer that will send its output to the given Stage, generate tokens for punctuation if boolean sp (for "sendPunct") is true, and not generate unigrams for Asian characters if boolean nuf is true.

Parameters:: s - The output stage to receive the generated tokens.; sp - Flag indicating whether to transmit punctuation.; nuf - Flag indicating not to generate unigrams for Asian chars.

UniversalTokenizer

public UniversalTokenizer(Stage s,
                          boolean sp,
                          boolean nuf,
                          boolean sendWhite)

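Create a tokenizer that will send its output to the given Stage, generate tokens for punctuation if boolean sp (for "sendPunct") is true, not generate unigrams for Asian characters if boolean nuf is true, and generate punctuation tokens for runs of whitespace if sendWhite and sp are both true.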

Parameters:: s - The output stage to receive the generated tokens.; sp - Flag indicating whether to transmit punctuation.; nuf - Flag indicating not to generate unigrams for Asian chars.; sendWhite - Flag that causes generation of punctuation tokens for runs of whitespace characters when sp is true.

Method Detail

getTokenizer

public Tokenizer getTokenizer(Stage s,
                              boolean sp)

A factory method to get a tokenizer.

Specified by:: getTokenizer in class Tokenizer

text

public void text(char[] text,
                 int b,
                 int e)

Handle text passed to us by the markup analyzer. Specifically, handle the text that occurs from index b to index e in the text buffer, text.

Specified by:: text in interface Stage
Specified by:: text in interface PipelineStage
Specified by:: text in class Tokenizer

Parameters:: text - The buffer containing text to break into tokens.; b - The beginning position in the text buffer.; e - The ending position in the text buffer. The position p is the character position in the file that corresponds to the character at index b in the text buffer. The index e is the index just beyond the last character of the buffer to be processed.
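The index convention for b and e (b is the first index processed, e is the index just beyond the last) matches the half-open ranges used by Java substrings. The sketch below only demonstrates that convention on a plain char buffer; `TextRangeSketch` and its methods are hypothetical helpers, not part of the tokenizer.

```java
/** Hypothetical illustration only; not part of the Minion API. */
class TextRangeSketch {

    /** Returns the characters covered by the range: b inclusive, e exclusive. */
    static String covered(char[] text, int b, int e) {
        return new String(text, b, e - b);
    }

    /**
     * Converts an exclusive end index to the inclusive "last character"
     * index used by the other convention mentioned in the class description.
     */
    static int inclusiveEnd(int exclusiveEnd) {
        return exclusiveEnd - 1;
    }

    public static void main(String[] args) {
        char[] buf = "say hello world".toCharArray();
        System.out.println(covered(buf, 4, 9));   // the five characters of "hello"
        System.out.println(inclusiveEnd(9));      // index of the final 'o'
    }
}
```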

handleLongChar

public void handleLongChar(char c,
                           int b,
                           int l)

Handles a character that takes up more than one character in a file. For example, a character entity in an HTML file.

Specified by:: handleLongChar in class Tokenizer

Parameters:: c - The character; b - The beginning position of the character in the document.; l - The length of the character in the document.

handleChar

protected void handleChar()

Handle a character to add to the token buffer.

addChar

protected void addChar()

Add a character to the buffer that we're building for a token.

flush

public void flush()

Finish any final token left in the buffer.

Specified by:: flush in class Tokenizer

reset

public void reset()

Reset state of tokenizer to clean slate.

Overrides:: reset in class Tokenizer

isBreakingEvent

protected boolean isBreakingEvent(int type,
                                  int subType)

Will the given event break a token?

mkToken

protected void mkToken()

Break our collected text into as many as three pieces. The three pieces are the preToken, the token, and the postToken. The preToken includes any initial punctuation that is removed from the token, the token is the word, and the postToken is any final punctuation that is removed from the token.
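The three-way split can be sketched as follows. This is a simplified stand-in, not the method's actual logic: it uses the standard Character.isLetterOrDigit to find the edges of the word, whereas the real class consults checkInitialPunc and checkTrailingPunc and so may keep some leading or trailing punctuation inside the token. The name `ThreePieceSketch` is hypothetical.

```java
/** Hypothetical illustration only; not part of the Minion API. */
class ThreePieceSketch {

    /** Returns an array of {preToken, token, postToken}. */
    static String[] split(String collected) {
        int start = 0;
        int end = collected.length();
        // Skip leading characters that are not letters or digits.
        while (start < end
                && !Character.isLetterOrDigit(collected.charAt(start))) {
            start++;
        }
        // Skip trailing characters that are not letters or digits.
        while (end > start
                && !Character.isLetterOrDigit(collected.charAt(end - 1))) {
            end--;
        }
        return new String[] {
            collected.substring(0, start),
            collected.substring(start, end),
            collected.substring(end)
        };
    }

    public static void main(String[] args) {
        String[] parts = split("(hello),");
        System.out.println(String.join(" | ", parts));
    }
}
```

For the input "(hello)," the sketch yields the preToken "(", the token "hello", and the postToken "),".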

tokenSubstring

protected java.lang.String tokenSubstring(int left,
                                          int right)

Determine token substring to generate for ngram from left to right, where left and right are not yet clamped by the ends of tokenString.

Parameters:: left - The index of the first character of the ngram token.; right - The index of the position after the last character.
Returns:: a substring of the token

checkTrailingPunc

protected boolean checkTrailingPunc(char c)

Determine whether char is acceptable as a final char of a token.

Parameters:: c - The char to be tested.

checkInitialPunc

protected boolean checkInitialPunc(char c)

Determine whether char is acceptable as an initial char of a token.

Parameters:: c - The char to be tested.

isLetterOrDigit

public static final boolean isLetterOrDigit(char c)

A quick check for whether a character should be kept in a word or should be removed from the word if it occurs at one of the ends. An approximation of Character.isLetterOrDigit, but is faster and more correct, since it doesn't count the smart quotes as letters.

Parameters:: c - The character to check.

isDigit

public static final boolean isDigit(char c)

A quick check for whether a character is a digit.

Parameters:: c - The character to check

isWhitespace

public static final boolean isWhitespace(char c)

A quick check for whether a character is whitespace.

Parameters:: c - The character to check

isAsian

public static final boolean isAsian(char c)

A quick check for an Asian character, or a character in a language that may not separate words with whitespace (includes Arabic, CJK, and Thai). Uses Unicode Standard Version 2.0.

Parameters:: c - The character to check
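A comparable check can be approximated with the standard library's Character.UnicodeScript. The sketch below is an assumption-laden stand-in: it relies on the script data of the running JDK rather than the Unicode 2.0 tables this method uses, it does not model excludeKanaFlag, and the name `AsianCheckSketch` is hypothetical.

```java
/** Hypothetical illustration only; not part of the Minion API. */
class AsianCheckSketch {

    /**
     * Returns true for characters in scripts that commonly do not separate
     * words with whitespace. The script list is a choice made for this
     * sketch, not the class's actual table.
     */
    static boolean looksAsian(char c) {
        switch (Character.UnicodeScript.of(c)) {
            case HAN:
            case HIRAGANA:
            case KATAKANA:
            case HANGUL:
            case THAI:
            case ARABIC:
                return true;
            default:
                return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(looksAsian('\u4E2D')); // a Han character
        System.out.println(looksAsian('a'));      // a Latin letter
    }
}
```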

showCodes

public static final java.lang.String showCodes(java.lang.String charString)

A function for viewing a string containing nonprintable ASCII and Unicode characters.

Parameters:: charString - the string to convert
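The output format of showCodes is not documented here. As one plausible rendering, the sketch below leaves printable ASCII untouched and replaces every other character with a `\uXXXX` style escape. The format and the name `ShowCodesSketch` are assumptions made for illustration.

```java
/** Hypothetical illustration only; not part of the Minion API. */
class ShowCodesSketch {

    /**
     * Copies printable ASCII as-is and writes every other character as a
     * backslash, the letter u, and four uppercase hex digits.
     */
    static String show(String charString) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < charString.length(); i++) {
            char c = charString.charAt(i);
            if (c >= 0x20 && c < 0x7F) {
                sb.append(c);
            } else {
                sb.append(String.format("\\u%04X", (int) c));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(show("a\tb"));
    }
}
```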

main

public static void main(java.lang.String[] args)
                 throws java.io.IOException

Throws:: java.io.IOException
