com.sun.labs.minion.pipeline
Class Token

java.lang.Object
  extended by com.sun.labs.minion.pipeline.Token
All Implemented Interfaces:
FieldOccurrence, Occurrence

public class Token
extends java.lang.Object
implements FieldOccurrence

A class encapsulating all of our knowledge about a given token. Instances of this class are passed down an indexing pipeline as they are parsed from the file.
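
As a rough illustration of the description above, the following sketch shows a tokenizer creating one token per word and handing the results downstream. It does not use the real class: `SketchToken` is a hypothetical stand-in that mirrors only the four-argument constructor and a few getters documented on this page, and `PipelineSketch.tokenize` is invented for the example. The word-numbering base (1) and the exclusive end offset are assumptions, since this page does not state either convention.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for Token; it mirrors only the four-argument
// constructor and a few getters documented on this page.
class SketchToken {
    private final String token;
    private final int wordNum;
    private final int start;
    private final int end;

    SketchToken(String token, int wordNum, int start, int end) {
        this.token = token;
        this.wordNum = wordNum;
        this.start = start;
        this.end = end;
    }

    String getToken() { return token; }
    int getWordNum() { return wordNum; }
    int getStart() { return start; }
    int getEnd() { return end; }
}

class PipelineSketch {
    // Splits on single spaces and records the word number and character
    // offsets of each word. Numbering from 1 and an exclusive end offset
    // are assumptions made for this sketch.
    static List<SketchToken> tokenize(String text) {
        List<SketchToken> out = new ArrayList<>();
        int wordNum = 0;
        int i = 0;
        while (i < text.length()) {
            if (text.charAt(i) == ' ') {
                i++;
                continue;
            }
            int start = i;
            while (i < text.length() && text.charAt(i) != ' ') {
                i++;
            }
            out.add(new SketchToken(text.substring(start, i), ++wordNum, start, i));
        }
        return out;
    }

    public static void main(String[] args) {
        // A downstream stage would consume these tokens one at a time.
        for (SketchToken t : tokenize("indexing is fun")) {
            System.out.println(t.getWordNum() + " " + t.getToken()
                    + " [" + t.getStart() + "," + t.getEnd() + ")");
        }
    }
}
```

With the real class, the equivalent call would be `new Token(text, wordNum, start, end)`, using the constructor listed in the Constructor Summary.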


Field Summary
static int BIGRAM
           
protected  boolean containsDigits
          Indicates whether this token contains digits. (The taxonomy classifier ignores such tokens.)
protected  int count
          The occurrence count for this token.
protected  int end
          The ending character offset for the token.
protected  int[] fields
          A set of fields active for this token.
protected  int id
          An ID assigned to this token.
static int NORMAL
           
static int PUNCT
           
protected  int start
          The starting character offset for the token.
protected  java.lang.String token
          The string for a token.
protected  int type
          The type of this token: standard (NORMAL), bigram (BIGRAM), or punctuation (PUNCT).
protected  int wordNum
          The ordinal number of this word in the document.
 
Constructor Summary
Token()
           
Token(java.lang.String token, int count)
          Creates a token.
Token(java.lang.String token, int wordNum, int type)
          Creates a token.
Token(java.lang.String token, int wordNum, int start, int end)
          Creates a token that can be passed down the pipeline.
Token(java.lang.String token, int wordNum, int type, int start, int end)
          Creates a token that can be passed down the pipeline.
Token(java.lang.String token, int wordNum, int type, int start, int end, int count)
          Creates a token that can be passed down the pipeline.
 
Method Summary
 boolean containsDigits()
           
 int getCount()
          Gets the count of occurrences for this token.
 int getEnd()
           
 int[] getFields()
          Gets the fields that are active at the time of the occurrence.
 int getID()
          Gets the ID of the term in this occurrence.
 int getPos()
          Gets the position at which the occurrence was found.
 int getStart()
           
 java.lang.String getToken()
           
 int getType()
           
 int getWordNum()
           
 void incrWordNum()
           
 int length()
           
 Token reset(java.lang.String token, int wordNum, int start, int end)
           
 Token reset(java.lang.String token, int wordNum, int type, int start, int end)
           
 Token reset(java.lang.String token, int wordNum, int type, int start, int end, int count)
           
 void setCount(int count)
          Sets the count of occurrences that this occurrence represents.
 void setFields(int[] fields)
           
 void setID(int id)
          Sets the ID for this token.
 void setPos(int pos)
          Sets the position for this token.
 void setToken(java.lang.String token)
          Replaces the token string; declared public, but intended for package-internal use only.
 void setType(int type)
           
 void setWordNum(int wordNum)
          Sets the word number for this token.
 java.lang.String toString()
           
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Field Detail

token

protected java.lang.String token
The string for a token.


wordNum

protected int wordNum
The ordinal number of this word in the document.


type

protected int type
The type of this token: standard (NORMAL), bigram (BIGRAM), or punctuation (PUNCT).


start

protected int start
The starting character offset for the token.


end

protected int end
The ending character offset for the token.


count

protected int count
The occurrence count for this token.


id

protected int id
An ID assigned to this token.


fields

protected int[] fields
A set of fields active for this token.


containsDigits

protected boolean containsDigits
Indicates whether this token contains digits. (The taxonomy classifier ignores such tokens.)


NORMAL

public static final int NORMAL
See Also:
Constant Field Values

BIGRAM

public static final int BIGRAM
See Also:
Constant Field Values

PUNCT

public static final int PUNCT
See Also:
Constant Field Values

Constructor Detail

Token

public Token()

Token

public Token(java.lang.String token,
             int count)
Creates a token.


Token

public Token(java.lang.String token,
             int wordNum,
             int type)
Creates a token.


Token

public Token(java.lang.String token,
             int wordNum,
             int start,
             int end)
Creates a token that can be passed down the pipeline.

Parameters:
token - The string tokenized from the input data.
wordNum - The ordinal word number of this token in the indexed material.
start - The starting character offset of this token.
end - The ending character offset of this token.

Token

public Token(java.lang.String token,
             int wordNum,
             int type,
             int start,
             int end)
Creates a token that can be passed down the pipeline.

Parameters:
token - The string tokenized from the input data.
wordNum - The ordinal word number of this token in the indexed material.
type - The type of this token, from our constant types.
start - The starting character offset of this token.
end - The ending character offset of this token.

Token

public Token(java.lang.String token,
             int wordNum,
             int type,
             int start,
             int end,
             int count)
Creates a token that can be passed down the pipeline.

Parameters:
token - The string tokenized from the input data.
wordNum - The ordinal word number of this token in the indexed material.
type - The type of this token, from our constant types.
start - The starting character offset of this token.
end - The ending character offset of this token.
count - The occurrence count for this token.

Method Detail

reset

public Token reset(java.lang.String token,
                   int wordNum,
                   int type,
                   int start,
                   int end)

reset

public Token reset(java.lang.String token,
                   int wordNum,
                   int start,
                   int end)

reset

public Token reset(java.lang.String token,
                   int wordNum,
                   int type,
                   int start,
                   int end,
                   int count)

length

public int length()

getToken

public java.lang.String getToken()

setToken

public void setToken(java.lang.String token)
Although this method is declared public, it is intended for package-internal use only. During classification, the token string is reset to its stemmed form. Otherwise, this object should be treated as immutable.


getType

public int getType()

setType

public void setType(int type)

getWordNum

public int getWordNum()

incrWordNum

public void incrWordNum()

getStart

public int getStart()

getEnd

public int getEnd()

toString

public java.lang.String toString()
Overrides:
toString in class java.lang.Object

getID

public int getID()
Gets the ID of the term in this occurrence.

Specified by:
getID in interface Occurrence
Returns:
the ID for the term.

setID

public void setID(int id)
Sets the ID for this token.

Specified by:
setID in interface Occurrence
Parameters:
id - the ID.

getCount

public int getCount()
Gets the count of occurrences for this token.

Specified by:
getCount in interface Occurrence
Returns:
the number of occurrences.

setWordNum

public void setWordNum(int wordNum)
Sets the word number for this token.


setCount

public void setCount(int count)
Sets the count of occurrences that this occurrence represents.

Specified by:
setCount in interface Occurrence
Parameters:
count - the number of occurrences.

getPos

public int getPos()
Gets the position at which the occurrence was found.

Specified by:
getPos in interface FieldOccurrence
Returns:
the position where the occurrence was found.

setPos

public void setPos(int pos)
Sets the position for this token.


getFields

public int[] getFields()
Gets the fields that are active at the time of the occurrence.

Specified by:
getFields in interface FieldOccurrence
Returns:
an array that is as long as the number of defined fields. The ith element of this array indicates the current position in the field whose ID is i. If element 0 of this array is greater than zero, then no fields are currently active.
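
The return-value convention above can be sketched as follows. This is a hedged illustration, not part of the library: `FieldsSketch` and `activeFieldIds` are hypothetical names, and the rule that a field with ID i is active when element i is greater than zero is an assumption that this page does not confirm. Only the meaning of element 0 is taken from the description.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper that interprets an array shaped like the one
// returned by getFields(). Not part of the documented API.
class FieldsSketch {
    static List<Integer> activeFieldIds(int[] fields) {
        List<Integer> active = new ArrayList<>();
        // Documented: element 0 greater than zero means no fields are active.
        if (fields == null || fields.length == 0 || fields[0] > 0) {
            return active;
        }
        for (int id = 1; id < fields.length; id++) {
            // Assumption: a positive position marks the field as active.
            if (fields[id] > 0) {
                active.add(id);
            }
        }
        return active;
    }

    public static void main(String[] args) {
        System.out.println(activeFieldIds(new int[] {0, 3, 0, 7}));
        System.out.println(activeFieldIds(new int[] {1, 0, 0, 0}));
    }
}
```

With a real token, the argument would be the array returned by `getFields()`.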

setFields

public void setFields(int[] fields)

containsDigits

public boolean containsDigits()