com.sun.labs.minion.document.tokenizer
Class JCCTokenizer

java.lang.Object
  extended by com.sun.labs.minion.pipeline.StageAdapter
      extended by com.sun.labs.minion.document.tokenizer.Tokenizer
          extended by com.sun.labs.minion.document.tokenizer.JCCTokenizer
All Implemented Interfaces:
JCCTokenizerConstants, Stage, PipelineStage, com.sun.labs.util.props.Component, com.sun.labs.util.props.Configurable

public class JCCTokenizer
extends Tokenizer
implements JCCTokenizerConstants


Field Summary
protected  java.lang.StringBuilder buildUp
          A place to build up strings across tokens, if we need to.
protected  boolean isNgram
          Whether the data built up so far is for an n-gram tokenized language.
 Token jj_nt
           
protected static java.lang.String logTag
           
protected  java.util.regex.Pattern noBreakChars
          A regular expression pattern of characters for which we should not break tokens.
protected static java.lang.String PROP_NO_BREAK_CHARS
           
protected  CharArrayReader reader
          A reusable reader for the characters that we'll be passed.
 Token token
           
 JCCTokenizerTokenManager token_source
           
 
Fields inherited from class com.sun.labs.minion.document.tokenizer.Tokenizer
dataSaved, indexed, logger, makeTokens, maxTokLen, pos, PROP_SEND_PUNCT, PROP_SEND_WHITE, saveData, savedData, savedLen, sendPunct, sendWhite, trimSpaces, wordNum
 
Fields inherited from class com.sun.labs.minion.pipeline.StageAdapter
downstream, name
 
Fields inherited from interface com.sun.labs.minion.document.tokenizer.JCCTokenizerConstants
DEFAULT, EOF, NGRAMTOKEN, NONSPACESEPCHAR, PUNCTUATION, SPACESEPCHAR, SPACESEPCHAR1, SPACESEPCHAR2, SPACESEPCHAR3, SPACESEPCHAR4, SPACESEPCHAR5, SPACESEPCHAR6, SPACESEPCHAR7, SPACESEPCHAR8, SPACESEPCHAR9, SPACESEPTOKEN, tokenImage, WHITECHAR, WHITESPACE
 
Constructor Summary
JCCTokenizer()
           
JCCTokenizer(java.io.InputStream stream)
           
JCCTokenizer(java.io.InputStream stream, java.lang.String encoding)
           
JCCTokenizer(JCCTokenizerTokenManager tm)
           
JCCTokenizer(java.io.Reader stream)
           
JCCTokenizer(Stage downstream)
          Creates a JavaCC tokenizer that will not send punctuation to the downstream stage.
JCCTokenizer(Stage downstream, boolean sendPunct)
          Creates a JavaCC tokenizer.
 
Method Summary
 void disable_tracing()
           
 void enable_tracing()
           
 void flush()
          Flushes any collected tokens.
 ParseException generateParseException()
           
 Token getNextToken()
           
 Token getToken(int index)
           
 Tokenizer getTokenizer(Stage s, boolean sp)
          Gets a tokenizer that we can use in the query parser.
 void handleLongChar(char c, int b, int l)
          Handles a character that takes up more than one character in a file.
 void newProperties(com.sun.labs.util.props.PropertySheet ps)
           
 boolean next()
          End of autogenerated rules.
 void ReInit(java.io.InputStream stream)
           
 void ReInit(java.io.InputStream stream, java.lang.String encoding)
           
 void ReInit(JCCTokenizerTokenManager tm)
           
 void ReInit(java.io.Reader stream)
           
 void send()
          Sends the built up token, if there is one.
protected  void sendToken(java.lang.String t, int type)
           
 void setNoBreakChars(java.lang.String nbcPattern)
           
 void text(char[] text, int b, int e)
          Tokenize the given text.
 
Methods inherited from class com.sun.labs.minion.document.tokenizer.Tokenizer
dump, endDocument, endField, getPos, handleFieldData, reset, reset, shutdown, startDocument, startField
 
Methods inherited from class com.sun.labs.minion.pipeline.StageAdapter
defineField, getDownstream, getName, punctuation, savedData, setDownstream, setName, token
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

reader

protected CharArrayReader reader
A reusable reader for the characters that we'll be passed.


buildUp

protected java.lang.StringBuilder buildUp
A place to build up strings across tokens, if we need to.


isNgram

protected boolean isNgram
Whether the data built up so far is for an n-gram tokenized language.
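
As a rough illustration of what n-gram tokenization produces, the sketch below emits overlapping character n-grams from a run of text, an approach commonly used for languages written without spaces between words. The class NgramSketch, the choice of n = 2 in the demo, and the handling of runs shorter than n are assumptions made for illustration; none of them is taken from Minion.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of n-gram tokenization; not Minion code. */
final class NgramSketch {

    /**
     * Returns the overlapping character n-grams of a run of text.
     * A non-empty run shorter than n is returned whole (an assumption
     * of this sketch, not documented Minion behavior).
     */
    static List<String> ngrams(String run, int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be positive");
        }
        List<String> out = new ArrayList<>();
        if (run.isEmpty()) {
            return out;
        }
        if (run.length() < n) {
            out.add(run);
            return out;
        }
        // Slide a window of width n across the run, one character at a time.
        for (int i = 0; i + n <= run.length(); i++) {
            out.add(run.substring(i, i + n));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(ngrams("ABCD", 2)); // prints [AB, BC, CD]
    }
}
```

The sketch works on Java char values, so it would split a supplementary character (a surrogate pair) in two; a real tokenizer would need to handle that case.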


PROP_NO_BREAK_CHARS

@ConfigString(defaultValue="")
protected static java.lang.String PROP_NO_BREAK_CHARS

noBreakChars

protected java.util.regex.Pattern noBreakChars
A regular expression pattern of characters for which we should not break tokens.
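
How the pattern is applied is not described on this page. The sketch below shows one plausible reading, assuming the pattern is matched against a single character and that a match keeps the character inside the current token instead of ending it. NoBreakSketch and its tokenize method are hypothetical and use only java.util.regex; they are not part of Minion.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical illustration of "no break" characters; not Minion code. */
final class NoBreakSketch {

    /**
     * Splits text at every character that is neither a letter nor a digit,
     * unless that character matches noBreakChars (assumed semantics).
     * A null pattern means no characters are exempt from breaking.
     */
    static List<String> tokenize(String text, Pattern noBreakChars) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean keep = Character.isLetterOrDigit(c)
                    || (noBreakChars != null
                        && noBreakChars.matcher(String.valueOf(c)).matches());
            if (keep) {
                current.append(c);
            } else if (current.length() > 0) {
                // A breaking character ends the token in progress.
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "state-of-the-art foo_bar";
        System.out.println(tokenize(text, null));
        // prints [state, of, the, art, foo, bar]
        System.out.println(tokenize(text, Pattern.compile("[-_]")));
        // prints [state-of-the-art, foo_bar]
    }
}
```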


logTag

protected static java.lang.String logTag

token_source

public JCCTokenizerTokenManager token_source

token

public Token token

jj_nt

public Token jj_nt

Constructor Detail

JCCTokenizer

public JCCTokenizer()

JCCTokenizer

public JCCTokenizer(Stage downstream)
Creates a JavaCC tokenizer that will not send punctuation to the downstream stage.

Parameters:
downstream - the stage downstream of the tokenizer.

JCCTokenizer

public JCCTokenizer(Stage downstream,
                    boolean sendPunct)
Creates a JavaCC tokenizer.

Parameters:
downstream - the stage downstream of the tokenizer.
sendPunct - if true, punctuation and whitespace will be passed to the downstream stage.

JCCTokenizer

public JCCTokenizer(java.io.InputStream stream)

JCCTokenizer

public JCCTokenizer(java.io.InputStream stream,
                    java.lang.String encoding)

JCCTokenizer

public JCCTokenizer(java.io.Reader stream)

JCCTokenizer

public JCCTokenizer(JCCTokenizerTokenManager tm)

Method Detail

text

public void text(char[] text,
                 int b,
                 int e)
Description copied from class: Tokenizer
Tokenize the given text. Output tokens will be placed on the output pipe.

Specified by:
text in interface Stage
Specified by:
text in interface PipelineStage
Specified by:
text in class Tokenizer
Parameters:
text - The text to tokenize.
b - The beginning position in the text buffer.
e - The ending position in the text buffer.
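
To make the buffer-range contract concrete, the sketch below tokenizes a region of a character buffer on whitespace and hands each token to a downstream consumer. It is not the JavaCC grammar this class uses. TextRangeSketch and TokenSink are hypothetical stand-ins, and treating e as an exclusive end position is an assumption, since this page does not say whether it is inclusive.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of tokenizing a buffer range; not Minion code. */
final class TextRangeSketch {

    /** Stand-in for a downstream pipeline stage (hypothetical). */
    interface TokenSink {
        void token(String value, int start, int end);
    }

    /**
     * Tokenizes text[b, e) on whitespace and passes each token downstream
     * together with its offsets in the buffer. Treating e as exclusive is
     * an assumption of this sketch.
     */
    static void text(char[] text, int b, int e, TokenSink downstream) {
        int start = -1; // start offset of the token in progress, or -1 if none
        for (int i = b; i < e; i++) {
            if (Character.isWhitespace(text[i])) {
                if (start >= 0) {
                    downstream.token(new String(text, start, i - start), start, i);
                    start = -1;
                }
            } else if (start < 0) {
                start = i;
            }
        }
        // Emit a token that runs up to the end of the range.
        if (start >= 0) {
            downstream.token(new String(text, start, e - start), start, e);
        }
    }

    /** Convenience wrapper that collects only the token strings. */
    static List<String> collect(char[] text, int b, int e) {
        List<String> out = new ArrayList<>();
        text(text, b, e, (value, start, end) -> out.add(value));
        return out;
    }

    public static void main(String[] args) {
        char[] buf = "xx hello big world yy".toCharArray();
        System.out.println(collect(buf, 3, 18)); // prints [hello, big, world]
    }
}
```

Passing offsets instead of copying the sub-range lets a caller reuse one large buffer across calls, which matches the reusable reader field described above.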

handleLongChar

public void handleLongChar(char c,
                           int b,
                           int l)
Description copied from class: Tokenizer
Handles a character that takes up more than one character in a file, for example a character entity in an HTML file.

Specified by:
handleLongChar in class Tokenizer
Parameters:
c - The character
b - The beginning position of the character in the document.
l - The length of the character in the document.
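
The relationship between c, b, and l can be illustrated with HTML entities: the entity &amp; decodes to the single character '&' but occupies five characters in the document. The sketch below scans a string for three entities and reports the values a caller might pass for each. LongCharSketch, its LongChar holder, and the tiny entity table are hypothetical; how Minion's own document parsers detect such characters is not described here.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of "long" characters; not Minion code. */
final class LongCharSketch {

    /** One reported character: c at document offset b, spanning l characters. */
    static final class LongChar {
        final char c;
        final int b;
        final int l;

        LongChar(char c, int b, int l) {
            this.c = c;
            this.b = b;
            this.l = l;
        }
    }

    /** Decodes a deliberately tiny set of entities; returns null if unknown. */
    private static Character decode(String entity) {
        switch (entity) {
            case "&amp;": return '&';
            case "&lt;":  return '<';
            case "&gt;":  return '>';
            default:      return null;
        }
    }

    /** Finds known entities and records the (c, b, l) triple for each. */
    static List<LongChar> findLongChars(String doc) {
        List<LongChar> found = new ArrayList<>();
        int i = 0;
        while (i < doc.length()) {
            int semi = (doc.charAt(i) == '&') ? doc.indexOf(';', i) : -1;
            Character decoded = (semi < 0) ? null : decode(doc.substring(i, semi + 1));
            if (decoded != null) {
                int l = semi + 1 - i; // length of the entity in the document
                found.add(new LongChar(decoded, i, l));
                i += l;
            } else {
                i++;
            }
        }
        return found;
    }

    public static void main(String[] args) {
        for (LongChar lc : findLongChars("a &lt; b &amp;&amp; c")) {
            System.out.println(lc.c + " b=" + lc.b + " l=" + lc.l);
        }
        // prints:
        // < b=2 l=4
        // & b=9 l=5
        // & b=14 l=5
    }
}
```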

getTokenizer

public Tokenizer getTokenizer(Stage s,
                              boolean sp)
Description copied from class: Tokenizer
Gets a tokenizer that we can use in the query parser.

Specified by:
getTokenizer in class Tokenizer

flush

public void flush()
Description copied from class: Tokenizer
Flushes any collected tokens.

Specified by:
flush in class Tokenizer

sendToken

protected void sendToken(java.lang.String t,
                         int type)

send

public void send()
Sends the built up token, if there is one.
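
The build-up-then-send pattern suggested by the buildUp field and this method can be sketched as follows. BuildUpSketch is a hypothetical stand-in that collects sent tokens in a list instead of passing them to a downstream stage. The sketch assumes that send() does nothing when nothing has been built up and that it clears the buffer afterwards.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of building up and sending a token; not Minion code. */
final class BuildUpSketch {

    /** Accumulates pieces that belong to one logical token. */
    private final StringBuilder buildUp = new StringBuilder();

    /** Stand-in for the downstream stage: records what was sent. */
    private final List<String> sent = new ArrayList<>();

    /** Appends one piece to the token in progress. */
    void append(String piece) {
        buildUp.append(piece);
    }

    /** Sends the built-up token if there is one, then resets the buffer. */
    void send() {
        if (buildUp.length() == 0) {
            return; // nothing built up, so nothing to send
        }
        sent.add(buildUp.toString());
        buildUp.setLength(0);
    }

    List<String> sent() {
        return sent;
    }

    public static void main(String[] args) {
        BuildUpSketch s = new BuildUpSketch();
        s.append("foo");
        s.append("-");
        s.append("bar");
        s.send();
        s.send(); // no effect: the buffer is already empty
        System.out.println(s.sent()); // prints [foo-bar]
    }
}
```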


setNoBreakChars

public void setNoBreakChars(java.lang.String nbcPattern)

newProperties

public void newProperties(com.sun.labs.util.props.PropertySheet ps)
                   throws com.sun.labs.util.props.PropertyException
Specified by:
newProperties in interface com.sun.labs.util.props.Configurable
Overrides:
newProperties in class Tokenizer
Throws:
com.sun.labs.util.props.PropertyException

next

public final boolean next()
                   throws ParseException
End of autogenerated rules. Have a nice day.

Throws:
ParseException

ReInit

public void ReInit(java.io.InputStream stream)

ReInit

public void ReInit(java.io.InputStream stream,
                   java.lang.String encoding)

ReInit

public void ReInit(java.io.Reader stream)

ReInit

public void ReInit(JCCTokenizerTokenManager tm)

getNextToken

public final Token getNextToken()

getToken

public final Token getToken(int index)

generateParseException

public ParseException generateParseException()

enable_tracing

public final void enable_tracing()

disable_tracing

public final void disable_tracing()