JCCTokenizer (Minion Search Engine)

com.sun.labs.minion.document.tokenizer
Class JCCTokenizer

java.lang.Object
  com.sun.labs.minion.pipeline.StageAdapter
      com.sun.labs.minion.document.tokenizer.Tokenizer
          com.sun.labs.minion.document.tokenizer.JCCTokenizer

All Implemented Interfaces: JCCTokenizerConstants, Stage, PipelineStage, com.sun.labs.util.props.Component, com.sun.labs.util.props.Configurable

public class JCCTokenizer
extends Tokenizer
implements JCCTokenizerConstants

Field Summary
`protected java.lang.StringBuilder`	`buildUp` A place to build up strings across tokens, if we need to.
`protected boolean`	`isNgram` Is the data that we've built up for an ngram tokenized language?
`Token`	`jj_nt`
`protected static java.lang.String`	`logTag`
`protected java.util.regex.Pattern`	`noBreakChars` A regular expression pattern of characters for which we should not break tokens.
`protected static java.lang.String`	`PROP_NO_BREAK_CHARS`
`protected CharArrayReader`	`reader` A reusable reader for the characters that we'll be passed.
`Token`	`token`
`JCCTokenizerTokenManager`	`token_source`

Fields inherited from class com.sun.labs.minion.document.tokenizer.Tokenizer
`dataSaved, indexed, logger, makeTokens, maxTokLen, pos, PROP_SEND_PUNCT, PROP_SEND_WHITE, saveData, savedData, savedLen, sendPunct, sendWhite, trimSpaces, wordNum`

Fields inherited from class com.sun.labs.minion.pipeline.StageAdapter
`downstream, name`

Fields inherited from interface com.sun.labs.minion.document.tokenizer.JCCTokenizerConstants
`DEFAULT, EOF, NGRAMTOKEN, NONSPACESEPCHAR, PUNCTUATION, SPACESEPCHAR, SPACESEPCHAR1, SPACESEPCHAR2, SPACESEPCHAR3, SPACESEPCHAR4, SPACESEPCHAR5, SPACESEPCHAR6, SPACESEPCHAR7, SPACESEPCHAR8, SPACESEPCHAR9, SPACESEPTOKEN, tokenImage, WHITECHAR, WHITESPACE`

Constructor Summary
`JCCTokenizer()`
`JCCTokenizer(java.io.InputStream stream)`
`JCCTokenizer(java.io.InputStream stream, java.lang.String encoding)`
`JCCTokenizer(JCCTokenizerTokenManager tm)`
`JCCTokenizer(java.io.Reader stream)`
`JCCTokenizer(Stage downstream)` Creates a JavaCC tokenizer that will not send punctuation to the downstream stage.
`JCCTokenizer(Stage downstream, boolean sendPunct)` Creates a JavaCC tokenizer.

Method Summary
`void`	`disable_tracing()`
`void`	`enable_tracing()`
`void`	`flush()` Flushes any collected tokens.
`ParseException`	`generateParseException()`
`Token`	`getNextToken()`
`Token`	`getToken(int index)`
`Tokenizer`	`getTokenizer(Stage s, boolean sp)` Gets a tokenizer that we can use in the query parser.
`void`	`handleLongChar(char c, int b, int l)` Handles a character that takes up more than one character in a file.
`void`	`newProperties(com.sun.labs.util.props.PropertySheet ps)`
`boolean`	`next()` End of autogenerated rules.
`void`	`ReInit(java.io.InputStream stream)`
`void`	`ReInit(java.io.InputStream stream, java.lang.String encoding)`
`void`	`ReInit(JCCTokenizerTokenManager tm)`
`void`	`ReInit(java.io.Reader stream)`
`void`	`send()` Sends the built up token, if there is one.
`protected void`	`sendToken(java.lang.String t, int type)`
`void`	`setNoBreakChars(java.lang.String nbcPattern)`
`void`	`text(char[] text, int b, int e)` Tokenize the given text.

Methods inherited from class com.sun.labs.minion.document.tokenizer.Tokenizer
`dump, endDocument, endField, getPos, handleFieldData, reset, reset, shutdown, startDocument, startField`

Methods inherited from class com.sun.labs.minion.pipeline.StageAdapter
`defineField, getDownstream, getName, punctuation, savedData, setDownstream, setName, token`

Methods inherited from class java.lang.Object
`clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait`

Field Detail

reader

protected CharArrayReader reader

A reusable reader for the characters that we'll be passed.

buildUp

protected java.lang.StringBuilder buildUp

A place to build up strings across tokens, if we need to.

isNgram

protected boolean isNgram

Is the data that we've built up for an ngram tokenized language?

PROP_NO_BREAK_CHARS

@ConfigString(defaultValue="")
protected static java.lang.String PROP_NO_BREAK_CHARS

noBreakChars

protected java.util.regex.Pattern noBreakChars

A regular expression pattern of characters for which we should not break tokens.
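To make the role of a no-break pattern concrete, here is a minimal sketch. It is not Minion's grammar or the actual JCCTokenizer logic: the class `NoBreakSketch` and its `tokenize` method are hypothetical, and the rule "break on anything that is not a letter or digit" is an assumption chosen only to show how characters matched by such a pattern could stay inside a token.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration only; not part of the Minion API.
final class NoBreakSketch {

    /**
     * Splits text at every character that is not a letter or digit,
     * unless that character matches the noBreak pattern (may be null).
     */
    static List<String> tokenize(String text, Pattern noBreak) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean keep = Character.isLetterOrDigit(c)
                    || (noBreak != null
                        && noBreak.matcher(String.valueOf(c)).matches());
            if (keep) {
                current.append(c);
            } else if (current.length() > 0) {
                // A breaking character ends the current token.
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Without a pattern, '-' and '_' break tokens.
        System.out.println(tokenize("e-mail user_name", null));
        // prints [e, mail, user, name]

        // With a pattern, '-' and '_' are kept inside tokens.
        System.out.println(tokenize("e-mail user_name", Pattern.compile("[-_]")));
        // prints [e-mail, user_name]
    }
}
```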

logTag

protected static java.lang.String logTag

token_source

public JCCTokenizerTokenManager token_source

token

public Token token

jj_nt

public Token jj_nt

Constructor Detail

JCCTokenizer

public JCCTokenizer()

JCCTokenizer

public JCCTokenizer(Stage downstream)

Creates a JavaCC tokenizer that will not send punctuation to the downstream stage.

Parameters:
    downstream - the stage downstream of the tokenizer.

JCCTokenizer

public JCCTokenizer(Stage downstream,
                    boolean sendPunct)

Creates a JavaCC tokenizer.

Parameters:
    downstream - the stage downstream of the tokenizer.
    sendPunct - if true, punctuation and whitespace will be passed to the downstream stage.
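The effect of a flag like sendPunct can be sketched as follows. This is an illustration under stated assumptions, not the real pipeline: the `Downstream` interface is a hypothetical stand-in for a downstream stage (its method names echo the `token` and `punctuation` methods listed for StageAdapter, but their real signatures are not shown on this page), and the splitting rule is simplified to letters and digits versus everything else.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of the Minion API.
final class SendPunctSketch {

    /** Hypothetical downstream stage; not Minion's Stage interface. */
    interface Downstream {
        void token(String t);
        void punctuation(String p);
    }

    /** Sends words downstream; sends other characters only if sendPunct is true. */
    static void tokenize(String text, Downstream downstream, boolean sendPunct) {
        StringBuilder word = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetterOrDigit(c)) {
                word.append(c);
                continue;
            }
            if (word.length() > 0) {
                downstream.token(word.toString());
                word.setLength(0);
            }
            if (sendPunct) {
                // Punctuation and whitespace are forwarded only on request.
                downstream.punctuation(String.valueOf(c));
            }
        }
        if (word.length() > 0) {
            downstream.token(word.toString());
        }
    }

    /** Records the events a downstream stage would receive. */
    static List<String> run(String text, boolean sendPunct) {
        List<String> events = new ArrayList<>();
        tokenize(text, new Downstream() {
            public void token(String t) { events.add("T:" + t); }
            public void punctuation(String p) { events.add("P:" + p); }
        }, sendPunct);
        return events;
    }

    public static void main(String[] args) {
        System.out.println(run("Hi, you", false)); // prints [T:Hi, T:you]
        System.out.println(run("Hi, you", true));  // prints [T:Hi, P:,, P: , T:you]
    }
}
```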

JCCTokenizer

public JCCTokenizer(java.io.InputStream stream)

JCCTokenizer

public JCCTokenizer(java.io.InputStream stream,
                    java.lang.String encoding)

JCCTokenizer

public JCCTokenizer(java.io.Reader stream)

JCCTokenizer

public JCCTokenizer(JCCTokenizerTokenManager tm)

Method Detail

text

public void text(char[] text,
                 int b,
                 int e)

Description copied from class: Tokenizer

Tokenize the given text. Output tokens will be placed on the output pipe.

Specified by: text in interface Stage
Specified by: text in interface PipelineStage
Specified by: text in class Tokenizer

Parameters:
    text - The text to tokenize.
    b - The beginning position in the text buffer.
    e - The ending position in the text buffer.
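A caller passes a character buffer plus a begin and end position rather than a String. The sketch below shows one way such a range could be read through a reader. It makes two assumptions that this page does not confirm: that `e` is exclusive, and that `java.io.CharArrayReader` behaves like the `CharArrayReader` type of the `reader` field (which may be a Minion class). The class `TextRangeSketch` is hypothetical.

```java
import java.io.CharArrayReader;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical illustration only; not part of the Minion API.
final class TextRangeSketch {

    /** Reads the characters in [b, e) from the buffer through a reader. */
    static String readRange(char[] text, int b, int e) {
        // Assumption: e is exclusive, so the range length is e - b.
        // java.io.CharArrayReader takes (buffer, offset, length).
        try (CharArrayReader reader = new CharArrayReader(text, b, e - b)) {
            StringBuilder sb = new StringBuilder();
            int c;
            while ((c = reader.read()) != -1) {
                sb.append((char) c);
            }
            return sb.toString();
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }

    public static void main(String[] args) {
        char[] buffer = "hello world".toCharArray();
        System.out.println(readRange(buffer, 6, 11)); // prints world
    }
}
```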

handleLongChar

public void handleLongChar(char c,
                           int b,
                           int l)

Description copied from class: Tokenizer

Handles a character that takes up more than one character in a file. For example, a character entity in an HTML file.

Specified by: handleLongChar in class Tokenizer

Parameters:
    c - The character
    b - The beginning position of the character in the document.
    l - The length of the character in the document.
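As an illustration of the (c, b, l) triple, the sketch below scans a string for three HTML entities and reports the decoded character together with where the entity starts and how many characters it occupies in the source. The `LongCharHandler` interface and the `scan` method are hypothetical; how Minion's document parsers actually detect entities is not described on this page.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of the Minion API.
final class LongCharSketch {

    /** Hypothetical callback mirroring the (c, b, l) parameters documented above. */
    interface LongCharHandler {
        void handleLongChar(char c, int b, int l);
    }

    /** Reports each occurrence of "&amp;", "&lt;" or "&gt;" in the text. */
    static void scan(String text, LongCharHandler handler) {
        String[] entities = {"&amp;", "&lt;", "&gt;"};
        char[] decoded = {'&', '<', '>'};
        int i = 0;
        while (i < text.length()) {
            boolean matched = false;
            for (int k = 0; k < entities.length; k++) {
                if (text.startsWith(entities[k], i)) {
                    // One logical character, several characters in the file.
                    handler.handleLongChar(decoded[k], i, entities[k].length());
                    i += entities[k].length();
                    matched = true;
                    break;
                }
            }
            if (!matched) {
                i++;
            }
        }
    }

    /** Formats each report as character@begin+length. */
    static List<String> describe(String text) {
        List<String> out = new ArrayList<>();
        scan(text, (c, b, l) -> out.add(c + "@" + b + "+" + l));
        return out;
    }

    public static void main(String[] args) {
        System.out.println(describe("a&amp;b &lt;")); // prints [&@1+5, <@8+4]
    }
}
```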

getTokenizer

public Tokenizer getTokenizer(Stage s,
                              boolean sp)

Description copied from class: Tokenizer

Gets a tokenizer that we can use in the query parser.

Specified by: getTokenizer in class Tokenizer

flush

public void flush()

Description copied from class: Tokenizer

Flushes any collected tokens.

Specified by: flush in class Tokenizer

sendToken

protected void sendToken(java.lang.String t,
                         int type)

send

public void send()

Sends the built up token, if there is one.
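The "if there is one" condition suggests a buffer that is emitted only when non-empty and then cleared. A minimal sketch of that pattern follows, assuming a StringBuilder buffer like the `buildUp` field; the class `BuildUpSketch`, its `append` method, and the list standing in for the downstream stage are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of the Minion API.
final class BuildUpSketch {
    private final StringBuilder buildUp = new StringBuilder();
    // Stand-in for the downstream stage: just records what was sent.
    private final List<String> sent = new ArrayList<>();

    /** Accumulates a piece of a token that spans several pieces. */
    void append(String piece) {
        buildUp.append(piece);
    }

    /** Sends the built-up token if there is one, then clears the buffer. */
    void send() {
        if (buildUp.length() == 0) {
            return; // nothing collected, nothing to send
        }
        sent.add(buildUp.toString());
        buildUp.setLength(0);
    }

    List<String> sent() {
        return sent;
    }

    public static void main(String[] args) {
        BuildUpSketch s = new BuildUpSketch();
        s.append("to");
        s.append("ken");
        s.send();
        s.send(); // no effect: the buffer is already empty
        System.out.println(s.sent()); // prints [token]
    }
}
```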

setNoBreakChars

public void setNoBreakChars(java.lang.String nbcPattern)

newProperties

public void newProperties(com.sun.labs.util.props.PropertySheet ps)
                   throws com.sun.labs.util.props.PropertyException

Specified by: newProperties in interface com.sun.labs.util.props.Configurable
Overrides: newProperties in class Tokenizer

Throws: com.sun.labs.util.props.PropertyException

next

public final boolean next()
                   throws ParseException

End of autogenerated rules. Have a nice day.

Throws: ParseException

ReInit

public void ReInit(java.io.InputStream stream)

ReInit

public void ReInit(java.io.InputStream stream,
                   java.lang.String encoding)

ReInit

public void ReInit(java.io.Reader stream)

ReInit

public void ReInit(JCCTokenizerTokenManager tm)

getNextToken

public final Token getNextToken()

getToken

public final Token getToken(int index)

generateParseException

public ParseException generateParseException()

enable_tracing

public final void enable_tracing()

disable_tracing

public final void disable_tracing()
