Rocchio (Minion Search Engine)

com.sun.labs.minion.classification
Class Rocchio

java.lang.Object
  com.sun.labs.minion.classification.Rocchio

All Implemented Interfaces: BulkClassifier, ClassifierModel, ExplainableClassifierModel

public class Rocchio
extends java.lang.Object
implements ClassifierModel, BulkClassifier, ExplainableClassifierModel

A classifier model that does Rocchio-style classification.

Nested Class Summary
`protected class`	`Rocchio.FQR` A class to collate and hold the results of a feedback query.
`protected class`	`Rocchio.HE` A class to hold a single element of the heap that we'll use to negotiate the results of queries.

Field Summary
`protected static float[]`	`alpha` Values of the beta and gamma parameters to try.
`protected float`	`ba`
`protected float`	`bb`
`protected static float[]`	`beta`
`protected float`	`bg`
`protected FeatureClusterSet`	`clusters` A set of feature clusters.
`protected SearchEngine`	`e` The engine that this classifier is part of.
`protected FeatureClusterSet`	`features` The features that we will use for our model.
`protected java.lang.String`	`fieldName` The name of the field into which our classification results will go.
`protected java.text.DecimalFormat`	`form`
`protected java.lang.String`	`fromField` The name of the vectored field whose contents were used to train the classifier.
`protected static float[]`	`gamma`
`protected static java.lang.String`	`logTag`
`protected PartitionManager`	`manager` A manager for the partitions we're classifying against.
`protected static int[]`	`rankCutoff` A set of rank cutoffs to use for dynamic query zoning.
`protected java.util.Map<DiskPartition,TermCache>`	`termCaches` A term cache to use when building classifiers.
`protected java.util.Map<java.lang.String,TermStatsImpl>`	`termStats`
`protected float`	`threshold` The similarity threshold for our classifier.

Constructor Summary
`Rocchio()`

Method Summary
`float`	`checkThreshold(float score)`
`float[]`	`classify(DiskPartition sdp)` Classifies a set of documents.
`float[][]`	`classify(java.lang.String fromField, ClassifierDiskPartition cdp, DiskPartition sdp)` Evaluates all of the classifiers in the given classifier disk partition against all of the new documents in the given disk partition.
`java.lang.String`	`describe()` Describes the classifier model.
`void`	`dump(java.io.RandomAccessFile raf)` Dumps any classifier specific data to the given file.
`java.util.List<WeightedFeature>`	`explain(java.lang.String key)` Explains the score that a given document would get for this classifier.
`java.lang.String`	`explain(java.lang.String key, boolean includeDocTerms)` Explains why (or why not) the document with the given key would (or would not) be classified into this class.
`ResultSet`	`findSimilar()`
`ResultSet`	`findSimilar(java.lang.String fromField)` Finds the documents that are most similar to this classifier, whether they are in the class or not.
`Feature`	`getFeature()` Gets a single feature of the type that this classifier model uses.
`FeatureClusterSet`	`getFeatures()` Gets the features that this classifier model will be using for classification.
`java.lang.String`	`getFieldName()` Gets the field name where the results of this classifier will be stored.
`java.lang.String`	`getFromField()`
`java.lang.String`	`getModelName()` Gets the name of the model.
`float`	`getThreshold()`
`ClassifierModel`	`newInstance()` Creates a new instance of this classifier model.
`protected void`	`nextStep(Progress p, java.lang.String str)`
`void`	`read(java.io.RandomAccessFile raf)` Reads any classifier specific data from the given file.
`protected Rocchio.FQR`	`runFeedback(FeatureClusterSet cwFeatures, WeightedFeatureVector opt, java.util.List queryZone, int nRel, WeightingFunction wf, WeightingComponents wc)` Runs a feedback query with the current estimate of the optimal query.
`void`	`setEngine(SearchEngine e)` Sets the search engine that this classifier is part of.
`void`	`setFeatures(FeatureClusterSet f)` Sets the feature clusters that the classifier model will use for classification.
`void`	`setFieldName(java.lang.String fieldName)` Sets the name of the field where the results of this classifier will be stored.
`void`	`setFromField(java.lang.String fromField)` Sets the name of the field from which the classifier was built, since we'll want to classify against terms only from that field.
`void`	`setModelName(java.lang.String modelName)` Sets the name of the model.
`float`	`similarity(ClassifierModel cm)` Computes the similarity between this classifier model and another.
`float`	`similarity(DocumentVector v)` Computes the similarity of the given document vector and the classifier.
`float`	`similarity(java.lang.String key)` Computes the similarity of the given document and the classifier.
`void`	`train(java.lang.String name, java.lang.String fieldName, PartitionManager manager, ResultSetImpl training, FeatureClusterSet selectedFeatures, java.util.Map<java.lang.String,TermStatsImpl> termStats, java.util.Map<DiskPartition,TermCache> termCaches, Progress progress)` Trains the classifier on a set of documents.

Methods inherited from class java.lang.Object
`clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait`

Field Detail

e

protected SearchEngine e

The engine that this classifier is part of.

termCaches

protected java.util.Map<DiskPartition,TermCache> termCaches

A term cache to use when building classifiers.

termStats

protected java.util.Map<java.lang.String,TermStatsImpl> termStats

features

protected FeatureClusterSet features

The features that we will use for our model.

clusters

protected FeatureClusterSet clusters

A set of feature clusters.

threshold

protected float threshold

The similarity threshold for our classifier.

ba

protected float ba

bb

protected float bb

bg

protected float bg

manager

protected PartitionManager manager

A manager for the partitions we're classifying against.

fieldName

protected java.lang.String fieldName

The name of the field into which our classification results will go.

fromField

protected java.lang.String fromField

The name of the vectored field whose contents were used to train the classifier.

rankCutoff

protected static int[] rankCutoff

A set of rank cutoffs to use for dynamic query zoning.

alpha

protected static float[] alpha

Values of the beta and gamma parameters to try.

beta

protected static float[] beta

gamma

protected static float[] gamma

logTag

protected static java.lang.String logTag

form

protected java.text.DecimalFormat form

Constructor Detail

Rocchio

public Rocchio()

Method Detail

setModelName

public void setModelName(java.lang.String modelName)

Description copied from interface: ClassifierModel

Sets the name of the model.

Specified by:
setModelName in interface ClassifierModel

getModelName

public java.lang.String getModelName()

Description copied from interface: ClassifierModel

Gets the name of the model.

Specified by:
getModelName in interface ClassifierModel

getThreshold

public float getThreshold()

train

public void train(java.lang.String name,
                  java.lang.String fieldName,
                  PartitionManager manager,
                  ResultSetImpl training,
                  FeatureClusterSet selectedFeatures,
                  java.util.Map<java.lang.String,TermStatsImpl> termStats,
                  java.util.Map<DiskPartition,TermCache> termCaches,
                  Progress progress)
           throws SearchEngineException

Trains the classifier on a set of documents. Training a Rocchio classifier consists of the following steps:

1. Select the features upon which to base the classifier.
2. Build a result set by computing the OR of the selected features.
3. Take the top R documents from this result set, where R is the size of the document set given for training purposes. This is the query zone.
4. Create two document vectors, rel and nonrel.
5. For each document d in the query zone:
   a. If d is in the training set, add d to rel.
   b. If d is not in the training set, add d to nonrel.
6. Compute the feedback query as beta/R * rel - gamma/(N-R) * nonrel, where R is the number of training documents.
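
The feedback-query formula above can be illustrated with a minimal sketch. It is not the Minion implementation: the class and method names are hypothetical, documents are modelled as plain term-to-weight maps, and the size of the nonrel list is assumed to play the role of N-R in the formula.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not part of the Minion API.
final class RocchioFeedbackSketch {

    /** Sums sparse term-weight vectors into a single vector. */
    static Map<String, Float> sum(List<Map<String, Float>> docs) {
        Map<String, Float> total = new HashMap<>();
        for (Map<String, Float> doc : docs) {
            for (Map.Entry<String, Float> e : doc.entrySet()) {
                total.merge(e.getKey(), e.getValue(), Float::sum);
            }
        }
        return total;
    }

    /**
     * Computes beta/|rel| * sum(rel) - gamma/|nonrel| * sum(nonrel).
     * Assumption: |rel| stands in for R and |nonrel| for (N - R).
     */
    static Map<String, Float> feedbackQuery(List<Map<String, Float>> rel,
                                            List<Map<String, Float>> nonrel,
                                            float beta, float gamma) {
        Map<String, Float> query = new HashMap<>();
        if (!rel.isEmpty()) {
            float scale = beta / rel.size();
            // Add the scaled centroid of the relevant documents.
            sum(rel).forEach((term, w) -> query.merge(term, scale * w, Float::sum));
        }
        if (!nonrel.isEmpty()) {
            float scale = gamma / nonrel.size();
            // Subtract the scaled centroid of the non-relevant documents.
            sum(nonrel).forEach((term, w) -> query.merge(term, -scale * w, Float::sum));
        }
        return query;
    }

    public static void main(String[] args) {
        List<Map<String, Float>> rel = List.of(Map.of("a", 2f, "b", 4f), Map.of("a", 4f));
        List<Map<String, Float>> nonrel = List.of(Map.of("b", 2f));
        System.out.println(feedbackQuery(rel, nonrel, 1f, 0.5f));
    }
}
```

Terms that occur mostly in non-relevant documents end up with negative weights, which is what pushes such documents away from the class at classification time.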

Specified by:
train in interface ClassifierModel

Parameters:
name - The name of the class, as specified by the application.
fieldName - the name of the field where the results of this classifier will be stored
manager - the manager for the partitions against which we're training
training - A set of results containing the training documents for the class.
selectedFeatures - the set of features to use when training this classifier
termStats - A map from names to term statistics for the feature clusters. This map will be populated with all of the elements of fcs when this method is called.
termCaches - A map from partitions to term caches containing the uncompressed postings for the feature clusters in fcs. The caches will be fully populated with the clusters from fcs when this method is called.
Throws:
SearchEngineException - if there is any problem training the classifier.

setEngine

public void setEngine(SearchEngine e)

Description copied from interface: ClassifierModel

Sets the search engine that this classifier is part of.

Specified by:
setEngine in interface ClassifierModel

checkThreshold

public float checkThreshold(float score)

similarity

public float similarity(java.lang.String key)

Description copied from interface: ClassifierModel

Computes the similarity of the given document and the classifier.

Specified by:
similarity in interface ClassifierModel

Parameters:
key - the key of the document for which we wish to compute similarity
Returns:
the similarity between the document and this classifier. The absolute value of the return value indicates the degree of similarity. A return value that is greater than 0 should indicate that the given document would be classified into this class. A return value less than 0 should indicate that the given document would not be classified into this class.
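
As a minimal sketch of how a caller might interpret this return value (the helper names are hypothetical, and treating a score of exactly 0 as "not in the class" is an assumption, since the contract above only covers values greater than or less than 0):

```java
// Hypothetical helpers for reading a similarity score; not part of the Minion API.
final class SimilarityScoreSketch {

    /** True when the sign of the score says the document belongs to the class. */
    static boolean inClass(float score) {
        // Assumption: a score of exactly 0 is treated as "not in the class".
        return score > 0f;
    }

    /** Degree of similarity, independent of the classification decision. */
    static float degree(float score) {
        return Math.abs(score);
    }

    public static void main(String[] args) {
        float score = -0.42f;
        System.out.println("in class: " + inClass(score) + ", degree: " + degree(score));
    }
}
```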

similarity

public float similarity(DocumentVector v)

Description copied from interface: ClassifierModel

Computes the similarity of the given document vector and the classifier.

Specified by:
similarity in interface ClassifierModel

Parameters:
v - the document vector with which we want to calculate similarity
Returns:
the similarity between the document and this classifier. The absolute value of the return value indicates the degree of similarity. A return value that is greater than 0 should indicate that the given document would be classified into this class. A return value less than 0 should indicate that the given document would not be classified into this class.

similarity

public float similarity(ClassifierModel cm)

Description copied from interface: ClassifierModel

Computes the similarity between this classifier model and another.

Specified by:
similarity in interface ClassifierModel

Parameters:
cm - the model we want to compute the similarity to
Returns:
the similarity between this classifier model and the other

explain

public java.lang.String explain(java.lang.String key, boolean includeDocTerms)

Description copied from interface: ExplainableClassifierModel

Explains why (or why not) the document with the given key would (or would not) be classified into this class.

Specified by:
explain in interface ExplainableClassifierModel

Parameters:
key - the key of the document whose classification we want to explain
includeDocTerms - if true, the explanation will include a description of the terms from the document.
Returns:
the explanation as a string.

explain

public java.util.List<WeightedFeature> explain(java.lang.String key)

Description copied from interface: ExplainableClassifierModel

Explains the score that a given document would get for this classifier.

Specified by:
explain in interface ExplainableClassifierModel

Parameters:
key - the key of the document that we want to explain
Returns:
a list of features that contributed to the score. The weight associated with each feature is the proportion of the overall score contributed by that feature. The list is ordered from greatest to least contribution proportion.
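
A minimal sketch of how such a list could be assembled, assuming the score is a sum of per-feature products of classifier weight and document weight. The class name, the use of Map.Entry in place of WeightedFeature, and the non-negative weights are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; Map.Entry stands in for WeightedFeature.
final class ExplainSketch {

    /** Returns (feature, share of total score) pairs, largest share first. */
    static List<Map.Entry<String, Float>> explain(Map<String, Float> featureWeights,
                                                  Map<String, Float> docWeights) {
        Map<String, Float> products = new HashMap<>();
        float total = 0f;
        for (Map.Entry<String, Float> e : featureWeights.entrySet()) {
            Float docWeight = docWeights.get(e.getKey());
            if (docWeight == null) {
                continue; // The feature does not occur in the document.
            }
            float product = e.getValue() * docWeight;
            products.put(e.getKey(), product);
            total += product;
        }
        List<Map.Entry<String, Float>> result = new ArrayList<>();
        for (Map.Entry<String, Float> e : products.entrySet()) {
            float share = total == 0f ? 0f : e.getValue() / total;
            result.add(Map.entry(e.getKey(), share));
        }
        // Order from greatest to least contribution proportion.
        result.sort((x, y) -> Float.compare(y.getValue(), x.getValue()));
        return result;
    }

    public static void main(String[] args) {
        System.out.println(explain(Map.of("x", 2f, "y", 1f, "z", 5f), Map.of("x", 3f, "y", 2f)));
    }
}
```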

runFeedback

protected Rocchio.FQR runFeedback(FeatureClusterSet cwFeatures, WeightedFeatureVector opt, java.util.List queryZone, int nRel, WeightingFunction wf, WeightingComponents wc)

Runs a feedback query with the current estimate of the optimal query.

Parameters:
opt - the current optimal query, as calculated by the vector difference.
queryZone - a query zone that we can use to drive per-partition processing
nRel - the number of relevant documents, which is the number of training examples
Returns:
the average precision on the query.

getFeatures

public FeatureClusterSet getFeatures()

Gets the features that this classifier model will be using for classification. This method must return a set containing instances of Feature.

Specified by:
getFeatures in interface BulkClassifier
Specified by:
getFeatures in interface ClassifierModel

Returns:
A set of features that will be used for classification.
See Also:
Feature

getFeature

public Feature getFeature()

Gets a single feature of the type that this classifier model uses. This feature will be filled in from the data stored for the classifier.

Specified by:
getFeature in interface ClassifierModel

Returns:
a new feature to be used during classification

dump

public void dump(java.io.RandomAccessFile raf) throws java.io.IOException

Dumps any classifier specific data to the given file. Currently this writes out the threshold for our classifier.
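
A minimal sketch of persisting a single threshold with RandomAccessFile, paired with the corresponding read. The on-disk layout shown here (one float written with writeFloat) is an assumption for illustration; the actual format used by this class is not documented on this page.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;

// Hypothetical illustration only; the real on-disk layout may differ.
final class ThresholdIoSketch {

    /** Writes the threshold at the current file position. */
    static void dump(RandomAccessFile raf, float threshold) throws IOException {
        raf.writeFloat(threshold);
    }

    /** Reads a threshold from the current file position. */
    static float read(RandomAccessFile raf) throws IOException {
        return raf.readFloat();
    }

    /** Writes a threshold to a temporary file and reads it back. */
    static float roundTrip(float threshold) {
        try {
            File tmp = File.createTempFile("rocchio", ".bin");
            try (RandomAccessFile raf = new RandomAccessFile(tmp, "rw")) {
                dump(raf, threshold);
                raf.seek(0); // Reposition to where the data was written.
                return read(raf);
            } finally {
                tmp.delete();
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip(0.35f));
    }
}
```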

Specified by:
dump in interface ClassifierModel

Parameters:
raf - The file to which the data can be dumped.
Throws:
java.io.IOException

setFeatures

public void setFeatures(FeatureClusterSet f)

Sets the feature clusters that the classifier model will use for classification.

Specified by:
setFeatures in interface ClassifierModel

Parameters:
f - the set of features.
See Also:
FeatureCluster

read

public void read(java.io.RandomAccessFile raf) throws java.io.IOException

Reads any classifier specific data from the given file.

Specified by:
read in interface ClassifierModel

Parameters:
raf - The file from which the data can be read. The file will be positioned appropriately so that the data can be read.
Throws:
java.io.IOException

classify

public float[] classify(DiskPartition sdp)

Classifies a set of documents. For a Rocchio classifier the classification process is as follows:

For each term in our feature set, get the term from the provided dictionary
Iterate through the postings for the term, multiplying the term weights by the feature weight.
Collect the per-document scores. If a score exceeds our threshold, classify that document into the class.
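
The steps above can be sketched as follows. This is an illustration under stated assumptions, not the Minion implementation: postings are modelled as an in-memory map from term to (document ID, term weight) pairs rather than a DiskPartition, and a score that does not exceed the threshold is assumed to be returned negated so that its sign carries the decision.

```java
import java.util.Map;

// Hypothetical illustration only; in-memory maps stand in for dictionary and postings.
final class ClassifySketch {

    /**
     * Accumulates featureWeight * termWeight per document, then encodes the
     * threshold decision in the sign of each score.
     */
    static float[] classify(Map<String, Float> featureWeights,
                            Map<String, Map<Integer, Float>> postings,
                            int maxDocId, float threshold) {
        float[] scores = new float[maxDocId + 1];
        for (Map.Entry<String, Float> feature : featureWeights.entrySet()) {
            Map<Integer, Float> termPostings = postings.get(feature.getKey());
            if (termPostings == null) {
                continue; // The term does not occur in this partition.
            }
            for (Map.Entry<Integer, Float> posting : termPostings.entrySet()) {
                scores[posting.getKey()] += feature.getValue() * posting.getValue();
            }
        }
        for (int doc = 0; doc < scores.length; doc++) {
            // Assumption: scores at or below the threshold are made non-positive.
            if (scores[doc] <= threshold) {
                scores[doc] = -Math.abs(scores[doc]);
            }
        }
        return scores;
    }

    public static void main(String[] args) {
        Map<String, Float> features = Map.of("a", 2f, "b", 1f);
        Map<String, Map<Integer, Float>> postings =
                Map.of("a", Map.of(1, 0.5f, 2, 0.1f), "b", Map.of(1, 0.25f));
        System.out.println(java.util.Arrays.toString(classify(features, postings, 2, 1.0f)));
    }
}
```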

Specified by:
classify in interface ClassifierModel

Parameters:
sdp - a disk partition representing the recently dumped documents.
Returns:
An array of floats. For a given document ID among the documents that were classified, if the corresponding element of the array is greater than 0, then the document should be classified into this class. The absolute value of the element indicates the similarity of that document to the classifier model.

classify

public float[][] classify(java.lang.String fromField, ClassifierDiskPartition cdp, DiskPartition sdp)

Description copied from interface: BulkClassifier

Evaluates all of the classifiers in the given classifier disk partition against all of the new documents in the given disk partition.

Specified by:
classify in interface BulkClassifier

Parameters:
fromField - the field from which the terms should be gathered.
cdp - A partition of classifiers to evaluate
sdp - A partition of documents to evaluate the classifiers against
Returns:
a two-dimensional array of evaluation scores. Element i,j of the array is the score of the document with ID j in the new partition for the classifier with document ID i in the classifier partition.
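
A minimal sketch of reading this array, with the classifier ID as the first index and the document ID as the second. The helper below is hypothetical and assumes a rectangular array; it picks, for each document, the classifier with the highest score.

```java
import java.util.Arrays;

// Hypothetical illustration of the scores[classifierId][documentId] layout.
final class BulkScoresSketch {

    /** For each document j, returns the classifier i with the highest scores[i][j]. */
    static int[] bestClassifierPerDocument(float[][] scores) {
        int nDocs = scores.length == 0 ? 0 : scores[0].length;
        int[] best = new int[nDocs]; // Starts at classifier 0 for every document.
        for (int j = 0; j < nDocs; j++) {
            for (int i = 1; i < scores.length; i++) {
                if (scores[i][j] > scores[best[j]][j]) {
                    best[j] = i;
                }
            }
        }
        return best;
    }

    public static void main(String[] args) {
        float[][] scores = {{0.1f, 0.9f}, {0.5f, 0.2f}};
        System.out.println(Arrays.toString(bestClassifierPerDocument(scores)));
    }
}
```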

findSimilar

public ResultSet findSimilar()

findSimilar

public ResultSet findSimilar(java.lang.String fromField)

Finds the documents that are most similar to this classifier, whether they are in the class or not.

newInstance

public ClassifierModel newInstance()

Description copied from interface: ClassifierModel

Creates a new instance of this classifier model. This is a non-static factory method.

Specified by:
newInstance in interface ClassifierModel

Returns:
a new instance of the classifier model

nextStep

protected void nextStep(Progress p, java.lang.String str)

describe

public java.lang.String describe()

Description copied from interface: ExplainableClassifierModel

Describes the classifier model.

Specified by:
describe in interface ExplainableClassifierModel

getFieldName

public java.lang.String getFieldName()

Description copied from interface: ClassifierModel

Gets the field name where the results of this classifier will be stored.

Specified by:
getFieldName in interface ClassifierModel

setFieldName

public void setFieldName(java.lang.String fieldName)

Description copied from interface: ClassifierModel

Sets the name of the field where the results of this classifier will be stored.

Specified by:
setFieldName in interface ClassifierModel

setFromField

public void setFromField(java.lang.String fromField)

Description copied from interface: ClassifierModel

Sets the name of the field from which the classifier was built, since we'll want to classify against terms only from that field.

Specified by:
setFromField in interface ClassifierModel

Parameters:
fromField - the name of the field that was used to generate features

getFromField

public java.lang.String getFromField()

Specified by:
getFromField in interface ClassifierModel
