com.sun.labs.minion.classification
Class Rocchio

java.lang.Object
  extended by com.sun.labs.minion.classification.Rocchio
All Implemented Interfaces:
BulkClassifier, ClassifierModel, ExplainableClassifierModel

public class Rocchio
extends java.lang.Object
implements ClassifierModel, BulkClassifier, ExplainableClassifierModel

A classifier model that does Rocchio-style classification.


Nested Class Summary
protected  class Rocchio.FQR
          A class to collate and hold the results of a feedback query.
protected  class Rocchio.HE
          A class to hold a single element of the heap that we'll use to negotiate the results of queries.
 
Field Summary
protected static float[] alpha
          Values of the beta and gamma parameters to try.
protected  float ba
           
protected  float bb
           
protected static float[] beta
           
protected  float bg
           
protected  FeatureClusterSet clusters
          A set of feature clusters.
protected  SearchEngine e
          The engine that this classifier is part of.
protected  FeatureClusterSet features
          The features that we will use for our model.
protected  java.lang.String fieldName
          The name of the field into which our classification results will go.
protected  java.text.DecimalFormat form
           
protected  java.lang.String fromField
          The name of the vectored field whose contents were used to train the classifier.
protected static float[] gamma
           
protected static java.lang.String logTag
           
protected  PartitionManager manager
          A manager for the partitions we're classifying against.
protected static int[] rankCutoff
          A set of rank cutoffs to use for dynamic query zoning.
protected  java.util.Map<DiskPartition,TermCache> termCaches
          A term cache to use when building classifiers.
protected  java.util.Map<java.lang.String,TermStatsImpl> termStats
           
protected  float threshold
          The similarity threshold for our classifier.
 
Constructor Summary
Rocchio()
           
 
Method Summary
 float checkThreshold(float score)
           
 float[] classify(DiskPartition sdp)
          Classifies a set of documents.
 float[][] classify(java.lang.String fromField, ClassifierDiskPartition cdp, DiskPartition sdp)
          Evaluates all of the classifiers in the given classifier disk partition against all of the new documents in the given disk partition.
 java.lang.String describe()
          Describes the classifier model.
 void dump(java.io.RandomAccessFile raf)
          Dumps any classifier-specific data to the given file.
 java.util.List<WeightedFeature> explain(java.lang.String key)
          Explains the score that a given document would get for this classifier.
 java.lang.String explain(java.lang.String key, boolean includeDocTerms)
          Explains why (or why not) the document with the given key would (or would not) be classified into this class.
 ResultSet findSimilar()
           
 ResultSet findSimilar(java.lang.String fromField)
          Finds the documents that are most similar to this classifier, whether they are in the class or not.
 Feature getFeature()
          Gets a single feature of the type that this classifier model uses.
 FeatureClusterSet getFeatures()
          Gets the features that this classifier model will be using for classification.
 java.lang.String getFieldName()
          Gets the field name where the results of this classifier will be stored.
 java.lang.String getFromField()
           
 java.lang.String getModelName()
          Gets the name of the model.
 float getThreshold()
           
 ClassifierModel newInstance()
          Creates a new instance of this classifier model.
protected  void nextStep(Progress p, java.lang.String str)
           
 void read(java.io.RandomAccessFile raf)
          Reads any classifier-specific data from the given file.
protected  Rocchio.FQR runFeedback(FeatureClusterSet cwFeatures, WeightedFeatureVector opt, java.util.List queryZone, int nRel, WeightingFunction wf, WeightingComponents wc)
          Runs a feedback query with the current estimate of the optimal query.
 void setEngine(SearchEngine e)
          Sets the search engine that this classifier is part of.
 void setFeatures(FeatureClusterSet f)
          Sets the feature clusters that the classifier model will use for classification.
 void setFieldName(java.lang.String fieldName)
          Sets the name of the field where the results of this classifier will be stored.
 void setFromField(java.lang.String fromField)
          Sets the name of the field from which the classifier was built, since we'll want to classify against terms only from that field.
 void setModelName(java.lang.String modelName)
          Sets the name of the model.
 float similarity(ClassifierModel cm)
          Computes the similarity between this classifier model and another.
 float similarity(DocumentVector v)
          Computes the similarity of the given document vector and the classifier.
 float similarity(java.lang.String key)
          Computes the similarity of the given document and the classifier.
 void train(java.lang.String name, java.lang.String fieldName, PartitionManager manager, ResultSetImpl training, FeatureClusterSet selectedFeatures, java.util.Map<java.lang.String,TermStatsImpl> termStats, java.util.Map<DiskPartition,TermCache> termCaches, Progress progress)
          Trains the classifier on a set of documents.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

e

protected SearchEngine e
The engine that this classifier is part of.


termCaches

protected java.util.Map<DiskPartition,TermCache> termCaches
A term cache to use when building classifiers.


termStats

protected java.util.Map<java.lang.String,TermStatsImpl> termStats

features

protected FeatureClusterSet features
The features that we will use for our model.


clusters

protected FeatureClusterSet clusters
A set of feature clusters.


threshold

protected float threshold
The similarity threshold for our classifier.


ba

protected float ba

bb

protected float bb

bg

protected float bg

manager

protected PartitionManager manager
A manager for the partitions we're classifying against.


fieldName

protected java.lang.String fieldName
The name of the field into which our classification results will go.


fromField

protected java.lang.String fromField
The name of the vectored field whose contents were used to train the classifier.


rankCutoff

protected static int[] rankCutoff
A set of rank cutoffs to use for dynamic query zoning.


alpha

protected static float[] alpha
Values of the beta and gamma parameters to try.


beta

protected static float[] beta

gamma

protected static float[] gamma

logTag

protected static java.lang.String logTag

form

protected java.text.DecimalFormat form

Constructor Detail

Rocchio

public Rocchio()

Method Detail

setModelName

public void setModelName(java.lang.String modelName)
Description copied from interface: ClassifierModel
Sets the name of the model.

Specified by:
setModelName in interface ClassifierModel

getModelName

public java.lang.String getModelName()
Description copied from interface: ClassifierModel
Gets the name of the model.

Specified by:
getModelName in interface ClassifierModel

getThreshold

public float getThreshold()

train

public void train(java.lang.String name,
                  java.lang.String fieldName,
                  PartitionManager manager,
                  ResultSetImpl training,
                  FeatureClusterSet selectedFeatures,
                  java.util.Map<java.lang.String,TermStatsImpl> termStats,
                  java.util.Map<DiskPartition,TermCache> termCaches,
                  Progress progress)
           throws SearchEngineException
Trains the classifier on a set of documents. Training a Rocchio classifier consists of the following steps:
  1. Select the features upon which to base the classifier.
  2. Build a results set by computing the OR (disjunction) of the selected features.
  3. Take the top R documents from this results set, where R is the size of the document set given for training purposes. This is the query zone.
  4. Create two document vectors, rel and nonrel.
  5. For each document d in the query zone:
    1. If d is in the training set, then add d to rel.
    2. If d is not in the training set, then add d to nonrel.
  6. The feedback query is computed as beta/R * rel - gamma/(N-R) * nonrel, where R is the number of training documents.
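The final step can be illustrated with a minimal sketch over dense float arrays. This is not the Minion implementation: the class name RocchioSketch and the array representation are hypothetical, and the sketch assumes that rel and nonrel are summed document vectors and that N - R is the number of non-relevant documents.

```java
/** Hypothetical illustration; not part of com.sun.labs.minion. */
final class RocchioSketch {

    /**
     * Computes beta/R * rel - gamma/(N-R) * nonrel component by component.
     *
     * @param rel    summed feature weights of the relevant documents (assumed representation)
     * @param nonrel summed feature weights of the non-relevant documents (assumed representation)
     * @param r      R, the number of training documents
     * @param n      N, assumed here to be a document count larger than R
     */
    static float[] feedbackQuery(float[] rel, float[] nonrel,
                                 int r, int n, float beta, float gamma) {
        if (rel.length != nonrel.length) {
            throw new IllegalArgumentException("rel and nonrel must have the same length");
        }
        if (r <= 0 || n <= r) {
            throw new IllegalArgumentException("need 0 < R < N");
        }
        float relScale = beta / r;
        float nonrelScale = gamma / (n - r);
        float[] query = new float[rel.length];
        for (int i = 0; i < query.length; i++) {
            query[i] = relScale * rel[i] - nonrelScale * nonrel[i];
        }
        return query;
    }
}
```

Features that occur mostly in relevant documents end up with positive weights, and features that occur mostly in non-relevant documents end up with negative weights.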

Specified by:
train in interface ClassifierModel
Parameters:
name - The name of the class, as specified by the application.
manager - the manager for the partitions against which we're training
training - A set of results containing the training documents for the class.
selectedFeatures - the set of features to use when training this classifier
fieldName - the name of the field where the results of this classifier will be stored
termStats - A map from names to term statistics for the feature clusters. This map will be populated with all of the elements of selectedFeatures when this method is called.
termCaches - A map from partitions to term caches containing the uncompressed postings for the feature clusters in selectedFeatures. The caches will be fully populated with the clusters from selectedFeatures when this method is called.
Throws:
SearchEngineException - if there is any problem training the classifier.

setEngine

public void setEngine(SearchEngine e)
Description copied from interface: ClassifierModel
Sets the search engine that this classifier is part of.

Specified by:
setEngine in interface ClassifierModel

checkThreshold

public float checkThreshold(float score)

similarity

public float similarity(java.lang.String key)
Description copied from interface: ClassifierModel
Computes the similarity of the given document and the classifier.

Specified by:
similarity in interface ClassifierModel
Parameters:
key - the key of the document for which we wish to compute similarity
Returns:
the similarity between the document and this classifier. The absolute value of the return value indicates the degree of similarity. A return value that is greater than 0 should indicate that the given document would be classified into this class. A return value less than 0 should indicate that the given document would not be classified into this class.
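
The sign convention described above could be consumed as in the following sketch. The helper class ScoreInterpretation is hypothetical; only the convention (the sign indicates membership, the absolute value indicates degree) comes from the description. The treatment of a score of exactly 0 is not specified, and this sketch assumes it means "not in the class".

```java
/** Hypothetical helper for reading a signed similarity score. */
final class ScoreInterpretation {

    /** A score greater than 0 indicates the document would be classified into the class. */
    static boolean inClass(float score) {
        return score > 0f;
    }

    /** The absolute value of the score is the degree of similarity. */
    static float degree(float score) {
        return Math.abs(score);
    }

    static String describe(float score) {
        return (inClass(score) ? "in class" : "not in class")
                + ", similarity " + degree(score);
    }
}
```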

similarity

public float similarity(DocumentVector v)
Description copied from interface: ClassifierModel
Computes the similarity of the given document vector and the classifier.

Specified by:
similarity in interface ClassifierModel
Parameters:
v - the document vector with which we want to calculate similarity
Returns:
the similarity between the document and this classifier. The absolute value of the return value indicates the degree of similarity. A return value that is greater than 0 should indicate that the given document would be classified into this class. A return value less than 0 should indicate that the given document would not be classified into this class.

similarity

public float similarity(ClassifierModel cm)
Description copied from interface: ClassifierModel
Computes the similarity between this classifier model and another.

Specified by:
similarity in interface ClassifierModel
Parameters:
cm - the model we want to compute the similarity to
Returns:
the similarity between this classifier model and the other

explain

public java.lang.String explain(java.lang.String key,
                                boolean includeDocTerms)
Description copied from interface: ExplainableClassifierModel
Explains why (or why not) the document with the given key would (or would not) be classified into this class.

Specified by:
explain in interface ExplainableClassifierModel
Parameters:
key - the key of the document whose classification we want to explain
includeDocTerms - if true, the explanation will include a description of the terms from the document.
Returns:
the explanation as a string.

explain

public java.util.List<WeightedFeature> explain(java.lang.String key)
Description copied from interface: ExplainableClassifierModel
Explains the score that a given document would get for this classifier.

Specified by:
explain in interface ExplainableClassifierModel
Parameters:
key - the key of the document that we want to explain
Returns:
a list of features that contributed to the score. The weight associated with each feature is the proportion of the overall score contributed by that feature. The list is ordered from greatest to least contribution proportion.
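
As an illustration of the returned ordering, the sketch below turns raw per-feature contributions into proportions of the total and sorts them from greatest to least. It does not use WeightedFeature: the class ContributionSketch and the map-based representation are hypothetical, and the sketch assumes non-negative contributions.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration; not the Minion explain() implementation. */
final class ContributionSketch {

    /** Returns (feature name, proportion of total score) pairs, largest proportion first. */
    static List<Map.Entry<String, Float>> proportions(Map<String, Float> contributions) {
        float total = 0f;
        for (float c : contributions.values()) {
            total += c;
        }
        List<Map.Entry<String, Float>> result = new ArrayList<>();
        if (total == 0f) {
            return result; // nothing contributed, so there is nothing to explain
        }
        for (Map.Entry<String, Float> e : contributions.entrySet()) {
            result.add(new SimpleEntry<String, Float>(e.getKey(), e.getValue() / total));
        }
        result.sort(Map.Entry.<String, Float>comparingByValue().reversed());
        return result;
    }
}
```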

runFeedback

protected Rocchio.FQR runFeedback(FeatureClusterSet cwFeatures,
                                  WeightedFeatureVector opt,
                                  java.util.List queryZone,
                                  int nRel,
                                  WeightingFunction wf,
                                  WeightingComponents wc)
Runs a feedback query with the current estimate of the optimal query.

Parameters:
opt - the current optimal query, as calculated by the vector difference.
queryZone - a query zone that we can use to drive per-partition processing
nRel - the number of relevant documents, which is the number of training examples
Returns:
the average precision on the query.
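
Average precision is a standard ranking metric. A minimal sketch of the usual non-interpolated definition follows; whether Rocchio computes it in exactly this way is not stated here, and the class and parameter names are hypothetical.

```java
/** Hypothetical illustration of non-interpolated average precision. */
final class AveragePrecisionSketch {

    /**
     * @param relevant relevant[i] is true if the document at rank i + 1 is relevant
     * @param nRel     the total number of relevant documents
     */
    static float averagePrecision(boolean[] relevant, int nRel) {
        if (nRel <= 0) {
            return 0f;
        }
        int hits = 0;
        float sum = 0f;
        for (int i = 0; i < relevant.length; i++) {
            if (relevant[i]) {
                hits++;
                // Precision at this rank: relevant documents seen so far / rank.
                sum += (float) hits / (i + 1);
            }
        }
        return sum / nRel;
    }
}
```

Relevant documents that never appear in the ranking contribute nothing to the sum but still count in nRel, so missing them lowers the result.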

getFeatures

public FeatureClusterSet getFeatures()
Gets the features that this classifier model will be using for classification. This method must return a set containing instances of Feature.

Specified by:
getFeatures in interface BulkClassifier
Specified by:
getFeatures in interface ClassifierModel
Returns:
A set of features that will be used for classification.
See Also:
Feature

getFeature

public Feature getFeature()
Gets a single feature of the type that this classifier model uses. This feature will be filled in from the data stored for the classifier.

Specified by:
getFeature in interface ClassifierModel
Returns:
a new feature to be used during classification

dump

public void dump(java.io.RandomAccessFile raf)
          throws java.io.IOException
Dumps any classifier-specific data to the given file. Currently this writes out the threshold for our classifier.
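
A minimal sketch of persisting a single threshold value with java.io.RandomAccessFile is shown below. The on-disk layout (one 4-byte float written with writeFloat) and the class ThresholdIoSketch are assumptions made for illustration; the actual format used by Rocchio is not documented here.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;

/** Hypothetical illustration; the real on-disk format may differ. */
final class ThresholdIoSketch {

    /** Writes the threshold at the current file position. */
    static void dump(RandomAccessFile raf, float threshold) throws IOException {
        raf.writeFloat(threshold);
    }

    /** Reads the threshold from the current file position. */
    static float read(RandomAccessFile raf) throws IOException {
        return raf.readFloat();
    }

    /** Writes a threshold to a temporary file and reads it back. */
    static float roundTrip(float threshold) {
        try {
            File f = File.createTempFile("rocchio", ".dat");
            try (RandomAccessFile raf = new RandomAccessFile(f, "rw")) {
                long start = raf.getFilePointer();
                dump(raf, threshold);
                raf.seek(start); // reposition before reading, as a caller of read would
                return read(raf);
            } finally {
                f.delete();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```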

Specified by:
dump in interface ClassifierModel
Parameters:
raf - The file to which the data can be dumped.
Throws:
java.io.IOException

setFeatures

public void setFeatures(FeatureClusterSet f)
Sets the feature clusters that the classifier model will use for classification.

Specified by:
setFeatures in interface ClassifierModel
Parameters:
f - the set of features.
See Also:
FeatureCluster

read

public void read(java.io.RandomAccessFile raf)
          throws java.io.IOException
Reads any classifier-specific data from the given file.

Specified by:
read in interface ClassifierModel
Parameters:
raf - The file from which the data can be read. The file will be positioned appropriately so that the data can be read.
Throws:
java.io.IOException

classify

public float[] classify(DiskPartition sdp)
Classifies a set of documents. For a Rocchio classifier the classification process is as follows:
  1. For each term in our feature set, get the term from the provided dictionary.
  2. Iterate through the postings for the term, multiplying the term weights by the feature weight.
  3. Collect the per-document scores. If a score exceeds our threshold, classify that document into the class.
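The three steps above can be sketched with plain collections. Everything in this example is hypothetical: the Posting class, the map-based dictionary, and the choice to negate scores that do not exceed the threshold (one way to obtain a signed result whose absolute value is still the similarity). The sketch also assumes a non-negative threshold and non-negative accumulated scores.

```java
import java.util.Map;

/** Hypothetical illustration; not the Minion classify() implementation. */
final class ClassifySketch {

    /** One posting: a document ID and the term's weight in that document. */
    static final class Posting {
        final int docId;
        final float weight;

        Posting(int docId, float weight) {
            this.docId = docId;
            this.weight = weight;
        }
    }

    static float[] classify(Map<String, Float> featureWeights,
                            Map<String, Posting[]> dictionary,
                            int maxDocId, float threshold) {
        float[] scores = new float[maxDocId + 1];
        for (Map.Entry<String, Float> feature : featureWeights.entrySet()) {
            // Step 1: look the feature's term up in the dictionary.
            Posting[] postings = dictionary.get(feature.getKey());
            if (postings == null) {
                continue;
            }
            // Step 2: multiply each term weight by the feature weight and accumulate.
            for (Posting p : postings) {
                scores[p.docId] += p.weight * feature.getValue();
            }
        }
        // Step 3: keep scores above the threshold positive; negate the rest (assumption).
        for (int d = 0; d < scores.length; d++) {
            if (scores[d] <= threshold) {
                scores[d] = -Math.abs(scores[d]);
            }
        }
        return scores;
    }
}
```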

Specified by:
classify in interface ClassifierModel
Parameters:
sdp - a disk partition representing the recently dumped documents.
Returns:
An array of floats indexed by document ID. If the element for a given document ID is greater than 0, the document should be classified into this class. The absolute value of the element indicates the similarity of that document to the classifier model.

classify

public float[][] classify(java.lang.String fromField,
                          ClassifierDiskPartition cdp,
                          DiskPartition sdp)
Description copied from interface: BulkClassifier
Evaluates all of the classifiers in the given classifier disk partition against all of the new documents in the given disk partition.

Specified by:
classify in interface BulkClassifier
Parameters:
fromField - the field from which the terms should be gathered.
cdp - A partition of classifiers to evaluate
sdp - A partition of documents to evaluate the classifiers against
Returns:
a two-dimensional array of evaluation scores. Element i,j of the array is the score for the document with ID j in the new partition for the classifier with document ID i in the classifier partition.
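
The indexing of the returned array can be illustrated as follows. The class BulkScoresSketch is hypothetical, and the sketch assumes that the bulk scores follow the same sign convention as the single-classifier classify method (greater than 0 means the document belongs to the class), which is not stated for this method.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of walking a classifier-by-document score matrix. */
final class BulkScoresSketch {

    /** Returns {classifierId, docId} pairs whose score is greater than 0. */
    static List<int[]> assignments(float[][] scores) {
        List<int[]> out = new ArrayList<>();
        for (int i = 0; i < scores.length; i++) {         // i: classifier document ID
            for (int j = 0; j < scores[i].length; j++) {  // j: new document ID
                if (scores[i][j] > 0f) {
                    out.add(new int[] {i, j});
                }
            }
        }
        return out;
    }
}
```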

findSimilar

public ResultSet findSimilar()

findSimilar

public ResultSet findSimilar(java.lang.String fromField)
Finds the documents that are most similar to this classifier, whether they are in the class or not.


newInstance

public ClassifierModel newInstance()
Description copied from interface: ClassifierModel
Creates a new instance of this classifier model. This is a non-static factory method.
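
A non-static factory method lets code that holds only a model instance create more models of the same concrete type without naming that type. The sketch below shows the pattern with a stand-in interface; Model, RocchioLike, and createMany are hypothetical names, not part of the Minion API.

```java
/** Hypothetical illustration of a non-static factory method. */
final class FactorySketch {

    /** Stand-in for a model interface; not the real ClassifierModel. */
    interface Model {
        Model newInstance();
    }

    static final class RocchioLike implements Model {
        @Override
        public Model newInstance() {
            return new RocchioLike();
        }
    }

    /** Creates n fresh models of the same concrete type as the prototype. */
    static Model[] createMany(Model prototype, int n) {
        Model[] models = new Model[n];
        for (int i = 0; i < n; i++) {
            models[i] = prototype.newInstance();
        }
        return models;
    }
}
```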

Specified by:
newInstance in interface ClassifierModel
Returns:
a new instance of the classifier model

nextStep

protected void nextStep(Progress p,
                        java.lang.String str)

describe

public java.lang.String describe()
Description copied from interface: ExplainableClassifierModel
Describes the classifier model.

Specified by:
describe in interface ExplainableClassifierModel

getFieldName

public java.lang.String getFieldName()
Description copied from interface: ClassifierModel
Gets the field name where the results of this classifier will be stored.

Specified by:
getFieldName in interface ClassifierModel

setFieldName

public void setFieldName(java.lang.String fieldName)
Description copied from interface: ClassifierModel
Sets the name of the field where the results of this classifier will be stored.

Specified by:
setFieldName in interface ClassifierModel

setFromField

public void setFromField(java.lang.String fromField)
Description copied from interface: ClassifierModel
Sets the name of the field from which the classifier was built, since we'll want to classify against terms only from that field.

Specified by:
setFromField in interface ClassifierModel
Parameters:
fromField - the name of the field that was used to generate features

getFromField

public java.lang.String getFromField()

Specified by:
getFromField in interface ClassifierModel