com.sun.labs.minion.classification
Class BalancedWinnow

java.lang.Object
  extended by com.sun.labs.minion.classification.BalancedWinnow
All Implemented Interfaces:
ClassifierModel

public class BalancedWinnow
extends java.lang.Object
implements ClassifierModel

An implementation of the Balanced Winnow classification algorithm. An instance of BalancedWinnow represents a classifier for a particular class. Classifiers can be trained and used to classify documents.


Constructor Summary
BalancedWinnow()
           
 
Method Summary
 float[] classify(DiskPartition sdp)
          Classifies a disk partition of documents.
 void dump(java.io.RandomAccessFile raf)
          Writes out the threshold that describes the minimum closeness a vector must have to this classifier.
 Feature getFeature()
          Gets a single feature of the type that this classifier model uses.
 FeatureClusterSet getFeatures()
          Gets the features that this classifier model will be using for classification.
 java.lang.String getFieldName()
          Gets the field name where the results of this classifier will be stored.
 java.lang.String getFromField()
           
 java.lang.String getModelName()
          Gets the name of the model.
protected  java.util.List getStrengthArrays(java.util.Map<DiskPartition,java.util.List> partToDocList, FeatureClusterSet clusterSet, java.util.Map<DiskPartition,TermCache> termCaches)
           
 ClassifierModel newInstance()
          Creates a new instance of this classifier model.
protected  void nextStep(Progress p, java.lang.String str)
           
 void read(java.io.RandomAccessFile raf)
          Reads the threshold for this classifier.
 void setEngine(SearchEngine e)
          Sets the search engine that this classifier is part of.
 void setFeatures(FeatureClusterSet f)
          Sets the features that the classifier model will use for classification.
 void setFieldName(java.lang.String fieldName)
          Sets the name of the field where the results of this classifier will be stored.
 void setFromField(java.lang.String fromField)
          Sets the name of the field from which the classifier was built, since we'll want to classify against terms only from that field.
 void setModelName(java.lang.String modelName)
          Sets the name of the model.
 float similarity(ClassifierModel cm)
          Computes the similarity between this classifier model and another.
 float similarity(DocumentVector v)
          Computes the similarity of the given document vector and the classifier.
 float similarity(java.lang.String key)
          Computes the similarity of the given document and the classifier.
protected  float strength(int freq)
           
 void train(java.lang.String name, java.lang.String fieldName, PartitionManager manager, ResultSetImpl training, FeatureClusterSet selectedFeatures, java.util.Map<java.lang.String,TermStatsImpl> termStats, java.util.Map<DiskPartition,TermCache> termCaches, Progress progress)
          Train a balanced winnow classifier.
protected  boolean winnow(float[] upperWeight, float[] lowerWeight, java.util.List strengthArrays, boolean expectPositive)
          Actually computes the winnow sums and modifies the upper and lower weights according to balanced winnow as described in the train method.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

BalancedWinnow

public BalancedWinnow()
Method Detail

setModelName

public void setModelName(java.lang.String modelName)
Description copied from interface: ClassifierModel
Sets the name of the model.

Specified by:
setModelName in interface ClassifierModel

getModelName

public java.lang.String getModelName()
Description copied from interface: ClassifierModel
Gets the name of the model.

Specified by:
getModelName in interface ClassifierModel

train

public void train(java.lang.String name,
                  java.lang.String fieldName,
                  PartitionManager manager,
                  ResultSetImpl training,
                  FeatureClusterSet selectedFeatures,
                  java.util.Map<java.lang.String,TermStatsImpl> termStats,
                  java.util.Map<DiskPartition,TermCache> termCaches,
                  Progress progress)
           throws SearchEngineException
Train a balanced winnow classifier. Balanced winnow starts with a generic classifier and uses classification mistakes to nudge the weighted feature vector along until it fits the class. Because winnow is an online learning algorithm, it needs no knowledge of the collection as a whole, except for a few high-level statistics.

Several parameters are used in BalancedWinnow:

Weight vectors - Two weight vectors store the upper and lower weight scores. For each feature, the difference between the two scores is the coefficient of the classifier vector. The weights are modified by alpha and beta (described below) when the corresponding features are encountered in misclassified documents. When a positive example is misclassified, the upper weight is multiplied by alpha and the lower weight is multiplied by beta. When a negative example is misclassified, the upper weight is multiplied by beta and the lower weight is multiplied by alpha.
Theta - The threshold that a document must meet to be considered part of the class.
Alpha - The "promotion" parameter. When winnow misclassifies an example, the weights of the selected features in the example document are modified by this parameter. By definition, alpha > 1.
Beta - The "demotion" parameter. When winnow misclassifies a negative example, the weights of the selected features in the example document are modified by this parameter. By definition, 0 < beta < 1.

With these parameters, misclassified positives increase the overall weight, while misclassified negatives decrease it. As a result, the overall weight of a feature that is a negative indicator can become negative.

Specified by:
train in interface ClassifierModel
Parameters:
name - the name of the classifier
fieldName - the name of the field where the results of this classifier will be stored
manager - the partition manager for the collection
training - the set of documents in the training set
selectedFeatures - the set of features (clusters) to use
termStats - A map from names to term statistics for the feature clusters. This map will be populated with all of the elements of selectedFeatures when this method is called.
termCaches - A map from partitions to term caches containing the uncompressed postings for the feature clusters in selectedFeatures. The caches will be fully populated with the clusters from selectedFeatures when this method is called.
progress - an object to use to report progress
Throws:
SearchEngineException - if there is any error using the index while training the classifier
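
The update rule described for train can be illustrated with a minimal sketch. This is not the Minion implementation: the class name BalancedWinnowSketch, the constant values for alpha, beta, and theta, the "score > theta" decision rule, and the choice to update only features with a nonzero strength are all assumptions made for illustration.

```java
import java.util.Arrays;

/** Hypothetical sketch of a balanced winnow update for one example document. */
final class BalancedWinnowSketch {
    // Illustrative values only (assumptions): alpha > 1, 0 < beta < 1.
    static final float ALPHA = 1.5f;
    static final float BETA = 0.5f;
    static final float THETA = 1.0f;

    /** Sums (upper - lower) * strength over all features. */
    static float score(float[] upper, float[] lower, float[] strengths) {
        float sum = 0f;
        for (int i = 0; i < strengths.length; i++) {
            sum += (upper[i] - lower[i]) * strengths[i];
        }
        return sum;
    }

    /**
     * Updates the weights only when the example is misclassified.
     * Returns true if the weights were left unchanged.
     */
    static boolean update(float[] upper, float[] lower, float[] strengths,
                          boolean expectPositive) {
        // Assumed decision rule: positive when the score exceeds theta.
        boolean predictedPositive = score(upper, lower, strengths) > THETA;
        if (predictedPositive == expectPositive) {
            return true;
        }
        // Misclassified positive: upper *= alpha, lower *= beta.
        // Misclassified negative: upper *= beta,  lower *= alpha.
        float upperFactor = expectPositive ? ALPHA : BETA;
        float lowerFactor = expectPositive ? BETA : ALPHA;
        for (int i = 0; i < strengths.length; i++) {
            if (strengths[i] > 0f) { // only features present in the document
                upper[i] *= upperFactor;
                lower[i] *= lowerFactor;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        float[] upper = {2f, 2f, 2f};
        float[] lower = {1f, 1f, 1f};
        float[] doc = {1f, 0f, 0f}; // only the first feature occurs
        boolean unchanged = update(upper, lower, doc, true);
        System.out.println(unchanged + " " + Arrays.toString(upper)
                + " " + Arrays.toString(lower));
    }
}
```

In this sketch the first feature's coefficient (upper minus lower) grows after a misclassified positive, while features absent from the document keep their weights.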

getStrengthArrays

protected java.util.List getStrengthArrays(java.util.Map<DiskPartition,java.util.List> partToDocList,
                                           FeatureClusterSet clusterSet,
                                           java.util.Map<DiskPartition,TermCache> termCaches)

winnow

protected boolean winnow(float[] upperWeight,
                         float[] lowerWeight,
                         java.util.List strengthArrays,
                         boolean expectPositive)
Actually computes the winnow sums and modifies the upper and lower weights according to balanced winnow as described in the train method.

Parameters:
upperWeight - the upper/positive weight array
lowerWeight - the lower/negative weight array
strengthArrays - the arrays representing the example docs
expectPositive - true if the examples are positive examples
Returns:
true if winnow made no changes to the arrays
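
Because winnow reports whether a pass left the weight arrays unchanged, a caller could plausibly repeat passes until the weights stabilize. The sketch below shows that pattern only. Whether train actually loops this way is an assumption, and ConvergenceLoopSketch, passesUntilStable, and maxPasses are hypothetical names.

```java
import java.util.function.BooleanSupplier;

/** Hypothetical sketch: repeat a training pass until it makes no changes. */
final class ConvergenceLoopSketch {
    /**
     * Runs the pass until it returns true (no changes made) or maxPasses is
     * reached. Returns the number of passes run, or -1 if never stable.
     */
    static int passesUntilStable(BooleanSupplier pass, int maxPasses) {
        for (int i = 1; i <= maxPasses; i++) {
            if (pass.getAsBoolean()) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Stand-in for a winnow pass: reports changes for the first three calls.
        int[] mistakesLeft = {3};
        BooleanSupplier fakePass = () -> mistakesLeft[0]-- <= 0;
        System.out.println(passesUntilStable(fakePass, 10));
    }
}
```

A pass cap such as maxPasses guards against training sets that are not separable, where the weights might never stop changing.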

strength

protected float strength(int freq)

getFeatures

public FeatureClusterSet getFeatures()
Description copied from interface: ClassifierModel
Gets the features that this classifier model will be using for classification. This method must return a set containing instances of Feature.

Specified by:
getFeatures in interface ClassifierModel
Returns:
A set of features that will be used for classification.
See Also:
Feature

getFeature

public Feature getFeature()
Description copied from interface: ClassifierModel
Gets a single feature of the type that this classifier model uses. This feature will be filled in from the data stored for the classifier.

Specified by:
getFeature in interface ClassifierModel
Returns:
a new feature to be used during classification

setEngine

public void setEngine(SearchEngine e)
Description copied from interface: ClassifierModel
Sets the search engine that this classifier is part of.

Specified by:
setEngine in interface ClassifierModel

dump

public void dump(java.io.RandomAccessFile raf)
          throws java.io.IOException
Writes out the threshold that describes the minimum closeness a vector must have to this classifier.

Specified by:
dump in interface ClassifierModel
Parameters:
raf - the file (correctly positioned) to write the threshold to
Throws:
java.io.IOException

setFeatures

public void setFeatures(FeatureClusterSet f)
Description copied from interface: ClassifierModel
Sets the features that the classifier model will use for classification. The provided set will only contain instances of Feature.

Specified by:
setFeatures in interface ClassifierModel
Parameters:
f - the set of features.
See Also:
Feature

read

public void read(java.io.RandomAccessFile raf)
          throws java.io.IOException
Reads the threshold for this classifier.

Specified by:
read in interface ClassifierModel
Parameters:
raf - the file (correctly positioned) to read the threshold from
Throws:
java.io.IOException
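
The dump and read pair can be pictured as a round trip through a positioned RandomAccessFile. The sketch below assumes the threshold is stored as a single float written with writeFloat. The actual on-disk format is not stated here, and ThresholdIoSketch and its methods are hypothetical names.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;

/** Hypothetical sketch of writing and reading back a single threshold value. */
final class ThresholdIoSketch {
    /** Writes the threshold at the file's current position (assumed format). */
    static void dumpThreshold(RandomAccessFile raf, float theta) throws IOException {
        raf.writeFloat(theta);
    }

    /** Reads a threshold from the file's current position. */
    static float readThreshold(RandomAccessFile raf) throws IOException {
        return raf.readFloat();
    }

    /** Writes theta to a temporary file, seeks back, and reads it again. */
    static float roundTrip(float theta) {
        try {
            File f = File.createTempFile("threshold", ".bin");
            f.deleteOnExit();
            try (RandomAccessFile raf = new RandomAccessFile(f, "rw")) {
                long start = raf.getFilePointer();
                dumpThreshold(raf, theta);
                raf.seek(start); // the reader must be positioned where the writer began
                return readThreshold(raf);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip(0.75f));
    }
}
```

The seek call reflects the "correctly positioned" requirement on raf: neither method moves the file pointer to a known offset on its own.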

classify

public float[] classify(DiskPartition sdp)
Description copied from interface: ClassifierModel
Classifies a disk partition of documents.

Specified by:
classify in interface ClassifierModel
Parameters:
sdp - a disk partition
Returns:
An array of floats. For a given document ID among the documents that were classified, if the array element for that ID is greater than 0, the document should be classified into this class. The value of the element indicates the similarity of that document to the classifier model.
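
As a rough illustration of how the returned array might be consumed, the sketch below collects the IDs whose score is greater than 0. It assumes the array index is used directly as the document ID. ClassifyResultSketch and acceptedDocIds are hypothetical names, and the scores are made-up values.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: pick out the documents accepted by a classifier. */
final class ClassifyResultSketch {
    /** Returns the indices whose score is strictly greater than 0. */
    static List<Integer> acceptedDocIds(float[] scores) {
        List<Integer> ids = new ArrayList<>();
        for (int docId = 0; docId < scores.length; docId++) {
            if (scores[docId] > 0f) {
                ids.add(docId);
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        float[] scores = {0f, 0.8f, -0.3f, 0.1f}; // illustrative values
        System.out.println(acceptedDocIds(scores));
    }
}
```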

similarity

public float similarity(java.lang.String key)
Description copied from interface: ClassifierModel
Computes the similarity of the given document and the classifier.

Specified by:
similarity in interface ClassifierModel
Parameters:
key - the key of the document for which we wish to compute similarity
Returns:
the similarity between the document and this classifier. The absolute value of the return value indicates the degree of similarity. A return value that is greater than 0 should indicate that the given document would be classified into this class. A return value less than 0 should indicate that the given document would not be classified into this class.

similarity

public float similarity(DocumentVector v)
Description copied from interface: ClassifierModel
Computes the similarity of the given document vector and the classifier.

Specified by:
similarity in interface ClassifierModel
Parameters:
v - the document vector with which we want to calculate similarity
Returns:
the similarity between the document and this classifier. The absolute value of the return value indicates the degree of similarity. A return value that is greater than 0 should indicate that the given document would be classified into this class. A return value less than 0 should indicate that the given document would not be classified into this class.
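
One way to use the sign and magnitude of these similarity values is to compare the scores that several classifiers assign to the same document and keep the strongest positive one. This is a sketch under assumptions: SimilarityPickSketch and bestClass are hypothetical names, the map of class names to scores is made-up data, and treating a score of exactly 0 as "not in the class" is a choice of this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical sketch: choose the class whose classifier scores highest. */
final class SimilarityPickSketch {
    /** Returns the class with the largest positive similarity, if any. */
    static Optional<String> bestClass(Map<String, Float> similarities) {
        String best = null;
        float bestScore = 0f; // only scores above 0 count as membership
        for (Map.Entry<String, Float> e : similarities.entrySet()) {
            float s = e.getValue();
            if (s > bestScore) {
                best = e.getKey();
                bestScore = s;
            }
        }
        return Optional.ofNullable(best);
    }

    public static void main(String[] args) {
        Map<String, Float> sims = new LinkedHashMap<>();
        sims.put("sports", 0.4f);
        sims.put("politics", -0.9f); // strongly not in this class
        sims.put("tech", 0.7f);
        System.out.println(bestClass(sims).orElse("none"));
    }
}
```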

similarity

public float similarity(ClassifierModel cm)
Description copied from interface: ClassifierModel
Computes the similarity between this classifier model and another.

Specified by:
similarity in interface ClassifierModel
Parameters:
cm - the model we want to compute the similarity to
Returns:
the similarity between this classifier model and the other

nextStep

protected void nextStep(Progress p,
                        java.lang.String str)

newInstance

public ClassifierModel newInstance()
Description copied from interface: ClassifierModel
Creates a new instance of this classifier model. This is a non-static factory method.

Specified by:
newInstance in interface ClassifierModel
Returns:
a new instance of the classifier model
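
A non-static factory method lets code that holds one model object create more models of the same type without naming the concrete class. The sketch below shows the pattern with a stand-in interface. ModelSketch, WinnowModelSketch, and modelsFor are hypothetical and model only the newInstance method, not the rest of ClassifierModel.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of the non-static factory method pattern. */
final class FactorySketch {
    /** Stand-in for ClassifierModel; only the factory method is modeled. */
    interface ModelSketch {
        ModelSketch newInstance();
    }

    /** Stand-in for a concrete model such as BalancedWinnow. */
    static final class WinnowModelSketch implements ModelSketch {
        @Override
        public ModelSketch newInstance() {
            return new WinnowModelSketch();
        }
    }

    /** Creates one fresh model per class name from a single prototype. */
    static List<ModelSketch> modelsFor(ModelSketch prototype, List<String> classNames) {
        List<ModelSketch> models = new ArrayList<>();
        for (int i = 0; i < classNames.size(); i++) {
            models.add(prototype.newInstance());
        }
        return models;
    }

    public static void main(String[] args) {
        List<ModelSketch> models =
                modelsFor(new WinnowModelSketch(), List.of("sports", "tech"));
        System.out.println(models.size());
    }
}
```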

getFieldName

public java.lang.String getFieldName()
Description copied from interface: ClassifierModel
Gets the field name where the results of this classifier will be stored.

Specified by:
getFieldName in interface ClassifierModel

setFieldName

public void setFieldName(java.lang.String fieldName)
Description copied from interface: ClassifierModel
Sets the name of the field where the results of this classifier will be stored.

Specified by:
setFieldName in interface ClassifierModel

setFromField

public void setFromField(java.lang.String fromField)
Description copied from interface: ClassifierModel
Sets the name of the field from which the classifier was built, since we'll want to classify against terms only from that field.

Specified by:
setFromField in interface ClassifierModel
Parameters:
fromField - the name of the field that was used to generate features

getFromField

public java.lang.String getFromField()
Specified by:
getFromField in interface ClassifierModel