com.sun.labs.minion.classification
Interface ClassifierModel

All Known Implementing Classes:
BalancedWinnow, Rocchio

public interface ClassifierModel

An interface for training and using classifiers.


Method Summary
 float[] classify(DiskPartition sdp)
          Classifies a disk partition of documents.
 void dump(java.io.RandomAccessFile raf)
          Dumps any classifier-specific data to the given file.
 Feature getFeature()
          Gets a single feature of the type that this classifier model uses.
 FeatureClusterSet getFeatures()
          Gets the features that this classifier model will be using for classification.
 java.lang.String getFieldName()
          Gets the field name where the results of this classifier will be stored.
 java.lang.String getFromField()
          Gets the name of the field from which the classifier was built.
 java.lang.String getModelName()
          Gets the name of the model.
 ClassifierModel newInstance()
          Creates a new instance of this classifier model.
 void read(java.io.RandomAccessFile raf)
          Reads any classifier-specific data from the given file.
 void setEngine(SearchEngine e)
          Sets the search engine that this classifier is part of.
 void setFeatures(FeatureClusterSet f)
          Sets the features that the classifier model will use for classification.
 void setFieldName(java.lang.String fieldName)
          Sets the name of the field where the results of this classifier will be stored.
 void setFromField(java.lang.String fromField)
          Sets the name of the field from which the classifier was built, so that classification uses only terms from that field.
 void setModelName(java.lang.String modelName)
          Sets the name of the model.
 float similarity(ClassifierModel cm)
          Computes the similarity between this classifier model and another.
 float similarity(DocumentVector v)
          Computes the similarity of the given document vector and the classifier.
 float similarity(java.lang.String key)
          Computes the similarity of the given document and the classifier.
 void train(java.lang.String name, java.lang.String fieldName, PartitionManager manager, ResultSetImpl docs, FeatureClusterSet fcs, java.util.Map<java.lang.String,TermStatsImpl> termStats, java.util.Map<DiskPartition,TermCache> termCaches, Progress progress)
          Trains the classifier on a set of documents.
 

Method Detail

train

void train(java.lang.String name,
           java.lang.String fieldName,
           PartitionManager manager,
           ResultSetImpl docs,
           FeatureClusterSet fcs,
           java.util.Map<java.lang.String,TermStatsImpl> termStats,
           java.util.Map<DiskPartition,TermCache> termCaches,
           Progress progress)
           throws SearchEngineException
Trains the classifier on a set of documents.

Parameters:
name - the name of the class, as specified by the application
fieldName - the name of the field where the results of this classifier will be stored
manager - the manager for the partitions against which we're training
docs - a set of results containing the training documents for the class.
fcs - the set of features to use when training this classifier
termStats - A map from names to term statistics for the feature clusters. This map will be populated with all of the elements of fcs when this method is called.
termCaches - A map from partitions to term caches containing the uncompressed postings for the feature clusters in fcs. The caches will be fully populated with the clusters from fcs when this method is called.
Throws:
SearchEngineException - if there is any problem training the classifier.

setModelName

void setModelName(java.lang.String modelName)
Sets the name of the model.


getModelName

java.lang.String getModelName()
Gets the name of the model.


getFieldName

java.lang.String getFieldName()
Gets the field name where the results of this classifier will be stored.


setFieldName

void setFieldName(java.lang.String fieldName)
Sets the name of the field where the results of this classifier will be stored.


setFromField

void setFromField(java.lang.String fromField)
Sets the name of the field from which the classifier was built, so that classification uses only terms from that field.

Parameters:
fromField - the name of the field that was used to generate features

getFromField

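java.lang.String getFromField()
Gets the name of the field from which the classifier was built.

Returns:
the name of the field that was used to generate features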

getFeatures

FeatureClusterSet getFeatures()
Gets the features that this classifier model will be using for classification. This method must return a set containing instances of Feature.

Returns:
A set of features that will be used for classification.
See Also:
Feature

getFeature

Feature getFeature()
Gets a single feature of the type that this classifier model uses. This feature will be filled in from the data stored for the classifier.

Returns:
a new feature to be used during classification

setEngine

void setEngine(SearchEngine e)
Sets the search engine that this classifier is part of.


dump

void dump(java.io.RandomAccessFile raf)
          throws java.io.IOException
Dumps any classifier-specific data to the given file. This is only for data that is not stored in the standard dictionaries.

Parameters:
raf - The file to which the data can be dumped.
Throws:
java.io.IOException - if there is any error writing to the file

setFeatures

void setFeatures(FeatureClusterSet f)
Sets the features that the classifier model will use for classification. The provided set will only contain instances of Feature.

Parameters:
f - the set of features.
See Also:
Feature

read

void read(java.io.RandomAccessFile raf)
          throws java.io.IOException
Reads any classifier-specific data from the given file.

Parameters:
raf - The file from which the data can be read. The file will be positioned appropriately so that the data can be read.
Throws:
java.io.IOException - if there is any error reading from the file
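
The dump and read contract can be illustrated with a minimal round-trip sketch. The class ThresholdData, its fields, and the helper roundTrip are hypothetical and are not part of Minion; only the java.io.RandomAccessFile calls are real. The sketch assumes, as stated above for read, that the caller positions the file before read is invoked, so read itself never seeks.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;

// Hypothetical holder for classifier-specific data (not a Minion class).
class ThresholdData {
    float threshold;   // hypothetical decision threshold
    float[] weights;   // hypothetical per-feature weights

    // Writes the data at the file's current position.
    void dump(RandomAccessFile raf) throws IOException {
        raf.writeFloat(threshold);
        raf.writeInt(weights.length);
        for (float w : weights) {
            raf.writeFloat(w);
        }
    }

    // Reads the data back, assuming the caller has already positioned the file.
    void read(RandomAccessFile raf) throws IOException {
        threshold = raf.readFloat();
        int n = raf.readInt();
        weights = new float[n];
        for (int i = 0; i < n; i++) {
            weights[i] = raf.readFloat();
        }
    }

    // Dumps to a temporary file after some unrelated bytes, then reads back.
    static ThresholdData roundTrip(ThresholdData in) {
        File f = null;
        try {
            f = File.createTempFile("model", ".bin");
            try (RandomAccessFile raf = new RandomAccessFile(f, "rw")) {
                raf.writeLong(42L);                 // unrelated data stored first
                long start = raf.getFilePointer();  // remember where our data begins
                in.dump(raf);
                raf.seek(start);                    // caller positions the file
                ThresholdData out = new ThresholdData();
                out.read(raf);
                return out;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            if (f != null) {
                f.delete();
            }
        }
    }
}
```

Writing the array length before the elements lets read allocate the array without knowing the feature count in advance.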

classify

float[] classify(DiskPartition sdp)
Classifies a disk partition of documents.

Parameters:
sdp - a disk partition
Returns:
An array of floats indexed by document ID. If the element for a given document ID is greater than 0, then that document should be classified into this class. The value of the element indicates the similarity of that document to the classifier model.
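
As a sketch of how a caller might interpret the returned array, the helper below collects the document IDs whose score is greater than 0. The class ClassifyScores and the method acceptedDocIds are hypothetical names, and the sketch assumes that the array index is the document ID itself; this page does not say whether IDs start at 0 or 1, so any offset handling is left to the caller.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not a Minion class) for reading classify() output.
class ClassifyScores {
    // Returns the indexes whose score is strictly positive, i.e. the
    // documents that the model would place into its class.
    // Assumption: the array index is the document ID.
    static List<Integer> acceptedDocIds(float[] scores) {
        List<Integer> accepted = new ArrayList<>();
        for (int docId = 0; docId < scores.length; docId++) {
            if (scores[docId] > 0) {
                accepted.add(docId);
            }
        }
        return accepted;
    }
}
```

With this helper, an array such as {0f, 0.8f, -0.2f, 0.1f} yields the IDs 1 and 3.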

similarity

float similarity(java.lang.String key)
Computes the similarity of the given document and the classifier.

Parameters:
key - the key of the document for which we wish to compute similarity
Returns:
the similarity between the document and this classifier. The absolute value of the return value indicates the degree of similarity. A return value that is greater than 0 should indicate that the given document would be classified into this class. A return value less than 0 should indicate that the given document would not be classified into this class.

similarity

float similarity(DocumentVector v)
Computes the similarity of the given document vector and the classifier.

Parameters:
v - the document vector with which we want to calculate similarity
Returns:
the similarity between the document and this classifier. The absolute value of the return value indicates the degree of similarity. A return value that is greater than 0 should indicate that the given document would be classified into this class. A return value less than 0 should indicate that the given document would not be classified into this class.
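
The sign and magnitude convention shared by the two document similarity methods can be sketched as follows. SimilarityReading and its methods are hypothetical names used only for illustration. The description above does not define the meaning of a value of exactly 0; this sketch treats it as "not in the class", in line with the "greater than 0" rule given for classify.

```java
// Hypothetical helper (not a Minion class) for reading similarity() values.
class SimilarityReading {
    // A value greater than 0 means the document would be classified into the class.
    // Assumption: exactly 0 is treated as "not in the class".
    static boolean inClass(float similarity) {
        return similarity > 0;
    }

    // The absolute value gives the degree of similarity, whatever the sign.
    static float strength(float similarity) {
        return Math.abs(similarity);
    }
}
```

Under this reading, a score of -0.9 is a stronger signal than a score of 0.4, but only the latter places the document in the class.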

similarity

float similarity(ClassifierModel cm)
Computes the similarity between this classifier model and another.

Parameters:
cm - the model we want to compute the similarity to
Returns:
the similarity between this classifier model and the other

newInstance

ClassifierModel newInstance()
Creates a new instance of this classifier model. This is a non-static factory method.

Returns:
a new instance of the classifier model
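
A non-static factory method lets code that holds only an interface reference create further models of the same concrete type. The sketch below shows one way this might be used to build one model per class name. Model, ToyModel, and ModelFactory are hypothetical stand-ins, not Minion types; Model mirrors only three methods of ClassifierModel.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for ClassifierModel, reduced to the methods used here.
interface Model {
    Model newInstance();
    void setModelName(String modelName);
    String getModelName();
}

// Hypothetical toy implementation; a real implementation would carry model state.
class ToyModel implements Model {
    private String modelName;

    public Model newInstance() {
        return new ToyModel();   // a fresh, untrained model of the same type
    }

    public void setModelName(String modelName) {
        this.modelName = modelName;
    }

    public String getModelName() {
        return modelName;
    }
}

// Hypothetical helper that builds one model per class name from a prototype.
class ModelFactory {
    static List<Model> modelsFor(Model prototype, List<String> classNames) {
        List<Model> models = new ArrayList<>();
        for (String name : classNames) {
            Model m = prototype.newInstance();   // no knowledge of the concrete type needed
            m.setModelName(name);
            models.add(m);
        }
        return models;
    }
}
```

The prototype itself is never modified, so a single configured instance can serve as the template for every class.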