com.sun.labs.minion.classification
Class ContingencyFeatureSelector

java.lang.Object
  extended by com.sun.labs.minion.classification.ContingencyFeatureSelector
All Implemented Interfaces:
FeatureSelector
Direct Known Subclasses:
CSFeatureSelector, FastContingencyFeatureSelector, MIFeatureSelector

public class ContingencyFeatureSelector
extends java.lang.Object
implements FeatureSelector

A feature selector that builds contingency features. How the weights are calculated from those features depends on the type passed to the constructor of this class.

See Also:
ContingencyFeature.MUTUAL_INFORMATION, ContingencyFeature.CHI_SQUARED
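
As an illustration of the two weight types named above, the following is a minimal sketch of the textbook chi-squared and mutual information statistics for a 2x2 term/class contingency table. It is not taken from the Minion sources: the class name ContingencyWeights, its method names, and the cell layout (a, b, c, d) are assumptions made for this example, and the library's own formulas may differ in detail.

```java
// Hypothetical helper, not part of Minion: textbook weights from a 2x2 table.
// Cell layout (an assumption for this sketch):
//   a = in-class documents containing the term
//   b = out-of-class documents containing the term
//   c = in-class documents without the term
//   d = out-of-class documents without the term
final class ContingencyWeights {
    private ContingencyWeights() {}

    /** Pearson chi-squared statistic; returns 0 when a row or column total is 0. */
    static double chiSquared(long a, long b, long c, long d) {
        double n = a + b + c + d;
        double denom = (double) (a + b) * (c + d) * (a + c) * (b + d);
        if (denom == 0.0) {
            return 0.0;
        }
        double diff = (double) a * d - (double) b * c;
        return n * diff * diff / denom;
    }

    /** Mutual information (in nats) between term presence and class membership. */
    static double mutualInformation(long a, long b, long c, long d) {
        double n = a + b + c + d;
        if (n == 0.0) {
            return 0.0;
        }
        return cell(a, a + b, a + c, n)
             + cell(b, a + b, b + d, n)
             + cell(c, c + d, a + c, n)
             + cell(d, c + d, b + d, n);
    }

    // Contribution of one cell: p(x, y) * ln(p(x, y) / (p(x) * p(y))).
    // Empty cells contribute nothing.
    private static double cell(long count, long rowTotal, long colTotal, double n) {
        if (count == 0) {
            return 0.0;
        }
        return (count / n) * Math.log(count * n / ((double) rowTotal * colTotal));
    }

    public static void main(String[] args) {
        // A term that appears in every in-class document and nowhere else.
        System.out.println(chiSquared(10, 0, 0, 10));
        System.out.println(mutualInformation(10, 0, 0, 10));
    }
}
```

Both statistics are zero when the term is independent of the class (for example, all four cells equal) and grow as the term becomes a better indicator of class membership.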

Field Summary
protected static java.lang.String logTag
          A tag identifying this class in log output.
protected  StopWords stopWords
          Words to ignore during selection.
protected  int type
          How the weights should be calculated for the contingency table.
 
Constructor Summary
ContingencyFeatureSelector()
           
ContingencyFeatureSelector(int type)
          Makes a feature selector that returns features that use a contingency table to calculate weight.
 
Method Summary
protected  void computeContingency(ContingencyFeatureCluster curr, SearchEngine engine, WeightingComponents wc, int tsize, int N)
           
protected  boolean discardFeature(ContingencyFeature cf, SearchEngine engine)
          Determines whether a given feature should be discarded from the set.
 FeatureClusterSet select(FeatureClusterSet training, WeightingComponents wc, int numTrainingDocs, int numFeatures, SearchEngine engine)
          Selects the features from the documents that have the highest weight (for example, mutual information, depending on this selector's type) with respect to the class represented by the given training set.
 void setHumanSelected(HumanSelected hs)
          Provides a set of human selected terms that should be included or excluded from consideration during the feature selection process.
 void setStopWords(StopWords stopWords)
          Sets a stopword list: words that should be ignored when selecting features.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

type

protected int type
How the weights should be calculated for the contingency table.


stopWords

protected StopWords stopWords
Words to ignore during selection.


logTag

protected static java.lang.String logTag
A tag identifying this class in log output.

Constructor Detail

ContingencyFeatureSelector

public ContingencyFeatureSelector()

ContingencyFeatureSelector

public ContingencyFeatureSelector(int type)
Makes a feature selector that returns features that use a contingency table to calculate weight. The kind of weight calculated depends on the given type.

Parameters:
type - the type of weight to calculate, for example ContingencyFeature.MUTUAL_INFORMATION or ContingencyFeature.CHI_SQUARED
See Also:
ContingencyFeature.MUTUAL_INFORMATION, ContingencyFeature.CHI_SQUARED
Method Detail

setHumanSelected

public void setHumanSelected(HumanSelected hs)
Provides a set of human selected terms that should be included or excluded from consideration during the feature selection process.

Specified by:
setHumanSelected in interface FeatureSelector
Parameters:
hs - a set of human selected terms that should be included or excluded during feature selection.

select

public FeatureClusterSet select(FeatureClusterSet training,
                                WeightingComponents wc,
                                int numTrainingDocs,
                                int numFeatures,
                                SearchEngine engine)
Selects the features from the documents that have the highest weight (for example, mutual information, depending on this selector's type) with respect to the class represented by the given training set.

Specified by:
select in interface FeatureSelector
Parameters:
training - the set of features in the training set.
wc - a set of weighting components to use when weighting terms
numTrainingDocs - the number of training documents
numFeatures - the number of features to select.
engine - the search engine the features are from
Returns:
a sorted set of the selected features
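
This page does not show how select ranks its candidates internally. One plausible reading of "highest weight, limited to numFeatures" is a plain top-N cut over weighted features, sketched below. The class TopFeatures, its select method, and the use of a term-to-weight map are hypothetical stand-ins for FeatureClusterSet and related types, chosen only so the example is self-contained.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical illustration, not part of Minion: keep the numFeatures
// highest-weighted terms, best first.
final class TopFeatures {
    private TopFeatures() {}

    static List<String> select(Map<String, Double> weights, int numFeatures) {
        List<Map.Entry<String, Double>> entries = new ArrayList<>(weights.entrySet());

        // Sort by descending weight; break ties by name so the result is deterministic.
        entries.sort((x, y) -> {
            int byWeight = Double.compare(y.getValue(), x.getValue());
            return byWeight != 0 ? byWeight : x.getKey().compareTo(y.getKey());
        });

        // Take at most numFeatures entries from the front of the sorted list.
        int limit = Math.max(0, Math.min(numFeatures, entries.size()));
        List<String> selected = new ArrayList<>(limit);
        for (Map.Entry<String, Double> e : entries.subList(0, limit)) {
            selected.add(e.getKey());
        }
        return selected;
    }
}
```

Asking for more features than there are candidates simply returns all of them in weight order.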

computeContingency

protected void computeContingency(ContingencyFeatureCluster curr,
                                  SearchEngine engine,
                                  WeightingComponents wc,
                                  int tsize,
                                  int N)

discardFeature

protected boolean discardFeature(ContingencyFeature cf,
                                 SearchEngine engine)
Determines whether a given feature should be discarded from the set. This can be overridden in subclasses to use different methods for deciding when to drop a feature.

Parameters:
cf - the feature we want to test
engine - the engine we're using to do the test
Returns:
true if the feature should be discarded, false if it should be kept
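
The override point described above follows the usual pattern of a protected hook that subclasses replace. The sketch below shows that pattern with hypothetical stand-in classes (SimpleSelector and NoDigitsSelector) that operate on plain feature names; the real method takes a ContingencyFeature and a SearchEngine, whose APIs are not documented on this page and are therefore not used here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a selector with an overridable discard hook.
class SimpleSelector {
    /** Base rule, assumed for illustration: keep every feature. */
    protected boolean discardFeature(String name) {
        return false;
    }

    /** Applies the discard hook to each candidate and returns the survivors. */
    final List<String> filter(List<String> candidates) {
        List<String> kept = new ArrayList<>();
        for (String name : candidates) {
            if (!discardFeature(name)) {
                kept.add(name);
            }
        }
        return kept;
    }
}

// A subclass that swaps in a different rule for dropping features.
class NoDigitsSelector extends SimpleSelector {
    @Override
    protected boolean discardFeature(String name) {
        // Discard any feature whose name contains a digit.
        return name.chars().anyMatch(Character::isDigit);
    }
}
```

Because the filtering loop calls the hook rather than hard-coding a rule, a subclass changes which features are dropped without touching the selection logic itself.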

setStopWords

public void setStopWords(StopWords stopWords)
Description copied from interface: FeatureSelector
Sets a stopword list: words that should be ignored when selecting features.

Specified by:
setStopWords in interface FeatureSelector
Parameters:
stopWords - the set of words to ignore when performing feature selection.