Package com.sun.labs.minion.classification

Provides the automatic document classification functionality in Minion.

Interface Summary
BulkClassifier An interface for classifiers that can do bulk classification.
ClassifierModel An interface for training and using classifiers.
ExplainableClassifierModel An interface for classifier models that can generate explanations indicating why particular documents were (or were not) classified into a given class.
Feature An interface for the features defined by classifiers.
FeatureCluster A cluster of features.
FeatureClusterer The Feature Clusterer provides the interface to create clusters of features.
FeatureSelector Selects terms from a given document or set of documents, relative to the collection the terms are part of.
Profiler An interface for profilers that will run after dump time for a new partition.
ResultSplitter Result Splitters split a result set into two distinct sets suitable for use in training and validation.
 

Class Summary
BalancedWinnow An implementation of the Balanced Winnow classification algorithm.
BigQuery A helper class for running a big query during classification operations.
ClassificationFeature A class that holds a feature useful when classifying documents.
ClassificationResult The result of a classification operation for a particular classifier.
ClassifierDiskPartition A disk partition that will hold classifier data.
ClassifierManager The ClassifierManager is a specialization of the PartitionManager.
ClassifierMemoryPartition A memory partition that will hold classifier data.
ClassifierPartitionFactory A factory for the partitions used by classifiers.
ClassifierScore  
ClusterDiskPartition A disk partition that will hold classifier data.
ClusterEntry An entry for the doc dictionary in the cluster partition.
ClusterManager The ClusterManager is a specialization of the PartitionManager.
ClusterMemoryPartition A memory partition that will hold classifier data.
ClusterPartitionFactory  
ClusterPostings Postings for the cluster documents in the cluster partition.
ClusterWeightComparator A comparator for weighted features that compares features based on their weight.
ContingencyFeature A weighted feature class that contains a 2x2 contingency table that can be used to calculate the Mutual Information or Chi-squared measures.
ContingencyFeatureCluster A cluster of contingency features.
ContingencyFeatureClusterer This class provides an implementation of a feature clusterer that clusters contingency features.
ContingencyFeatureSelector A feature selector that builds contingency features.
CSFeatureSelector Chi-Squared Feature Selector that is implemented by using the ContingencyFeatureSelector.
ExtraClassification A configurable container that can be used to describe a set of classification operations to perform that would not be done by the standard classification approach.
FastContingencyFeatureSelector  
FeatureClusterSet A set of feature clusters.
FeatureEntry  
FeaturePostings An implementation of Postings that we can use to store classifier features.
HumanSelected A container for human selected terms that specifies terms that must or must not occur in particular classifiers.
KeyWordProfiler A profiler class that puts documents into classes based on the presence of particular keywords.
KFoldSplitter Provides a K-fold splitter.
KnowledgeSourceClusterer Provides an implementation of a feature clusterer built around a knowledge source.
LiteMorphClusterer Provides an implementation of a feature clusterer built around the light morphology engine.
MIFeatureSelector Mutual Information Feature Selector that is implemented by using the ContingencyFeatureSelector.
MorphClusterer Provides an implementation of a feature clusterer built around the full morphology engine.
NoSplitsSplitter Result Splitters split a result set into two distinct sets suitable for use in training and validation.
QueryZone A query zone is a set of documents that are centered around a set of feature clusters.
RandomTwoThirdsSplitter Provides two thirds/one third splits of a result set by selecting documents at random to place in either set.
Rocchio A classifier model that does Rocchio-style classification.
SimpleClusterer  
SimpleFeatureCluster A feature cluster containing a single term and a weight assigned by a standard term weighting function.
SimpleFeatureSelector A class that selects the top n features from a set of documents based on the weights assigned by a term weighting function.
StemmingClusterer Provides a clusterer that groups features that have the same stems.
StrFloat A string and a float!
WeightedFeature  
WeightedFeatureCluster  
WeightedFeatureClusterer  
WeightedFeatureSelector Selects the highest weighted features.
WeightedFeatureVector A class for holding a weighted feature vector.
 

Package com.sun.labs.minion.classification Description

Provides the automatic document classification functionality in Minion.

This package contains the classification infrastructure in Minion, together with the classifier implementations built on it. Two classifiers are currently provided: Rocchio and BalancedWinnow. Training a classifier is broken down into three steps:

  1. Feature Clustering
    Feature Clustering determines which features (also called terms; usually words) should be combined and treated as a single feature. This package contains several FeatureClusterers, each implementing a different clustering strategy. Only one should be used at a time for a particular index.
  2. Feature Selection
    Feature Selection is the process of determining the top N features that best differentiate documents that belong to the class from documents that do not. Feature Selection is actually performed on Feature Clusters, although single-feature selection may be chosen by using the supertype ContingencyFeatureClusterer, which provides single-feature clusters.
  3. Training
    Training is the process of building the actual classifier. Using the selected feature clusters from the above step, one of the classification algorithms (as defined in the configuration file) is invoked on the training set to build the classifier.
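
The Feature Selection step can be illustrated with the 2x2 contingency table that ContingencyFeature is described as holding. The following is a minimal sketch using the textbook Chi-squared and Mutual Information formulas; the class ContingencySketch, its fields, and its methods are hypothetical names chosen for illustration and are not Minion's API, and Minion's own selectors may compute or normalize these measures differently.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical 2x2 contingency table for one feature and one class (not Minion's ContingencyFeature). */
final class ContingencySketch {
    final String name;
    final double a; // documents in the class that contain the feature
    final double b; // documents outside the class that contain the feature
    final double c; // documents in the class that lack the feature
    final double d; // documents outside the class that lack the feature

    ContingencySketch(String name, double a, double b, double c, double d) {
        this.name = name;
        this.a = a;
        this.b = b;
        this.c = c;
        this.d = d;
    }

    /** Textbook chi-squared statistic: N(ad - bc)^2 / ((a+b)(c+d)(a+c)(b+d)). */
    double chiSquared() {
        double n = a + b + c + d;
        double denom = (a + b) * (c + d) * (a + c) * (b + d);
        if (denom == 0) {
            return 0; // an empty row or column carries no evidence
        }
        double diff = a * d - b * c;
        return n * diff * diff / denom;
    }

    /** Textbook mutual information in bits, summed over the four cells. */
    double mutualInformation() {
        double n = a + b + c + d;
        return term(a, a + b, a + c, n) + term(b, a + b, b + d, n)
             + term(c, c + d, a + c, n) + term(d, c + d, b + d, n);
    }

    private static double term(double cell, double row, double col, double n) {
        if (cell == 0) {
            return 0; // the limit of x log x as x approaches 0 is 0
        }
        return (cell / n) * (Math.log(n * cell / (row * col)) / Math.log(2));
    }

    /** Returns the n features with the highest chi-squared score. */
    static List<ContingencySketch> topN(List<ContingencySketch> features, int n) {
        List<ContingencySketch> sorted = new ArrayList<>(features);
        sorted.sort(Comparator.comparingDouble(ContingencySketch::chiSquared).reversed());
        return sorted.subList(0, Math.min(n, sorted.size()));
    }
}
```

A feature that appears in every in-class document and in no other document scores highest on both measures, while a feature spread evenly across both groups scores zero, which is why common words tend to be dropped by this kind of selection.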
The above process may be repeated many times if cross-fold validation and "feature backoff" are enabled in the configuration file.
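
To show why cross-fold validation repeats the training process, here is a minimal sketch of a K-fold split: each of the K rounds holds out a different slice of the documents for validation and trains on the rest. KFoldSketch and its split method are hypothetical names for illustration only; the source does not say how KFoldSplitter assigns documents to folds, so the round-robin assignment below is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical K-fold splitter sketch (not Minion's KFoldSplitter). */
final class KFoldSketch {

    /**
     * Splits docs for one round of K-fold validation.
     * Item i is held out for validation when i % k == fold (assumed round-robin assignment).
     * Returns a two-element list: index 0 is the training set, index 1 the validation set.
     */
    static <T> List<List<T>> split(List<T> docs, int k, int fold) {
        if (k < 2 || fold < 0 || fold >= k) {
            throw new IllegalArgumentException("need k >= 2 and 0 <= fold < k");
        }
        List<T> training = new ArrayList<>();
        List<T> validation = new ArrayList<>();
        for (int i = 0; i < docs.size(); i++) {
            if (i % k == fold) {
                validation.add(docs.get(i));
            } else {
                training.add(docs.get(i));
            }
        }
        List<List<T>> result = new ArrayList<>();
        result.add(training);
        result.add(validation);
        return result;
    }
}
```

Calling split once for each fold from 0 to k - 1 yields k training runs, and every document is used for validation exactly once.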

Once classifiers are trained, they are automatically evaluated against new documents as those documents are indexed to disk. Classifiers cannot be run against documents that have already been indexed. If classifiers are added or changed, the documents to be classified should be re-indexed.
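
As an illustration of what evaluating a trained classifier against a document involves, the following sketch shows textbook Rocchio-style classification: a prototype vector is built from the training documents, and a document is assigned to the class when its dot product with the prototype reaches a threshold. RocchioSketch, the beta and gamma weights, and the threshold are assumptions made for this example; the source does not describe how Minion's Rocchio class weights documents or chooses its threshold.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical Rocchio-style classifier sketch (not Minion's Rocchio class). */
final class RocchioSketch {
    private final Map<String, Double> prototype = new HashMap<>();
    private final double threshold;

    /**
     * Builds the prototype as beta * mean(positives) - gamma * mean(negatives).
     * Each document is a map from feature name to feature weight.
     */
    RocchioSketch(List<Map<String, Double>> positives,
                  List<Map<String, Double>> negatives,
                  double beta, double gamma, double threshold) {
        this.threshold = threshold;
        accumulate(positives, beta);
        accumulate(negatives, -gamma);
    }

    /** Adds scale times the mean of the given document vectors to the prototype. */
    private void accumulate(List<Map<String, Double>> docs, double scale) {
        if (docs.isEmpty()) {
            return;
        }
        double perDoc = scale / docs.size();
        for (Map<String, Double> doc : docs) {
            for (Map.Entry<String, Double> e : doc.entrySet()) {
                prototype.merge(e.getKey(), perDoc * e.getValue(), Double::sum);
            }
        }
    }

    /** Dot product between the document vector and the prototype. */
    double score(Map<String, Double> doc) {
        double dot = 0;
        for (Map.Entry<String, Double> e : doc.entrySet()) {
            dot += e.getValue() * prototype.getOrDefault(e.getKey(), 0.0);
        }
        return dot;
    }

    /** A document is placed in the class when its score reaches the threshold. */
    boolean classify(Map<String, Double> doc) {
        return score(doc) >= threshold;
    }
}
```

Because the score depends only on the document's own feature weights and the stored prototype, this kind of classifier can be applied to each document independently as it arrives.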

Following the "Everything Is Dictionaries And Postings" mantra, the classification package defines two new partition types that are used for storing classifiers and feature clusters. The infrastructure for these classes is included in this package. The ClassifierManager handles the partitions used for storing trained classifiers, and the ClusterManager handles the partitions used for storing generated feature clusters.