com.sun.labs.minion.classification (Minion Search Engine)

Package com.sun.labs.minion.classification

Provides the automatic document classification functionality in Minion.

Interface Summary
BulkClassifier	An interface for classifiers that can do bulk classification.
ClassifierModel	An interface for training and using classifiers.
ExplainableClassifierModel	An interface for classifier models that can generate explanations indicating why particular documents were (or were not) classified into a given class.
Feature	An interface for the features defined by classifiers.
FeatureCluster	A cluster of features.
FeatureClusterer	The Feature Clusterer provides the interface to create clusters of features.
FeatureSelector	Selects terms from a given document or set of documents, relative to the collection the terms are part of.
Profiler	An interface for profilers that will run after dump time for a new partition.
ResultSplitter	Result Splitters split a result set into two distinct sets suitable for use in training and validation.

Class Summary
BalancedWinnow	An implementation of the Balanced Winnow classification algorithm.
BigQuery	A helper class for running a big query during classification operations.
ClassificationFeature	A class that holds a feature useful when classifying documents.
ClassificationResult	The result of a classification operation for a particular classifier.
ClassifierDiskPartition	A disk partition that will hold classifier data.
ClassifierManager	The ClassifierManager is a specialization of the PartitionManager.
ClassifierMemoryPartition	A memory partition that will hold classifier data.
ClassifierPartitionFactory	A factory for the partitions used by classifiers.
ClassifierScore
ClusterDiskPartition	A disk partition that will hold classifier data.
ClusterEntry	An entry for the doc dictionary in the cluster partition.
ClusterManager	The ClusterManager is a specialization of the PartitionManager.
ClusterMemoryPartition	A memory partition that will hold classifier data.
ClusterPartitionFactory
ClusterPostings	Postings for the cluster documents in the cluster partition.
ClusterWeightComparator	A comparator for weighted features that compares features based on their weight.
ContingencyFeature	A weighted feature class that contains a 2x2 contingency table that can be used to calculate the Mutual Information or Chi-squared measures.
ContingencyFeatureCluster	A cluster of contingency features.
ContingencyFeatureClusterer	This class provides an implementation of a feature clusterer that clusters contingency features.
ContingencyFeatureSelector	A feature selector that builds contingency features.
CSFeatureSelector	Chi-Squared Feature Selector that is implemented by using the ContingencyFeatureSelector.
ExtraClassification	A configurable container that can be used to describe a set of classification operations to perform that would not be done by the standard classification approach.
FastContingencyFeatureSelector
FeatureClusterSet	A set of feature clusters.
FeatureEntry
FeaturePostings	An implementation of Postings that we can use to store classifier features.
HumanSelected	A container for human selected terms that specifies terms that must or must not occur in particular classifiers.
KeyWordProfiler	A profiler class that puts documents into classes based on the presence of particular keywords.
KFoldSplitter	Provides a K-fold splitter.
KnowledgeSourceClusterer	Provides an implementation of a feature clusterer built around a knowledge source.
LiteMorphClusterer	Provides an implementation of a feature clusterer built around the light morphology engine.
MIFeatureSelector	Mutual Information Feature Selector that is implemented by using the ContingencyFeatureSelector.
MorphClusterer	Provides an implementation of a feature clusterer built around the full morphology engine.
NoSplitsSplitter	Result Splitters split a result set into two distinct sets suitable for use in training and validation.
QueryZone	A query zone is a set of documents that are centered around a set of feature clusters.
RandomTwoThirdsSplitter	Provides two thirds/one third splits of a result set by selecting documents at random to place in either set.
Rocchio	A classifier model that does Rocchio-style classification.
SimpleClusterer
SimpleFeatureCluster	A feature cluster containing a single term and a weight assigned by a standard term weighting function.
SimpleFeatureSelector	A class that selects the top n features from a set of documents based on the weights assigned by a term weighting function.
StemmingClusterer	Provides a clusterer that groups features that have the same stems.
StrFloat	A string and a float!
WeightedFeature
WeightedFeatureCluster
WeightedFeatureClusterer
WeightedFeatureSelector	Selects the highest weighted features.
WeightedFeatureVector	A class for holding a weighted feature vector.
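
The ContingencyFeature entry above refers to a 2x2 contingency table from which the Mutual Information or Chi-squared measures can be calculated. As a minimal sketch of those two textbook formulas (the class name ContingencySketch, the method names, and the cell layout a, b, c, d are hypothetical and are not taken from the Minion API), the calculation might look like this:

```java
// Hypothetical illustration only; this is not Minion's ContingencyFeature.
// Cell layout assumed here:
//   a = in-class documents that contain the feature
//   b = out-of-class documents that contain the feature
//   c = in-class documents that do not contain the feature
//   d = out-of-class documents that do not contain the feature
final class ContingencySketch {

    /** Pearson chi-squared statistic for a 2x2 table. */
    static double chiSquared(long a, long b, long c, long d) {
        double n = a + b + c + d;
        double denom = (double) (a + b) * (c + d) * (a + c) * (b + d);
        if (denom == 0) {
            return 0.0; // a row or column is empty, so the feature carries no signal
        }
        double diff = (double) a * d - (double) b * c;
        return n * diff * diff / denom;
    }

    /** Mutual information (in bits) between feature presence and class membership. */
    static double mutualInformation(long a, long b, long c, long d) {
        double n = a + b + c + d;
        if (n == 0) {
            return 0.0;
        }
        return term(a, a + b, a + c, n)
             + term(b, a + b, b + d, n)
             + term(c, c + d, a + c, n)
             + term(d, c + d, b + d, n);
    }

    // One cell's contribution: p(x,y) * log2( p(x,y) / (p(x) * p(y)) ).
    private static double term(double joint, double row, double col, double n) {
        if (joint == 0) {
            return 0.0; // the limit of p * log(p) as p approaches 0
        }
        return (joint / n) * (Math.log(joint * n / (row * col)) / Math.log(2));
    }
}
```

A feature that appears in every in-class document and in no out-of-class document scores high on both measures, while a feature spread evenly across both groups scores zero.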

Package com.sun.labs.minion.classification Description

Provides the automatic document classification functionality in Minion.

This package implements the classification infrastructure in Minion, along with the classifiers built on that infrastructure. Two classifiers are currently provided: Rocchio and BalancedWinnow. Training a classifier is broken down into several steps:

Feature Clustering
Feature Clustering determines which features (that is, terms, which are usually words) should be combined and treated as a single feature. This package contains several FeatureClusterers, each implementing a different clustering strategy. Only one should be used at a time for a particular index.
Feature Selection
Feature Selection is the process of determining the top N features that can be used to best differentiate documents that will be in the class from documents that don't fit into the class. Feature Selection is actually performed on Feature Clusters, although single-feature based selection may be chosen by using the supertype ContingencyFeatureClusterer, which provides single-feature clusters.
Training
Training is the process of building the actual classifier. Using the selected feature clusters from the above step, one of the classification algorithms (as defined in the configuration file) is invoked on the training set to build the classifier.
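
To make the training step more concrete, here is a minimal sketch of textbook Rocchio-style training over sparse feature-weight vectors. It is an illustration under stated assumptions, not the implementation in this package's Rocchio class: the class name RocchioSketch, the methods, and the beta and gamma parameters are all hypothetical, and documents are assumed to arrive as maps from selected feature name to weight.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; this is not Minion's Rocchio class.
final class RocchioSketch {

    /** Averages a list of sparse feature vectors (feature name -> weight). */
    static Map<String, Double> centroid(List<Map<String, Double>> docs) {
        Map<String, Double> sum = new HashMap<>();
        for (Map<String, Double> doc : docs) {
            for (Map.Entry<String, Double> e : doc.entrySet()) {
                sum.merge(e.getKey(), e.getValue(), Double::sum);
            }
        }
        if (!docs.isEmpty()) {
            sum.replaceAll((feature, total) -> total / docs.size());
        }
        return sum;
    }

    /** Builds a prototype: beta * centroid(positive) - gamma * centroid(negative). */
    static Map<String, Double> train(List<Map<String, Double>> positive,
                                     List<Map<String, Double>> negative,
                                     double beta, double gamma) {
        Map<String, Double> prototype = new HashMap<>();
        centroid(positive).forEach((f, w) -> prototype.merge(f, beta * w, Double::sum));
        centroid(negative).forEach((f, w) -> prototype.merge(f, -gamma * w, Double::sum));
        return prototype;
    }

    /** Dot product of a document with the prototype; higher means a better fit. */
    static double score(Map<String, Double> prototype, Map<String, Double> doc) {
        double dot = 0.0;
        for (Map.Entry<String, Double> e : doc.entrySet()) {
            dot += e.getValue() * prototype.getOrDefault(e.getKey(), 0.0);
        }
        return dot;
    }
}
```

In this sketch a document would be assigned to the class when its score exceeds some threshold. How that threshold is chosen is left open here.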

The three steps above may be repeated many times if cross-fold validation and "feature backoff" are enabled in the configuration file.
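
As a rough illustration of the cross-fold validation mentioned above, the sketch below partitions a set of items into k folds so that each fold can serve once as the validation set while the remaining folds form the training set. The class KFoldSketch and its methods are hypothetical and do not reflect the API of KFoldSplitter or the ResultSplitter interface. The fixed random seed is an assumption made so the split is reproducible.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical illustration only; this is not Minion's KFoldSplitter.
final class KFoldSketch {

    /** Shuffles the items and deals them round-robin into k folds. */
    static <T> List<List<T>> folds(List<T> items, int k, long seed) {
        if (k < 2) {
            throw new IllegalArgumentException("k must be at least 2");
        }
        List<T> shuffled = new ArrayList<>(items);
        Collections.shuffle(shuffled, new Random(seed));
        List<List<T>> folds = new ArrayList<>();
        for (int i = 0; i < k; i++) {
            folds.add(new ArrayList<>());
        }
        for (int i = 0; i < shuffled.size(); i++) {
            folds.get(i % k).add(shuffled.get(i));
        }
        return folds;
    }

    /** Training set for one round: every fold except the held-out one. */
    static <T> List<T> trainingSet(List<List<T>> folds, int heldOut) {
        List<T> train = new ArrayList<>();
        for (int i = 0; i < folds.size(); i++) {
            if (i != heldOut) {
                train.addAll(folds.get(i));
            }
        }
        return train;
    }
}
```

Each round would then train on trainingSet(folds, i) and validate on folds.get(i), for i from 0 to k - 1.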

Once classifiers are trained, they are automatically evaluated across sets of documents as future documents are indexed to disk. Classifiers cannot be run against documents that have already been indexed. If classifiers are added or changed, the documents to be classified should be re-indexed.

Following the "Everything Is Dictionaries And Postings" mantra, the classification package defines two new partition types that are used for storing classifiers and feature clusters. The infrastructure for these classes is included in this package. The ClassifierManager handles the partitions used for storing trained classifiers, and the ClusterManager handles the partitions used for storing generated feature clusters.
