com.sun.labs.minion.classification
Class ClassifierDiskPartition

java.lang.Object
  extended by com.sun.labs.minion.indexer.partition.Partition
      extended by com.sun.labs.minion.indexer.partition.DiskPartition
          extended by com.sun.labs.minion.classification.ClassifierDiskPartition
All Implemented Interfaces:
Closeable, com.sun.labs.util.props.Component, com.sun.labs.util.props.Configurable, java.lang.Comparable<Partition>

public class ClassifierDiskPartition
extends DiskPartition

A disk partition that holds classifier data.


Field Summary
protected  ClassifierModel[] allModels
           
protected  long dataStart
          The place where the model specific data starts in the file.
protected  java.util.Map<java.lang.String,ClassificationFeature> features
          Things to fix after the open house: the main dictionary in the classifiers doesn't store the feature scores for the documents (i.e., the classifiers), so we can't do bulk evaluation without inverting the document vectors.
protected static java.lang.String logTag
           
protected  ClassifierModel modelInstance
           
protected  java.util.Map<java.lang.String,ClassifierModel> modelMap
           
protected  java.io.RandomAccessFile msd
          The file containing the model specific data for this partition.
protected  ReadableBuffer msdOff
          A buffer containing the offsets for the model specific data for each of our classifiers.
protected  int nModels
          The number of models that we're storing.
 
Fields inherited from class com.sun.labs.minion.indexer.partition.DiskPartition
BUFF_SIZE, deletions, delFile, delFileLock, docDict, docDictFile, docPostFile, documentDictFactory, dvl, ignored, mainDict, mainFiles, MATCH_CUT_OFF, MIN_LEN, removedFile, termCache
 
Fields inherited from class com.sun.labs.minion.indexer.partition.Partition
DICT_OFFSETS_SIZE, docDictFactory, entryClass, entryName, indexConfig, mainDictFactory, mainDictFile, mainPostFiles, manager, maxID, nEntries, partNumber, PROP_DOC_DICT_FACTORY, PROP_INDEX_CONFIG, PROP_MAIN_DICT_FACTORY, PROP_PARTITION_MANAGER, stats
 
Constructor Summary
ClassifierDiskPartition(java.lang.Integer partNum, ClassifierManager manager, DictionaryFactory mainDictFactory, DictionaryFactory documentDictFactory)
          Constructs a disk partition for a specific partition number.
 
Method Summary
 int assembleResults(float[] scores, java.lang.String modelName, java.lang.String resultField, java.util.Map<java.lang.String,ClassificationResult> results)
           
 void classify(DiskPartition sdp, ExtraClassification ec, java.util.Map<java.lang.String,ClassificationResult> results)
          Classifies all the documents in a disk partition.
 boolean close()
          Closes the files associated with this partition.
 void findSimilar(ClassifierModel cm, java.util.Map<java.lang.String,java.lang.Float> scores)
           
protected  ClassifierModel[] getAllModels()
           
protected  ClassifierModel getClassifier(FeatureEntry fe)
          Gets a classifier model from an entry in our document dictionary.
protected  ClassifierModel getClassifier(java.lang.String cname)
           
 float getDocumentVectorLength(int docID)
          Gets the length of a document vector for a given document.
 java.util.Set getFeatures(java.lang.String cname)
           
protected  java.util.Map<java.lang.String,ClassificationFeature> invert()
           
protected  java.util.Set makeFeatures(FeatureEntry entry)
           
protected  void mergeCustom(int newPartNumber, DiskPartition[] sortedParts, int[][] idMaps, int newMaxDocID, int[] docIDStart, int[] nUndel, int[][] docIDMaps)
          Merges the model specific data for these classifiers.
protected static void reap(PartitionManager m, int n)
          Reaps the given classifier partition.
 
Methods inherited from class com.sun.labs.minion.indexer.partition.DiskPartition
close, createRemoveFile, delete, deleteDocument, deleteDocument, docsAreMerged, getAverageDocumentLength, getCloseTime, getDeletedDocumentsMap, getDelMap, getDocIDMap, getDocumentIterator, getDocumentIterator, getDocumentLength, getDocumentTerm, getDocumentTerm, getDocumentVectorLength, getDocumentVectorLength, getDVL, getInputBuffers, getMainDictionary, getMainDictionaryIterator, getMainDictionaryIterator, getMainIterator, getMaxDocumentID, getMaxTermID, getNDocs, getNEntries, getNTokens, getTerm, getTerm, getTerm, getTerm, getTermCache, initAll, initDocDict, initDVL, initMainDict, initMainFiles, isDeleted, isIndexed, merge, merge, normalize, setCloseTime, syncDeletedMap, toString, updatePartition
 
Methods inherited from class com.sun.labs.minion.indexer.partition.Partition
compareTo, getAllFiles, getAllFiles, getDocFiles, getDocFiles, getIndexConfig, getMainFiles, getMainFiles, getManager, getName, getNumPostingsChannels, getPartitionNumber, getQueryConfig, getStats, newProperties
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Field Detail

msd

protected java.io.RandomAccessFile msd
The file containing the model specific data for this partition.


msdOff

protected ReadableBuffer msdOff
A buffer containing the offsets for the model specific data for each of our classifiers.


nModels

protected int nModels
The number of models that we're storing.


dataStart

protected long dataStart
The place where the model specific data starts in the file.
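
The fields above describe an offset table (msdOff) plus a base position (dataStart) for locating each model's data in a random-access file. The following is a minimal sketch of that general pattern only. The file layout, the class name ModelDataSketch, and the length-prefixed records are all assumptions made for illustration; they are not Minion's actual on-disk format.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical layout, NOT Minion's actual format:
// [int nModels][long offset x nModels][records...]
// where each record is an int length followed by UTF-8 bytes.
final class ModelDataSketch {

    /** Writes one record per payload and returns the temporary file. */
    static File write(String... payloads) {
        try {
            File f = File.createTempFile("msd-sketch", ".bin");
            f.deleteOnExit();
            try (RandomAccessFile raf = new RandomAccessFile(f, "rw")) {
                raf.writeInt(payloads.length);
                long tableStart = raf.getFilePointer();
                for (int i = 0; i < payloads.length; i++) {
                    raf.writeLong(0L); // reserve space for the offset table
                }
                long dataStart = raf.getFilePointer();
                long[] offsets = new long[payloads.length];
                for (int i = 0; i < payloads.length; i++) {
                    // Offsets are stored relative to dataStart.
                    offsets[i] = raf.getFilePointer() - dataStart;
                    byte[] bytes = payloads[i].getBytes(StandardCharsets.UTF_8);
                    raf.writeInt(bytes.length);
                    raf.write(bytes);
                }
                // Go back and fill in the real offsets.
                raf.seek(tableStart);
                for (long offset : offsets) {
                    raf.writeLong(offset);
                }
            }
            return f;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads the record for one model by looking up its offset first. */
    static String read(File f, int model) {
        try (RandomAccessFile raf = new RandomAccessFile(f, "r")) {
            int nModels = raf.readInt();
            if (model < 0 || model >= nModels) {
                throw new IndexOutOfBoundsException("model " + model);
            }
            long tableStart = raf.getFilePointer();
            long dataStart = tableStart + (long) Long.BYTES * nModels;
            raf.seek(tableStart + (long) Long.BYTES * model);
            long offset = raf.readLong();
            raf.seek(dataStart + offset);
            byte[] bytes = new byte[raf.readInt()];
            raf.readFully(bytes);
            return new String(bytes, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Keeping the offsets relative to a single base position means the table stays valid if a header of a different size is written before the data region.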


logTag

protected static java.lang.String logTag

modelInstance

protected ClassifierModel modelInstance

features

protected java.util.Map<java.lang.String,ClassificationFeature> features
Things to fix after the open house: the main dictionary in the classifiers doesn't store the feature scores for the documents (i.e., the classifiers), so we can't do bulk evaluation without inverting the document vectors. We do that inversion once and keep the result here.
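
To illustrate what inverting the document vectors means here, the sketch below turns a map from classifier name to (feature, weight) pairs into a map from feature name to (classifier, weight) pairs. It is an assumption-laden illustration: the class InvertSketch and the plain nested maps are hypothetical stand-ins, and the real field stores ClassificationFeature values rather than nested maps.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for the inversion; not the actual invert() implementation.
final class InvertSketch {

    /**
     * Inverts classifier -> (feature -> weight) into
     * feature -> (classifier -> weight).
     */
    static Map<String, Map<String, Float>> invert(
            Map<String, Map<String, Float>> byClassifier) {
        Map<String, Map<String, Float>> byFeature = new TreeMap<>();
        for (Map.Entry<String, Map<String, Float>> c : byClassifier.entrySet()) {
            for (Map.Entry<String, Float> f : c.getValue().entrySet()) {
                byFeature
                        .computeIfAbsent(f.getKey(), k -> new TreeMap<>())
                        .put(c.getKey(), f.getValue());
            }
        }
        return byFeature;
    }
}
```

With the inverted form, every classifier that uses a given feature can be found in one lookup, which is what makes evaluating many classifiers in bulk practical.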


allModels

protected ClassifierModel[] allModels

modelMap

protected java.util.Map<java.lang.String,ClassifierModel> modelMap

Constructor Detail

ClassifierDiskPartition

public ClassifierDiskPartition(java.lang.Integer partNum,
                               ClassifierManager manager,
                               DictionaryFactory mainDictFactory,
                               DictionaryFactory documentDictFactory)
                        throws java.io.IOException
Constructs a disk partition for a specific partition number.

Parameters:
partNum - the number of this partition
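manager - the classifier manager for this partition
mainDictFactory - the dictionary factory for the main dictionary
documentDictFactory - the dictionary factory for the document dictionary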
Throws:
java.io.IOException

Method Detail

getClassifier

protected ClassifierModel getClassifier(java.lang.String cname)

findSimilar

public void findSimilar(ClassifierModel cm,
                        java.util.Map<java.lang.String,java.lang.Float> scores)

getAllModels

protected ClassifierModel[] getAllModels()

invert

protected java.util.Map<java.lang.String,ClassificationFeature> invert()

getClassifier

protected ClassifierModel getClassifier(FeatureEntry fe)
Gets a classifier model from an entry in our document dictionary.


classify

public void classify(DiskPartition sdp,
                     ExtraClassification ec,
                     java.util.Map<java.lang.String,ClassificationResult> results)
Classifies all the documents in a disk partition. Uses the classifier model that is defined by the index/query configuration. The result is an array of collections of strings. Each position in the array corresponds to the document whose ID equals that position. The collection contains the names of the classes to which the document belongs. If a position is null, the document belongs to none of the classes defined in this partition.

Parameters:
sdp - a disk partition
ec - a (possibly null) pair of field names. One is the name of the field from which classifiers were built. If this pair is non-null, then only classifiers that were built from the contents of the pair's classifier-from field will be considered, and whatever classifiers are applied will be applied against the contents of the pair's document-from field. If this pair is null, then classification proceeds as usual.
results - a map to fill up with classification results
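
The result shape described above (one collection of class names per document ID, with null for documents that belong to no class) can be sketched as follows. This is an illustration under stated assumptions: the class ClassifyResultSketch, the per-class score arrays, and the fixed threshold are all hypothetical, and the sketch uses a list where the description says array. How Minion actually decides class membership is not specified here.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical illustration of the per-document result shape only.
final class ClassifyResultSketch {

    /**
     * Builds one collection of class names per document position.
     * A position stays null when no class score reaches the threshold.
     * The threshold rule is an assumption made for this sketch.
     */
    static List<Collection<String>> assign(
            Map<String, float[]> scoresByClass, int nDocs, float threshold) {
        List<Collection<String>> perDoc = new ArrayList<>();
        for (int i = 0; i < nDocs; i++) {
            perDoc.add(null);
        }
        for (Map.Entry<String, float[]> e : scoresByClass.entrySet()) {
            float[] scores = e.getValue();
            for (int doc = 0; doc < nDocs && doc < scores.length; doc++) {
                if (scores[doc] >= threshold) {
                    Collection<String> classes = perDoc.get(doc);
                    if (classes == null) {
                        classes = new TreeSet<>();
                        perDoc.set(doc, classes);
                    }
                    classes.add(e.getKey());
                }
            }
        }
        return perDoc;
    }
}
```

Callers of a structure like this need a null check per position before iterating over a document's classes.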

assembleResults

public int assembleResults(float[] scores,
                           java.lang.String modelName,
                           java.lang.String resultField,
                           java.util.Map<java.lang.String,ClassificationResult> results)

getFeatures

public java.util.Set getFeatures(java.lang.String cname)

makeFeatures

protected java.util.Set makeFeatures(FeatureEntry entry)

getDocumentVectorLength

public float getDocumentVectorLength(int docID)
Gets the length of a document vector for a given document. For classifier partitions, this is assumed to always be 1.

Overrides:
getDocumentVectorLength in class DiskPartition
Parameters:
docID - the ID of the document whose vector length we want
Returns:
1.

mergeCustom

protected void mergeCustom(int newPartNumber,
                           DiskPartition[] sortedParts,
                           int[][] idMaps,
                           int newMaxDocID,
                           int[] docIDStart,
                           int[] nUndel,
                           int[][] docIDMaps)
                    throws java.lang.Exception
Merges the model specific data for these classifiers.

Overrides:
mergeCustom in class DiskPartition
Parameters:
newPartNumber - the number of the new partition
sortedParts - the sorted list of partitions
idMaps - a set of maps from old entry ids in the main dictionary to new entry ids in the merged dictionary
newMaxDocID - the new maximum document id
docIDStart - the starting doc ids
nUndel - the number of undeleted documents in each partition
docIDMaps - doc id maps (see merge)
Throws:
java.lang.Exception
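
The idMaps parameter maps old entry IDs in each partition's main dictionary to new entry IDs in the merged dictionary. The sketch below shows one way such a map could be applied during a merge. It assumes that idMaps[p][oldId] holds the new ID for entry oldId of partition p; that indexing convention and the class IdMapSketch are assumptions for illustration, not confirmed details of mergeCustom.

```java
import java.util.TreeSet;

// Hypothetical illustration of applying per-partition ID maps during a merge.
final class IdMapSketch {

    /**
     * Remaps the entry IDs referenced in each partition through that
     * partition's ID map and returns the distinct new IDs in ascending order.
     * Assumes idMaps[p][oldId] is the new ID (an assumption of this sketch).
     */
    static int[] mergeIds(int[][] idsPerPartition, int[][] idMaps) {
        TreeSet<Integer> merged = new TreeSet<>();
        for (int p = 0; p < idsPerPartition.length; p++) {
            for (int oldId : idsPerPartition[p]) {
                merged.add(idMaps[p][oldId]);
            }
        }
        int[] out = new int[merged.size()];
        int i = 0;
        for (int id : merged) {
            out[i++] = id;
        }
        return out;
    }
}
```

In this example two partitions that each reference old IDs 1 and 2 end up sharing new ID 3, because both maps send their old ID 2 to the same merged entry.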

close

public boolean close()
Closes the files associated with this partition.

Overrides:
close in class DiskPartition
Returns:
true if the files were successfully closed.

reap

protected static void reap(PartitionManager m,
                           int n)
Reaps the given classifier partition.

Parameters:
m - the manager associated with the partition
n - the partition number to reap