com.sun.labs.minion.indexer.dictionary
Class FeatureVector

java.lang.Object
  extended by com.sun.labs.minion.indexer.dictionary.FeatureVector
All Implemented Interfaces:
SavedField, java.lang.Comparable

public class FeatureVector
extends java.lang.Object
implements SavedField

A class that can be used to save feature vectors in an index. A feature vector is simply an array of doubles representing the features. The width of the feature vectors is determined by the first vector that is indexed. If a subsequent vector has a different width, a warning is issued.

Currently, this class will only store one feature vector per document.
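
The storage scheme described above, together with the features, idToFeat, pos and width fields listed below, can be pictured with the following minimal sketch. It is not the actual Minion implementation: the class name FeatureStoreSketch, the growth policy, and the choice to skip a vector whose width does not match are all assumptions made for illustration. The documentation only states that a warning is issued on a width mismatch.

```java
import java.util.Arrays;

// Hypothetical stand-in for the indexing-time storage; not the Minion class.
final class FeatureStoreSketch {
    private double[] features = new double[16]; // all vectors stored back to back
    private int[] idToFeat = new int[4];        // docID -> offset into features, -1 if none
    private int pos = 0;                        // next free position in features
    private int width = -1;                     // fixed by the first vector added

    FeatureStoreSketch() {
        Arrays.fill(idToFeat, -1);
    }

    // Returns true if the vector was stored. Assumption: a vector whose
    // width differs from the first one is skipped after the warning.
    boolean add(int docID, double[] vec) {
        if (width < 0) {
            width = vec.length;
        } else if (vec.length != width) {
            System.err.println("warning: expected width " + width + ", got " + vec.length);
            return false;
        }
        if (docID >= idToFeat.length) {
            int old = idToFeat.length;
            idToFeat = Arrays.copyOf(idToFeat, Math.max(docID + 1, old * 2));
            Arrays.fill(idToFeat, old, idToFeat.length, -1);
        }
        if (pos + width > features.length) {
            features = Arrays.copyOf(features, Math.max(pos + width, features.length * 2));
        }
        System.arraycopy(vec, 0, features, pos, width);
        idToFeat[docID] = pos; // one vector per document: the latest add wins
        pos += width;
        return true;
    }

    // Returns a copy of the stored vector, or null if the document has none.
    double[] get(int docID) {
        if (docID < 0 || docID >= idToFeat.length || idToFeat[docID] < 0) {
            return null;
        }
        int start = idToFeat[docID];
        return Arrays.copyOfRange(features, start, start + width);
    }

    public static void main(String[] args) {
        FeatureStoreSketch store = new FeatureStoreSketch();
        store.add(1, new double[] {1.0, 2.0});
        store.add(2, new double[] {1.0});      // width mismatch: warning, not stored
        System.out.println(Arrays.toString(store.get(1)));
        System.out.println(store.get(2));
    }
}
```

Keeping every vector in one flat array with a per-document offset avoids one array object per document, which is presumably why the class exposes a single features array and a pos cursor.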


Field Summary
protected  double[] features
          The features stored during indexing.
protected  FieldInfo fi
          The information for this field.
protected  int[] idToFeat
          A map from document IDs to the indices where feature vectors can be found in the stored features.
protected static java.lang.String logTag
           
protected  int pos
          The current position in the features array.
protected  int width
          The width of the feature vectors that we're storing.
 
Constructor Summary
FeatureVector(FieldInfo fi)
          Creates a FeatureVector that can be used to store data at indexing time.
FeatureVector(FieldInfo field, java.io.RandomAccessFile dictFile, java.io.RandomAccessFile[] postFiles, DiskPartition part)
          Constructs a feature vector field that will be used to retrieve data during querying.
 
Method Summary
 void add(int docID, java.lang.Object data)
          Adds data to this saved field.
 long bytesInUse()
           
 void clear()
          Clears a saved field, if it's open for indexing.
 int compareTo(java.lang.Object o)
           
 double distance(int id1, FeatureVector v, int id2)
          Gets the distance between two feature vectors stored in different partitions.
 double distance(int d1, int d2)
           
 void dump(java.lang.String path, java.io.RandomAccessFile dictFile, PostingsOutput[] postOut, int maxID)
          Dumps our saved data to the file.
 double[] euclideanDistance(double[] vec)
          Computes the Euclidean distance from the given feature vector to all documents.
 double euclideanDistance(double[] vec, int docID)
          Computes the Euclidean distance of the given feature vector to the vector for the given ID.
 double[] euclideanDistance(int docID)
          Computes the Euclidean distance from the given document to all other documents.
 QueryEntry get(java.lang.Object v, boolean caseSensitive)
          Unsupported operation.
 java.lang.Object getDefault()
          Gets the default value for a feature vector, which is null.
 FieldInfo getField()
          Get the field info object for this field.
 java.lang.Object getSavedData(int docID, boolean all)
          Gets the data saved for a particular document ID.
 ArrayGroup getUndefined(ArrayGroup ag)
          Gets a group of all the documents that do not have any values saved for this field.
 DictionaryIterator iterator(java.lang.Object lowerBound, boolean includeLower, java.lang.Object upperBound, boolean includeUpper)
          Gets an iterator for the values in this field.
 void merge(java.lang.String path, SavedField[] fields, int maxID, int[] starts, int[] nUndel, int[][] docIDMaps, java.io.RandomAccessFile dictFile, PostingsOutput postOut)
          Merges a number of saved fields.
 int size()
          Gets the number of saved items that we're storing.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

fi

protected FieldInfo fi
The information for this field.


idToFeat

protected int[] idToFeat
A map from document IDs to the indices where feature vectors can be found in the stored features.


features

protected double[] features
The features stored during indexing.


pos

protected int pos
The current position in the features array.


width

protected int width
The width of the feature vectors that we're storing.


logTag

protected static java.lang.String logTag
Constructor Detail

FeatureVector

public FeatureVector(FieldInfo fi)
Creates a FeatureVector that can be used to store data at indexing time.


FeatureVector

public FeatureVector(FieldInfo field,
                     java.io.RandomAccessFile dictFile,
                     java.io.RandomAccessFile[] postFiles,
                     DiskPartition part)
              throws java.io.IOException
Constructs a feature vector field that will be used to retrieve data during querying.

Parameters:
field - The FieldInfo for this saved field.
dictFile - The file containing the dictionary for this field.
postFiles - The files containing the postings for this field.
part - The disk partition that this field is associated with.
Throws:
java.io.IOException - if there is any error loading the field data.
Method Detail

add

public void add(int docID,
                java.lang.Object data)
Adds data to this saved field. Assumes that data is an array of double.
Specified by:
add in interface SavedField
Parameters:
docID - the document ID for the data we're adding.
data - the data to add. We assume that this is an array of double.
Throws:
java.lang.ClassCastException - if data is not an array of double.
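
Because add takes a java.lang.Object, the "array of double" requirement is only checked at run time. The sketch below shows the kind of cast that produces the documented ClassCastException. The helper name asFeatureVector is hypothetical, and whether the real method performs exactly this cast is an assumption.

```java
// Hypothetical helper illustrating the run-time cast; not part of Minion.
final class AddCastSketch {
    // Succeeds only for a primitive double[]; float[], Double[] or a
    // List<Double> all fail with ClassCastException.
    static double[] asFeatureVector(Object data) {
        return (double[]) data;
    }

    public static void main(String[] args) {
        double[] ok = asFeatureVector(new double[] {0.5, 1.5});
        System.out.println(ok.length);
        try {
            asFeatureVector(new float[] {0.5f, 1.5f});
        } catch (ClassCastException e) {
            System.out.println("not a double[]");
        }
    }
}
```

Callers holding boxed or float data would need to convert it to a primitive double[] before handing it to add.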

dump

public void dump(java.lang.String path,
                 java.io.RandomAccessFile dictFile,
                 PostingsOutput[] postOut,
                 int maxID)
          throws java.io.IOException
Dumps our saved data to the file. We won't actually store anything in the postings file, we'll just dump everything to the dictionary file.

Specified by:
dump in interface SavedField
Parameters:
path - The path of the index directory.
dictFile - The file where the dictionary will be written.
postOut - A place to write the postings associated with the values.
maxID - The maximum document ID for this partition.
Throws:
java.io.IOException - if there is an error during the writing.

get

public QueryEntry get(java.lang.Object v,
                      boolean caseSensitive)
Unsupported operation.

Specified by:
get in interface SavedField
Parameters:
v - The value to get.
caseSensitive - If true, case should be taken into account when iterating through the values. This value will only be observed for character fields!
Returns:
The term associated with that name, or null if that term doesn't occur in the indexed material.

getField

public FieldInfo getField()
Description copied from interface: SavedField
Get the field info object for this field.

Specified by:
getField in interface SavedField
Returns:
the FieldInfo

getSavedData

public java.lang.Object getSavedData(int docID,
                                     boolean all)
Gets the data saved for a particular document ID. If no data was stored for that ID, null is returned.

Specified by:
getSavedData in interface SavedField
Parameters:
docID - the document whose data we want
all - if true, a list containing the single stored value for the document will be returned.
Returns:
the data

getUndefined

public ArrayGroup getUndefined(ArrayGroup ag)
Description copied from interface: SavedField
Gets a group of all the documents that do not have any values saved for this field.

Specified by:
getUndefined in interface SavedField
Parameters:
ag - a set of documents to which we should restrict the search for documents with undefined field values. If this is null then there is no such restriction.
Returns:
a set of documents that have no defined values for this field. This set may be restricted to documents occurring in the group that was passed in.

iterator

public DictionaryIterator iterator(java.lang.Object lowerBound,
                                   boolean includeLower,
                                   java.lang.Object upperBound,
                                   boolean includeUpper)
Description copied from interface: SavedField
Gets an iterator for the values in this field.

Specified by:
iterator in interface SavedField

size

public int size()
Description copied from interface: SavedField
Gets the number of saved items that we're storing.

Specified by:
size in interface SavedField

compareTo

public int compareTo(java.lang.Object o)
Specified by:
compareTo in interface java.lang.Comparable

clear

public void clear()
Description copied from interface: SavedField
Clears a saved field, if it's open for indexing.

Specified by:
clear in interface SavedField

bytesInUse

public long bytesInUse()

getDefault

public java.lang.Object getDefault()
Gets the default value for a feature vector, which is null.


euclideanDistance

public double euclideanDistance(double[] vec,
                                int docID)
Computes the Euclidean distance of the given feature vector to the vector for the given ID.

Parameters:
vec - a feature vector
docID - the id of the document to which we want to compute the distance. If there is no data stored for this document, Double.POSITIVE_INFINITY is returned.
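
For reference, the Euclidean distance between two vectors of equal width is the square root of the sum of squared component differences. The sketch below computes it and mirrors the documented convention of returning Double.POSITIVE_INFINITY when no vector is available. The class and method names are hypothetical, a null array stands in for "no data stored", and throwing on a width mismatch is an assumption of this sketch.

```java
// Hypothetical illustration of the distance computation; not the Minion code.
final class EuclideanSketch {
    // A null argument plays the role of "no data stored for this document".
    static double euclideanDistance(double[] a, double[] b) {
        if (a == null || b == null) {
            return Double.POSITIVE_INFINITY;
        }
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have the same width");
        }
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        System.out.println(euclideanDistance(new double[] {0, 0}, new double[] {3, 4})); // prints 5.0
        System.out.println(Double.isInfinite(euclideanDistance(new double[] {0, 0}, null))); // prints true
    }
}
```

Using positive infinity for missing data means such documents sort last in any nearest-first ranking without a separate "undefined" check.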

euclideanDistance

public double[] euclideanDistance(int docID)
Computes the Euclidean distance from the given document to all other documents.

Parameters:
docID - the document.
Returns:
an array of double, indexed by document ID. If there is no data associated with the document that we were given, null is returned. If a document does not have data associated with it, the value for that document will be Double.POSITIVE_INFINITY.
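
The return contract above (null when the query document has no data, positive infinity for any other document without data) can be sketched as follows. The double[][] indexed by document ID is a hypothetical stand-in for the field's internal storage, and the sketch assumes all stored vectors have the same width.

```java
import java.util.Arrays;

// Hypothetical illustration of the one-to-all contract; not the Minion code.
final class AllDistancesSketch {
    // byDocID[id] is the vector for document id, or null if none is stored.
    static double[] distancesFrom(double[][] byDocID, int docID) {
        if (docID < 0 || docID >= byDocID.length || byDocID[docID] == null) {
            return null; // no data for the query document
        }
        double[] query = byDocID[docID];
        double[] out = new double[byDocID.length];
        for (int id = 0; id < byDocID.length; id++) {
            double[] other = byDocID[id];
            if (other == null) {
                out[id] = Double.POSITIVE_INFINITY; // no data for this document
                continue;
            }
            double sum = 0.0;
            for (int i = 0; i < query.length; i++) {
                double d = query[i] - other[i];
                sum += d * d;
            }
            out[id] = Math.sqrt(sum);
        }
        return out;
    }

    public static void main(String[] args) {
        double[][] vectors = {null, {0, 0}, {3, 4}, null};
        System.out.println(Arrays.toString(distancesFrom(vectors, 1)));
        System.out.println(distancesFrom(vectors, 0));
    }
}
```

Note that the entry for the query document itself is 0.0, so a caller ranking neighbours would typically skip that index.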

euclideanDistance

public double[] euclideanDistance(double[] vec)
Computes the Euclidean distance from the given feature vector to all documents.

Parameters:
vec - the feature vector from which we're going to compute the distances.
Returns:
an array of double, indexed by document ID. If a document does not have data associated with it, the value for that document will be Double.POSITIVE_INFINITY.

merge

public void merge(java.lang.String path,
                  SavedField[] fields,
                  int maxID,
                  int[] starts,
                  int[] nUndel,
                  int[][] docIDMaps,
                  java.io.RandomAccessFile dictFile,
                  PostingsOutput postOut)
           throws java.io.IOException
Description copied from interface: SavedField
Merges a number of saved fields.

Specified by:
merge in interface SavedField
Parameters:
path - The path to the index directory.
fields - An array of fields to merge.
maxID - The maximum document ID in the new partition.
starts - The new starting document IDs for the partitions.
nUndel - The number of undeleted documents in each partition.
docIDMaps - A map for each partition from old document IDs to new document IDs. IDs that map to a value less than 0 have been deleted. A null array means that the old IDs are the new IDs.
dictFile - The file to which the merged dictionaries will be written.
postOut - The output to which the merged postings will be written.
Throws:
java.io.IOException - if there is an error during the merge.
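
The docIDMaps convention described above (a null map leaves IDs unchanged, a negative entry marks a deleted document) can be read as the lookup below. The class and method names are hypothetical, and the sketch deliberately ignores starts and nUndel, since how they combine with the maps is not specified on this page.

```java
// Hypothetical illustration of the docIDMaps convention; not the Minion code.
final class DocIDMapSketch {
    // Returns the new document ID, or -1 if the document was deleted.
    static int remap(int[] docIDMap, int oldID) {
        if (docIDMap == null) {
            return oldID; // null map: the old IDs are the new IDs
        }
        int newID = docIDMap[oldID];
        return newID < 0 ? -1 : newID; // any negative entry means deleted
    }

    public static void main(String[] args) {
        int[] map = {-1, 1, -1, 2}; // documents 0 and 2 are deleted
        System.out.println(remap(map, 3));
        System.out.println(remap(map, 2));
        System.out.println(remap(null, 7));
    }
}
```

During a merge, a vector whose document maps to a negative ID would simply not be carried over to the new partition.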

distance

public double distance(int id1,
                       FeatureVector v,
                       int id2)
Gets the distance between two feature vectors stored in different partitions.

Parameters:
id1 - the id of the document containing the vector in this partition
v - the saved field holding the vector for the other partition
id2 - the id of the document containing the vector in the other partition
Returns:
the distance between the feature vectors, or Double.POSITIVE_INFINITY if either vector is undefined for the given IDs.

distance

public double distance(int d1,
                       int d2)