com.sun.labs.minion.retrieval
Class DocumentVectorImpl

java.lang.Object
  extended by com.sun.labs.minion.retrieval.DocumentVectorImpl
All Implemented Interfaces:
DocumentVector, java.io.Serializable, java.lang.Cloneable
Direct Known Subclasses:
MultiDocumentVectorImpl

public class DocumentVectorImpl
extends java.lang.Object
implements DocumentVector, java.io.Serializable

A class that holds a weighted document vector for a given document from a given partition. This implementation is meant to handle features from either the entire document or a single vectored field.

See Also:
MultiDocumentVectorImpl for an implementation that can handle features from multiple vectored fields, Serialized Form

Field Summary
protected  SearchEngine e
          The search engine that generated this vector.
protected  java.lang.String field
           
protected  int fieldID
           
protected  int[] fields
          The field from which this document vector was generated.
protected  StopWords ignoreWords
           
protected  DocKeyEntry key
          The document key for this entry.
protected  java.lang.String keyName
          The name of the key, which will survive transport.
protected  float length
          The length of this document vector.
protected static java.lang.String logTag
           
protected  boolean normalized
          Whether we've been normalized.
protected  QueryStats qs
           
protected  WeightedFeature[] v
          An array to hold the features that make up our vector.
protected  WeightingComponents wc
          A set of weighting components that can be used when calculating term weights.
protected  WeightingFunction wf
          The weighting function to use for computing term weights.
 
Constructor Summary
protected DocumentVectorImpl()
           
  DocumentVectorImpl(ResultImpl r)
          Creates a document vector from a search result.
  DocumentVectorImpl(ResultImpl r, java.lang.String field)
          Creates a document vector for a particular field from a search result.
  DocumentVectorImpl(SearchEngine e, DocKeyEntry key, java.lang.String field)
          Creates a document vector for a given document.
  DocumentVectorImpl(SearchEngine e, DocKeyEntry key, java.lang.String field, WeightingFunction wf, WeightingComponents wc)
           
  DocumentVectorImpl(SearchEngine e, WeightedFeature[] basisFeatures)
           
 
Method Summary
 DocumentVector copy()
          Creates a copy of the current document vector and returns it.
 float dot(DocumentVectorImpl dvi)
          Calculates the dot product of this document vector with another.
 float dot(WeightedFeature[] wfv)
          Calculates the dot product of this feature vector and another feature vector.
 boolean equals(java.lang.Object dv)
          Two document vectors are equal if all their weighted features are equal (in both name and weight).
 ResultSet findSimilar()
          Finds documents similar to this one.
 ResultSet findSimilar(java.lang.String sortOrder)
          Finds documents that are similar to this one.
 ResultSet findSimilar(java.lang.String sortOrder, double skimPercent)
          Finds documents similar to this one, using only a portion of the features.
 SearchEngine getEngine()
           
 DocKeyEntry getEntry()
           
 WeightedFeature[] getFeatures()
           
 java.lang.String getKey()
          Gets the key for the document associated with this vector.
 java.util.SortedSet getSet()
          Gets a sorted set of features.
 float getSimilarity(DocumentVector otherVector)
          Computes the similarity between this document vector and the supplied vector.
 float getSimilarity(DocumentVectorImpl otherVector)
           
 java.util.Map<java.lang.String,java.lang.Float> getSimilarityTerms(DocumentVector dv)
          Gets a map of term names to weights, where the weights represent the amount the term contributed to the similarity of the two documents.
 java.util.SortedSet getSimilarityTerms(DocumentVectorImpl dvi)
          Gets a sorted (by weight) set of the terms contributing to document similarity with the provided document.
 java.util.Set<java.lang.String> getTerms()
          Gets the set of terms in the document represented by this vector.
 java.util.Map<java.lang.String,java.lang.Float> getTopWeightedTerms(int nTerms)
          Gets the n terms that have the highest document weight in this document vector.
 java.util.SortedSet getWeightOrderedSet()
           
 float length()
          Gets the Euclidean length of this vector.
 void normalize()
          Normalizes the length of this vector to 1.
 void setEngine(SearchEngine e)
          Sets the search engine that this vector will use, which is useful after the vector has been deserialized and its transient engine reference needs to be restored.
 void setField(java.lang.String field)
           
 java.lang.String toString()
           
 
Methods inherited from class java.lang.Object
clone, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Field Detail

e

protected transient SearchEngine e
The search engine that generated this vector.


key

protected transient DocKeyEntry key
The document key for this entry.


keyName

protected java.lang.String keyName
The name of the key, which will survive transport.


fields

protected transient int[] fields
The field from which this document vector was generated.


wf

protected transient WeightingFunction wf
The weighting function to use for computing term weights.


wc

protected transient WeightingComponents wc
A set of weighting components that can be used when calculating term weights.


v

protected WeightedFeature[] v
An array to hold the features that make up our vector. This array must be ordered by feature name!


length

protected float length
The length of this document vector.


normalized

protected boolean normalized
Whether we've been normalized.


qs

protected QueryStats qs

logTag

protected static java.lang.String logTag

ignoreWords

protected transient StopWords ignoreWords

field

protected java.lang.String field

fieldID

protected int fieldID
Constructor Detail

DocumentVectorImpl

protected DocumentVectorImpl()

DocumentVectorImpl

public DocumentVectorImpl(ResultImpl r)
Creates a document vector from a search result.

Parameters:
r - The search result for which we want a document vector.

DocumentVectorImpl

public DocumentVectorImpl(ResultImpl r,
                          java.lang.String field)
Creates a document vector for a particular field from a search result.

Parameters:
r - The search result for which we want a document vector.
field - The name of the field for which we want the document vector. If this value is null a vector for the whole document will be returned. If the named field is not a field that was indexed with the vectored attribute set, the resulting document vector will be empty!

DocumentVectorImpl

public DocumentVectorImpl(SearchEngine e,
                          WeightedFeature[] basisFeatures)

DocumentVectorImpl

public DocumentVectorImpl(SearchEngine e,
                          DocKeyEntry key,
                          java.lang.String field)
Creates a document vector for a given document.

Parameters:
e - The search engine with which the document is associated.
key - The entry from the document dictionary for the given document.
field - The name of the field for which we want the document vector. If this value is null a vector for the whole document will be returned. If this value is the empty string, then a vector for the text not in any defined field will be returned. If the named field is not a field that was indexed with the vectored attribute set, the resulting document vector will be empty!

DocumentVectorImpl

public DocumentVectorImpl(SearchEngine e,
                          DocKeyEntry key,
                          java.lang.String field,
                          WeightingFunction wf,
                          WeightingComponents wc)
Method Detail

copy

public DocumentVector copy()
Description copied from interface: DocumentVector
Creates a copy of the current document vector and returns it.

Specified by:
copy in interface DocumentVector
Returns:
a copy of the current document vector

getFeatures

public WeightedFeature[] getFeatures()

setEngine

public void setEngine(SearchEngine e)
Sets the search engine that this vector will use. This is useful after the vector has been deserialized: the engine field is transient, so the reference must be restored before the vector can be used again.

Specified by:
setEngine in interface DocumentVector
Parameters:
e - the engine to use

getEntry

public DocKeyEntry getEntry()

getEngine

public SearchEngine getEngine()

dot

public float dot(DocumentVectorImpl dvi)
Calculates the dot product of this document vector with another.

Parameters:
dvi - another document vector
Returns:
the dot product of the two vectors (i.e. the sum of the products of the components in each dimension)

dot

public float dot(WeightedFeature[] wfv)
Calculates the dot product of this feature vector and another feature vector.

Parameters:
wfv - a weighted feature vector
Returns:
the dot product of the two vectors (i.e. the sum of the products of the components in each dimension)
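
The requirement that the feature array be ordered by feature name suggests that the dot product can be computed with a single merge-style pass over both arrays. The following is a minimal sketch of that idea, not the actual implementation of this class. Feat is a hypothetical stand-in for WeightedFeature, and the class and method names are illustrative only.

```java
// A sketch only: Feat is a hypothetical stand-in for WeightedFeature.
class SparseDotSketch {

    static final class Feat {
        final String name;
        final float weight;

        Feat(String name, float weight) {
            this.name = name;
            this.weight = weight;
        }
    }

    // Dot product of two sparse vectors. Both arrays must be sorted by
    // feature name, which lets us walk them in lockstep like a merge.
    static float dot(Feat[] a, Feat[] b) {
        float sum = 0f;
        int i = 0;
        int j = 0;
        while (i < a.length && j < b.length) {
            int cmp = a[i].name.compareTo(b[j].name);
            if (cmp == 0) {
                // Shared dimension: multiply the two weights.
                sum += a[i].weight * b[j].weight;
                i++;
                j++;
            } else if (cmp < 0) {
                i++; // feature only in a, contributes zero
            } else {
                j++; // feature only in b, contributes zero
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        Feat[] a = { new Feat("cat", 1f), new Feat("dog", 2f), new Feat("emu", 3f) };
        Feat[] b = { new Feat("dog", 4f), new Feat("emu", 1f), new Feat("fox", 5f) };
        System.out.println(dot(a, b)); // 2*4 + 3*1 = 11.0
    }
}
```

Because each array index only moves forward, the pass costs time proportional to the combined number of features rather than their product.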

equals

public boolean equals(java.lang.Object dv)
Two document vectors are equal if all their weighted features are equal (in both name and weight).

Specified by:
equals in interface DocumentVector
Overrides:
equals in class java.lang.Object
Parameters:
dv - the document vector to compare this one to
Returns:
true if the document vectors have equal weighted features

getTerms

public java.util.Set<java.lang.String> getTerms()
Description copied from interface: DocumentVector
Gets the set of terms in the document represented by this vector.

Specified by:
getTerms in interface DocumentVector
Returns:
a set of the terms in the document.

getSimilarityTerms

public java.util.Map<java.lang.String,java.lang.Float> getSimilarityTerms(DocumentVector dv)
Gets a map of term names to weights, where each weight represents the amount the term contributed to the similarity of the two documents. Only terms that occur in both documents are returned, as all other terms have weight zero. The keys in the returned map are ordered by descending weight. That is, the first string returned from an iterator over the key set will be the term with the highest weight.

Specified by:
getSimilarityTerms in interface DocumentVector
Parameters:
dv - the document vector to compare this one to
Returns:
a sorted hash map of String names to Float weights
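
As an illustration of the behavior described above, the sketch below computes per-term contributions for two vectors and returns them ordered from highest to lowest. It is a sketch under stated assumptions rather than the code of this class: vectors are modeled as plain maps from term to weight, a term's contribution is assumed to be the product of its two weights, and a LinkedHashMap is used to preserve the descending order.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// A sketch only: vectors are modeled as Map<String, Float>.
class SimilarityTermsSketch {

    static Map<String, Float> similarityTerms(Map<String, Float> a, Map<String, Float> b) {
        // Collect only terms present in both vectors; others contribute zero.
        List<Map.Entry<String, Float>> shared = new ArrayList<>();
        for (Map.Entry<String, Float> e : a.entrySet()) {
            Float other = b.get(e.getKey());
            if (other != null) {
                // Assumption: contribution is the product of the two weights.
                shared.add(new AbstractMap.SimpleEntry<String, Float>(
                        e.getKey(), e.getValue() * other));
            }
        }
        // Highest contribution first.
        shared.sort(Map.Entry.<String, Float>comparingByValue().reversed());

        // LinkedHashMap keeps insertion order, so iteration is by descending weight.
        Map<String, Float> result = new LinkedHashMap<>();
        for (Map.Entry<String, Float> e : shared) {
            result.put(e.getKey(), e.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Float> a = Map.of("cat", 0.5f, "dog", 0.5f, "emu", 1.0f);
        Map<String, Float> b = Map.of("dog", 1.0f, "emu", 1.0f, "fox", 2.0f);
        System.out.println(similarityTerms(a, b)); // {emu=1.0, dog=0.5}
    }
}
```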

getSimilarityTerms

public java.util.SortedSet getSimilarityTerms(DocumentVectorImpl dvi)
Gets a sorted (by weight) set of the terms contributing to document similarity with the provided document. The set consists of WeightedFeatures that represent the terms that the two documents have in common and their combined weights.

Parameters:
dvi - the document to compare this one to
Returns:
a sorted set of WeightedFeature that occurred in both documents

normalize

public void normalize()
Normalizes the length of this vector to 1.


length

public float length()
Gets the Euclidean length of this vector.
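
For reference, the Euclidean length of a vector is the square root of the sum of its squared weights, and normalizing divides every weight by that length. The sketch below shows both operations on a plain float array. It is illustrative only; the class name and the choice to leave a zero vector unchanged are assumptions, not documented behavior of this class.

```java
// A sketch only: weights are modeled as a plain float array.
class NormalizeSketch {

    // Euclidean length: sqrt of the sum of squared weights.
    static float length(float[] w) {
        double sumSq = 0.0;
        for (float x : w) {
            sumSq += (double) x * x;
        }
        return (float) Math.sqrt(sumSq);
    }

    // Scales the weights in place so the vector has length 1.
    static void normalize(float[] w) {
        float len = length(w);
        if (len == 0f) {
            return; // assumption: a zero vector is left unchanged
        }
        for (int i = 0; i < w.length; i++) {
            w[i] /= len;
        }
    }

    public static void main(String[] args) {
        float[] w = { 3f, 4f };
        System.out.println(length(w)); // 5.0
        normalize(w);
        System.out.println(length(w)); // approximately 1
    }
}
```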


getSet

public java.util.SortedSet getSet()
Gets a sorted set of features.

Returns:
a set of the features in this vector, sorted by name

getWeightOrderedSet

public java.util.SortedSet getWeightOrderedSet()

getSimilarity

public float getSimilarity(DocumentVector otherVector)
Computes the similarity between this document vector and the supplied vector. The larger the value, the greater the similarity. The measurement returned is the cosine of the angle between the vectors.

Specified by:
getSimilarity in interface DocumentVector
Parameters:
otherVector - the vector representing the document to compare this vector to
Returns:
the cosine of the angle between the two vectors
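
The cosine of the angle between two vectors is their dot product divided by the product of their lengths. The sketch below shows that calculation with vectors modeled as maps from term to weight. It is a hedged illustration, not this class's code; the names are hypothetical, and returning 0 when either vector is empty is an assumption.

```java
import java.util.Map;

// A sketch only: vectors are modeled as Map<String, Float>.
class CosineSketch {

    static double norm(Map<String, Float> v) {
        double sumSq = 0.0;
        for (float w : v.values()) {
            sumSq += (double) w * w;
        }
        return Math.sqrt(sumSq);
    }

    // cos(theta) = (a . b) / (|a| * |b|)
    static float cosine(Map<String, Float> a, Map<String, Float> b) {
        double dot = 0.0;
        for (Map.Entry<String, Float> e : a.entrySet()) {
            Float other = b.get(e.getKey());
            if (other != null) {
                dot += (double) e.getValue() * other;
            }
        }
        double na = norm(a);
        double nb = norm(b);
        if (na == 0.0 || nb == 0.0) {
            return 0f; // assumption: similarity with an empty vector is 0
        }
        return (float) (dot / (na * nb));
    }

    public static void main(String[] args) {
        Map<String, Float> a = Map.of("cat", 1f, "dog", 1f);
        Map<String, Float> b = Map.of("dog", 1f, "emu", 1f);
        // One shared term out of two in each vector: cosine is about 0.5.
        System.out.println(cosine(a, b));
    }
}
```

If both vectors have already been normalized to length 1, the division is unnecessary and the cosine reduces to the dot product.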

getSimilarity

public float getSimilarity(DocumentVectorImpl otherVector)

findSimilar

public ResultSet findSimilar()
Finds documents similar to this one. An OR query is run with all the terms in the document. The resulting documents are returned ordered from most similar to least similar.

Specified by:
findSimilar in interface DocumentVector
Returns:
documents similar to the one this vector represents

findSimilar

public ResultSet findSimilar(java.lang.String sortOrder)
Description copied from interface: DocumentVector
Finds documents that are similar to this one.

Specified by:
findSimilar in interface DocumentVector
Parameters:
sortOrder - a string describing the order in which to sort the results
Returns:
documents similar to the one this vector represents

findSimilar

public ResultSet findSimilar(java.lang.String sortOrder,
                             double skimPercent)
Finds documents similar to this one. An OR query is run with the terms in the document. The resulting documents are returned ordered from most similar to least similar.

Specified by:
findSimilar in interface DocumentVector
Parameters:
sortOrder - a string describing the order in which to sort the results
skimPercent - a number between 0 and 1 giving the fraction of the features that should be used to perform findSimilar
Returns:
documents similar to the one this vector represents
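
How the features are chosen for a given skimPercent is not stated here. One plausible reading, shown purely as an assumption, is that the highest-weighted fraction of the features is kept. The sketch below selects that fraction from an array of weights; the class name, the rounding up of the count, and the selection by weight are all hypothetical.

```java
import java.util.Arrays;

// A sketch only: assumes skimming keeps the highest-weighted features.
class SkimSketch {

    // Returns the top fraction of weights, highest first.
    static float[] skim(float[] weights, double skimPercent) {
        if (skimPercent < 0.0 || skimPercent > 1.0) {
            throw new IllegalArgumentException("skimPercent must be between 0 and 1");
        }
        float[] sorted = weights.clone();
        Arrays.sort(sorted); // ascending order

        // Assumption: round the number of kept features up.
        int keep = (int) Math.ceil(skimPercent * sorted.length);
        float[] top = new float[keep];
        for (int i = 0; i < keep; i++) {
            top[i] = sorted[sorted.length - 1 - i];
        }
        return top;
    }

    public static void main(String[] args) {
        float[] weights = { 0.1f, 0.7f, 0.3f, 0.9f };
        System.out.println(Arrays.toString(skim(weights, 0.5))); // [0.9, 0.7]
    }
}
```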

getTopWeightedTerms

public java.util.Map<java.lang.String,java.lang.Float> getTopWeightedTerms(int nTerms)
Description copied from interface: DocumentVector
Gets the n terms that have the highest document weight in this document vector. The results are expressed as a HashMap from String term names to Float term weights. The term weights are normalized.

Specified by:
getTopWeightedTerms in interface DocumentVector
Parameters:
nTerms - the number of terms to return
Returns:
a HashMap from Strings to Floats with at most nTerms terms (fewer if the document contains fewer than nTerms terms)
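
A minimal sketch of selecting the top n terms by weight follows. It assumes the weights passed in are already normalized and models the vector as a map from term to weight. The names are hypothetical, and a LinkedHashMap is used here only so that the example prints in descending order; the ordering of the map returned by this class is not specified above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// A sketch only: vectors are modeled as Map<String, Float>.
class TopTermsSketch {

    static Map<String, Float> topWeightedTerms(Map<String, Float> weights, int nTerms) {
        // Sort all entries by weight, highest first.
        List<Map.Entry<String, Float>> entries = new ArrayList<>(weights.entrySet());
        entries.sort(Map.Entry.<String, Float>comparingByValue().reversed());

        // Keep at most nTerms entries; fewer if the vector is smaller.
        Map<String, Float> top = new LinkedHashMap<>();
        for (Map.Entry<String, Float> e : entries) {
            if (top.size() >= nTerms) {
                break;
            }
            top.put(e.getKey(), e.getValue());
        }
        return top;
    }

    public static void main(String[] args) {
        Map<String, Float> weights = Map.of("cat", 0.2f, "dog", 0.9f, "emu", 0.4f);
        System.out.println(topWeightedTerms(weights, 2)); // {dog=0.9, emu=0.4}
    }
}
```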

getKey

public java.lang.String getKey()
Description copied from interface: DocumentVector
Gets the key for the document associated with this vector.

Specified by:
getKey in interface DocumentVector
Returns:
the key for this document.

toString

public java.lang.String toString()
Overrides:
toString in class java.lang.Object

setField

public void setField(java.lang.String field)