com.sun.labs.minion.retrieval
Class CompositeDocumentVectorImpl

java.lang.Object
  extended by com.sun.labs.minion.retrieval.CompositeDocumentVectorImpl
All Implemented Interfaces:
DocumentVector, java.lang.Cloneable

public class CompositeDocumentVectorImpl
extends java.lang.Object
implements DocumentVector

An implementation of DocumentVector that represents a composite document vector, that is, a document vector formed by taking a linear combination of more than one vectored field.

See Also:
for an implementation that uses features from the whole document or from just one vectored field.

Field Summary
protected  SearchEngine e
          The search engine that generated this vector.
protected  WeightedFeature[][] fieldFeatures
          The per-field weighted features for the document.
protected  float[] fieldLengths
          The per-field lengths of the vectors.
protected  WeightedField[] fields
          A linear combination of the fields composing this vector.
protected  StopWords ignoreWords
           
protected  boolean initialized
          Whether we've had our features initialized.
protected static java.lang.String logTag
           
protected  boolean normalized
          Whether we've been normalized.
protected  WeightingComponents wc
          A set of weighting components that can be used when calculating term weights.
protected  WeightingFunction wf
          The weighting function to use for computing term weights.
 
Constructor Summary
protected CompositeDocumentVectorImpl()
           
  CompositeDocumentVectorImpl(ResultImpl r, WeightedField[] fields)
          Creates a composite document vector from a search result, using a linear combination of vectored fields.
  CompositeDocumentVectorImpl(SearchEngine e, DocKeyEntry key, WeightedField[] fields)
          Creates a document vector for a given document.
  CompositeDocumentVectorImpl(SearchEngine e, DocKeyEntry key, WeightedField[] fields, WeightingFunction wf, WeightingComponents wc)
           
  CompositeDocumentVectorImpl(SearchEngine e, WeightedFeature[] wf, WeightedField[] fields)
           
 
Method Summary
 DocumentVector copy()
          Creates a copy of the current document vector and returns it.
 float dot(CompositeDocumentVectorImpl dvi)
          Calculates the dot product of this document vector with another.
 float dot(WeightedFeature[] wfv1, WeightedFeature[] wfv2)
          Calculates the dot product of two sets of weighted features.
 ResultSet findSimilar()
          Finds documents that are similar to this one.
 ResultSet findSimilar(java.lang.String sortOrder)
          Finds documents that are similar to this one.
 ResultSet findSimilar(java.lang.String sortOrder, double skimPercent)
          Finds documents that are similar to this one.
 SearchEngine getEngine()
           
 DocKeyEntry getEntry()
           
 java.lang.String getKey()
          Gets the key for the document associated with this vector.
 java.util.SortedSet<WeightedFeature> getSet()
          Gets a sorted set of features.
 float getSimilarity(CompositeDocumentVectorImpl otherVector)
           
 float getSimilarity(DocumentVector otherVector)
          Computes the similarity between this document vector and the supplied vector.
 java.util.SortedSet<WeightedFeature> getSimilarityTerms(CompositeDocumentVectorImpl dvi)
          Gets a sorted (by weight) set of the terms contributing to document similarity with the provided document.
 java.util.Map<java.lang.String,java.lang.Float> getSimilarityTerms(DocumentVector dv)
          Gets a map of term names to weights, where the weights represent the amount the term contributed to the similarity of the two documents.
 java.util.Set<java.lang.String> getTerms()
          Gets the set of terms in the document represented by this vector.
 java.util.Map<java.lang.String,java.lang.Float> getTopWeightedTerms(int nTerms)
          Gets the n terms that have the highest document weight in this document vector.
 java.util.SortedSet<WeightedFeature> getWeightOrderedSet()
           
 void normalize()
          Normalizes the length of this vector to 1.
 void setEngine(SearchEngine e)
          Sets the search engine to use with this document vector.
 java.lang.String toString()
           
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 
Methods inherited from interface com.sun.labs.minion.DocumentVector
equals
 

Field Detail

e

protected transient SearchEngine e
The search engine that generated this vector.


fields

protected WeightedField[] fields
A linear combination of the fields composing this vector.


wf

protected transient WeightingFunction wf
The weighting function to use for computing term weights.


wc

protected transient WeightingComponents wc
A set of weighting components that can be used when calculating term weights.


fieldFeatures

protected WeightedFeature[][] fieldFeatures
The per-field weighted features for the document.


fieldLengths

protected float[] fieldLengths
The per-field lengths of the vectors.


initialized

protected boolean initialized
Whether we've had our features initialized.


normalized

protected boolean normalized
Whether we've been normalized.


logTag

protected static java.lang.String logTag

ignoreWords

protected transient StopWords ignoreWords

Constructor Detail

CompositeDocumentVectorImpl

protected CompositeDocumentVectorImpl()

CompositeDocumentVectorImpl

public CompositeDocumentVectorImpl(ResultImpl r,
                                   WeightedField[] fields)
Creates a composite document vector from a search result, using a linear combination of vectored fields.

Parameters:
r - The search result for which we want a document vector.
fields - a linear combination of fields and weights that should be used to build this document vector. The field names provided in the array should be the names of vectored fields. If a provided field name does not name a vectored field, a warning will be logged, but the operation will proceed.

If this parameter contains a weighted field whose name is null, the data from the unnamed body field is used with the associated weight.

The weights associated with the fields should generally sum to 1, although this is not required. If the weights do not sum to 1, a findSimilar operation may return document similarities greater than 1.


CompositeDocumentVectorImpl

public CompositeDocumentVectorImpl(SearchEngine e,
                                   DocKeyEntry key,
                                   WeightedField[] fields)
Creates a document vector for a given document.

Parameters:
e - The search engine with which the document is associated.
key - The entry from the document dictionary for the given document.
fields - a linear combination of the vectored fields in this document that will be used to build the document vector. If this value is null, a vector for the whole document will be returned. If one of the values in a non-null array has its field name set to null, the vector will include data from the unnamed body field. If one of the fields provided is not a vectored field, a warning will be issued, but processing will proceed.

CompositeDocumentVectorImpl

public CompositeDocumentVectorImpl(SearchEngine e,
                                   DocKeyEntry key,
                                   WeightedField[] fields,
                                   WeightingFunction wf,
                                   WeightingComponents wc)

CompositeDocumentVectorImpl

public CompositeDocumentVectorImpl(SearchEngine e,
                                   WeightedFeature[] wf,
                                   WeightedField[] fields)

Method Detail

copy

public DocumentVector copy()
Description copied from interface: DocumentVector
Creates a copy of the current document vector and returns it.

Specified by:
copy in interface DocumentVector
Returns:
a copy of the current document vector

getEntry

public DocKeyEntry getEntry()

getEngine

public SearchEngine getEngine()

setEngine

public void setEngine(SearchEngine e)
Description copied from interface: DocumentVector
Sets the search engine to use with this document vector.

Specified by:
setEngine in interface DocumentVector
Parameters:
e - the engine

dot

public float dot(CompositeDocumentVectorImpl dvi)
Calculates the dot product of this document vector with another. Because a composite document vector is composed of multiple fields with associated weights, the dot product of two vectors will take these fields and weights into account.

When one document vector contains a field that the other does not, that field contributes nothing to the result (it is probably not a good idea to compute the dot product of such vectors). When the two document vectors have a field in common, the components of the vectors for that field are multiplied by any associated field weights before they are multiplied together.

Parameters:
dvi - another document vector
Returns:
the dot product of the two vectors (i.e. the sum of the products of the components in each dimension)
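
The field handling described above can be illustrated with a minimal, self-contained sketch. It is not the Minion implementation: the `CompositeDotSketch` class, the `FieldVector` type, and the use of maps keyed by field name are assumptions made only for illustration.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch only: these types are illustrative stand-ins, not Minion classes. */
class CompositeDotSketch {

    /** A field's weight in the linear combination plus its term weights. */
    static final class FieldVector {
        final float fieldWeight;
        final Map<String, Float> terms;

        FieldVector(float fieldWeight, Map<String, Float> terms) {
            this.fieldWeight = fieldWeight;
            this.terms = terms;
        }
    }

    /** Field-weighted dot product; a field present in only one vector adds nothing. */
    static float dot(Map<String, FieldVector> v1, Map<String, FieldVector> v2) {
        float sum = 0f;
        for (Map.Entry<String, FieldVector> e : v1.entrySet()) {
            FieldVector f1 = e.getValue();
            FieldVector f2 = v2.get(e.getKey());
            if (f2 == null) {
                continue; // field is missing from the other vector
            }
            float fieldDot = 0f;
            for (Map.Entry<String, Float> t : f1.terms.entrySet()) {
                Float w2 = f2.terms.get(t.getKey());
                if (w2 != null) {
                    fieldDot += t.getValue() * w2;
                }
            }
            // Scaling each side by its field weight equals scaling the product by both.
            sum += f1.fieldWeight * f2.fieldWeight * fieldDot;
        }
        return sum;
    }

    public static void main(String[] args) {
        Map<String, Float> titleTerms = new HashMap<>();
        titleTerms.put("java", 2f);
        Map<String, FieldVector> doc = new HashMap<>();
        doc.put("title", new FieldVector(0.5f, titleTerms));
        System.out.println(dot(doc, doc));
    }
}
```

In this sketch a field that appears in only one of the two vectors is skipped, which mirrors the "no contribution" behavior described for this method.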

dot

public float dot(WeightedFeature[] wfv1,
                 WeightedFeature[] wfv2)
Calculates the dot product of two sets of weighted features. Assumes that the arrays are ordered by the feature names.

Parameters:
wfv1 - a weighted feature vector
wfv2 - another weighted feature vector
Returns:
the dot product of the two vectors (i.e. the sum of the products of the components in each dimension)
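
Because both arrays are assumed to be ordered by feature name, the dot product can be computed in a single merge-style pass. The following is a minimal sketch of that idea rather than the library's code; the `Feature` class is a hypothetical stand-in for WeightedFeature, and the sketch additionally assumes that no name appears twice within one array.

```java
/** Sketch only: Feature is an illustrative stand-in, not the Minion WeightedFeature. */
class SortedDotSketch {

    static final class Feature {
        final String name;
        final float weight;

        Feature(String name, float weight) {
            this.name = name;
            this.weight = weight;
        }
    }

    /** Dot product of two feature arrays that are each sorted by name. */
    static float dot(Feature[] a, Feature[] b) {
        float sum = 0f;
        int i = 0;
        int j = 0;
        while (i < a.length && j < b.length) {
            int cmp = a[i].name.compareTo(b[j].name);
            if (cmp == 0) {
                // Same feature in both arrays: multiply the weights.
                sum += a[i].weight * b[j].weight;
                i++;
                j++;
            } else if (cmp < 0) {
                i++; // a[i] has no partner in b
            } else {
                j++; // b[j] has no partner in a
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        Feature[] d1 = { new Feature("cat", 0.5f), new Feature("dog", 2f) };
        Feature[] d2 = { new Feature("bird", 1f), new Feature("dog", 3f) };
        System.out.println(dot(d1, d2)); // only "dog" is shared
    }
}
```

The pass touches each element at most once, so the cost is linear in the combined length of the two arrays. If the arrays are not sorted by name, shared features can be missed.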

getTerms

public java.util.Set<java.lang.String> getTerms()
Description copied from interface: DocumentVector
Gets the set of terms in the document represented by this vector.

Specified by:
getTerms in interface DocumentVector
Returns:
a set of the terms in the document.

getSimilarityTerms

public java.util.Map<java.lang.String,java.lang.Float> getSimilarityTerms(DocumentVector dv)
Gets a map of term names to weights, where the weights represent the amount the term contributed to the similarity of the two documents. Only terms that occur in both documents are returned, as all other terms have weight zero. The keys in the Map are ordered by their values, from highest weight to lowest. That is, the first string returned from an iterator over the key set will be the term with the highest weight.

Specified by:
getSimilarityTerms in interface DocumentVector
Parameters:
dv - the document vector to compare this one to
Returns:
a sorted hash map of String names to Float weights
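
One way to produce a map whose key set iterates from the highest weight to the lowest is sketched below. It is an illustration under assumptions, not the library's code: the `SimilarityTermsSketch` class is hypothetical, each vector is modeled as a plain map from term to weight, and a term's contribution is assumed to be the product of its two weights.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Sketch only: models each vector as a plain term-to-weight map. */
class SimilarityTermsSketch {

    /** Returns shared terms mapped to their contribution, highest contribution first. */
    static Map<String, Float> similarityTerms(Map<String, Float> a, Map<String, Float> b) {
        List<Map.Entry<String, Float>> shared = new ArrayList<>();
        for (Map.Entry<String, Float> e : a.entrySet()) {
            Float w = b.get(e.getKey());
            if (w != null) {
                // Assumption: a term contributes the product of its two weights.
                shared.add(new AbstractMap.SimpleEntry<String, Float>(
                        e.getKey(), e.getValue() * w));
            }
        }
        // Descending by contribution; ties are broken by term name for determinism.
        shared.sort((x, y) -> {
            int c = Float.compare(y.getValue(), x.getValue());
            return c != 0 ? c : x.getKey().compareTo(y.getKey());
        });
        // LinkedHashMap keeps insertion order, so iteration follows the sort.
        Map<String, Float> result = new LinkedHashMap<>();
        for (Map.Entry<String, Float> e : shared) {
            result.put(e.getKey(), e.getValue());
        }
        return result;
    }
}
```

A caller that only needs the strongest shared term can take the first key from `similarityTerms(a, b).keySet().iterator()`, after checking that the map is not empty.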

getSimilarityTerms

public java.util.SortedSet<WeightedFeature> getSimilarityTerms(CompositeDocumentVectorImpl dvi)
Gets a sorted (by weight) set of the terms contributing to document similarity with the provided document. The set consists of WeightedFeatures that represent the terms that the documents have in common and their combined weights.

Parameters:
dvi - the document to compare this one to
Returns:
a sorted set of WeightedFeature that occurred in both documents

normalize

public void normalize()
Normalizes the length of this vector to 1. Since this is a composite vector, we normalize each of the composites. Not sure this actually makes much sense.
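
A minimal sketch of normalizing each component separately follows. It assumes, purely for illustration, that each field's weights are held in a plain `float[]` and that Euclidean length is used; the `NormalizeSketch` class is hypothetical and is not part of Minion.

```java
/** Sketch only: each row holds the term weights of one field. */
class NormalizeSketch {

    /** Scales every field's weights in place so that the field has Euclidean length 1. */
    static void normalizeFields(float[][] fieldWeights) {
        for (float[] field : fieldWeights) {
            double sumOfSquares = 0.0;
            for (float w : field) {
                sumOfSquares += (double) w * w;
            }
            if (sumOfSquares == 0.0) {
                continue; // an all-zero field cannot be scaled to unit length
            }
            float length = (float) Math.sqrt(sumOfSquares);
            for (int i = 0; i < field.length; i++) {
                field[i] /= length;
            }
        }
    }
}
```

Note that after per-field normalization the combined vector does not in general have length 1, which may be the concern raised in the description above.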


getSet

public java.util.SortedSet<WeightedFeature> getSet()
Gets a sorted set of features.

Returns:
a set of the features in this vector, sorted by name. If this document vector is composed of multiple fields with associated weights, these features will take these weights into account.

getWeightOrderedSet

public java.util.SortedSet<WeightedFeature> getWeightOrderedSet()

getSimilarity

public float getSimilarity(DocumentVector otherVector)
Computes the similarity between this document vector and the supplied vector. The larger the value, the greater the similarity. The measurement returned is the cosine of the angle between the vectors.

Specified by:
getSimilarity in interface DocumentVector
Parameters:
otherVector - the vector representing the document to compare this vector to
Returns:
the cosine of the angle between the two vectors
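
For reference, the cosine measure is the dot product of the two vectors divided by the product of their lengths. The sketch below computes it for vectors modeled as plain term-to-weight maps; the `CosineSketch` class and the choice to return 0 for a zero-length vector are assumptions of the sketch, not documented behavior of this class.

```java
import java.util.Map;

/** Sketch only: models each vector as a plain term-to-weight map. */
class CosineSketch {

    /** Sum of the products of the weights of terms that occur in both maps. */
    static double dot(Map<String, Float> a, Map<String, Float> b) {
        double sum = 0.0;
        for (Map.Entry<String, Float> e : a.entrySet()) {
            Float w = b.get(e.getKey());
            if (w != null) {
                sum += e.getValue() * w;
            }
        }
        return sum;
    }

    /** Cosine of the angle between a and b. */
    static float cosine(Map<String, Float> a, Map<String, Float> b) {
        double lengths = Math.sqrt(dot(a, a)) * Math.sqrt(dot(b, b));
        if (lengths == 0.0) {
            return 0f; // sketch's choice for a vector with no weight
        }
        return (float) (dot(a, b) / lengths);
    }
}
```

For vectors with non-negative weights the result lies between 0 and 1, with 1 meaning that the two vectors point in the same direction.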

getSimilarity

public float getSimilarity(CompositeDocumentVectorImpl otherVector)

findSimilar

public ResultSet findSimilar()
Finds documents that are similar to this one. An OR is run with all the terms in the document. The resulting documents are returned ordered from most similar to least similar.

Specified by:
findSimilar in interface DocumentVector
Returns:
documents similar to the one this vector represents

findSimilar

public ResultSet findSimilar(java.lang.String sortOrder)
Description copied from interface: DocumentVector
Finds documents that are similar to this one.

Specified by:
findSimilar in interface DocumentVector
Parameters:
sortOrder - a string describing the order in which to sort the results
Returns:
documents similar to the one this vector represents

findSimilar

public ResultSet findSimilar(java.lang.String sortOrder,
                             double skimPercent)
Finds documents that are similar to this one. An OR is run with all the terms in the document. The resulting documents are returned ordered from most similar to least similar.

Specified by:
findSimilar in interface DocumentVector
Parameters:
sortOrder - a string describing the order in which to sort the results
skimPercent - a number between 0 and 1 representing what percent of the features should be used to perform findSimilar
Returns:
documents similar to the one this vector represents
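
The effect of skimPercent can be pictured as selecting a fraction of the features before the query is run. The sketch below is an assumption-laden illustration, not the library's algorithm: it assumes that the highest-weighted features are the ones kept and that the count is rounded up, neither of which is stated in this documentation. The `SkimSketch` class and its `skim` method are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Sketch only: picks the top fraction of features by weight. */
class SkimSketch {

    /** Returns the names of the top skimPercent (0 to 1) of features, heaviest first. */
    static List<String> skim(Map<String, Float> features, double skimPercent) {
        if (skimPercent < 0.0 || skimPercent > 1.0) {
            throw new IllegalArgumentException("skimPercent must be between 0 and 1");
        }
        List<Map.Entry<String, Float>> entries = new ArrayList<>(features.entrySet());
        // Descending by weight; ties are broken by name for determinism.
        entries.sort((x, y) -> {
            int c = Float.compare(y.getValue(), x.getValue());
            return c != 0 ? c : x.getKey().compareTo(y.getKey());
        });
        // Assumption: round up, so any positive fraction keeps at least one feature.
        int keep = (int) Math.ceil(skimPercent * entries.size());
        List<String> names = new ArrayList<>();
        for (int i = 0; i < keep; i++) {
            names.add(entries.get(i).getKey());
        }
        return names;
    }
}
```

Under these assumptions, a skimPercent of 1 keeps every feature, and smaller values trade recall for a cheaper query.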

getTopWeightedTerms

public java.util.Map<java.lang.String,java.lang.Float> getTopWeightedTerms(int nTerms)
Description copied from interface: DocumentVector
Gets the n terms that have the highest document weight in this document vector. The results are expressed as a HashMap from String term names to Float term weights. The term weights are normalized.

Specified by:
getTopWeightedTerms in interface DocumentVector
Parameters:
nTerms - the number of terms to return
Returns:
a HashMap from Strings to Floats with at most nTerms terms (fewer if the document contains fewer than nTerms terms)

getKey

public java.lang.String getKey()
Description copied from interface: DocumentVector
Gets the key for the document associated with this vector.

Specified by:
getKey in interface DocumentVector
Returns:
the key for this document.

toString

public java.lang.String toString()
Overrides:
toString in class java.lang.Object