com.sun.labs.minion
Interface DocumentVector

All Superinterfaces:
java.lang.Cloneable
All Known Implementing Classes:
CompositeDocumentVectorImpl, DocumentVectorImpl, MultiDocumentVectorImpl

public interface DocumentVector
extends java.lang.Cloneable

An interface defining the behavior of document vectors. These are the basis for all classification, clustering, and profiling activity. A document vector can be obtained from a search result using the Result.getDocumentVector() method.

The name is a bit misleading: an instance of this interface can represent a set of documents as easily as it can a single document.


Method Summary
 DocumentVector copy()
          Creates a copy of the current document vector and returns it.
 boolean equals(java.lang.Object o)
          Determines if two document vectors are equal.
 ResultSet findSimilar()
          Finds documents that are similar to this one.
 ResultSet findSimilar(java.lang.String sortOrder)
          Finds documents that are similar to this one.
 ResultSet findSimilar(java.lang.String sortOrder, double skimPercent)
          Finds documents that are similar to this one.
 java.lang.String getKey()
          Gets the key for the document associated with this vector.
 float getSimilarity(DocumentVector vector)
          Computes the similarity between this document vector and the supplied vector.
 java.util.Map<java.lang.String,java.lang.Float> getSimilarityTerms(DocumentVector vector)
          Gets a map from term names to weights, where each weight represents how much the term contributed to the similarity of the two documents.
 java.util.Set<java.lang.String> getTerms()
          Gets the set of terms in the document represented by this vector.
 java.util.Map<java.lang.String,java.lang.Float> getTopWeightedTerms(int nTerms)
          Gets the n terms that have the highest document weight in this document vector.
 void setEngine(SearchEngine e)
          Sets the search engine to use with this document vector.
 

Method Detail

copy

DocumentVector copy()
Creates a copy of the current document vector and returns it.

Returns:
a copy of the current document vector

setEngine

void setEngine(SearchEngine e)
Sets the search engine to use with this document vector.

Parameters:
e - the engine

equals

boolean equals(java.lang.Object o)
Determines if two document vectors are equal. Two document vectors are equal if each of their terms is equal in both name and weight.

Overrides:
equals in class java.lang.Object
Parameters:
o - the document vector to which this vector is compared
Returns:
true if the two document vectors are equal, false otherwise
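
The equality rule above can be illustrated on plain maps. This is a minimal sketch, not the Minion implementation: it assumes a vector can be viewed as a map from term name to weight, and the class TermVectorEquality and the method sameTermsAndWeights are hypothetical names introduced only for this example.

```java
import java.util.Map;

// Hypothetical helper, not part of the Minion API.
final class TermVectorEquality {

    /**
     * Returns true when both maps contain exactly the same term names
     * and every term has the same weight in both maps.
     * Assumes neither map contains null values.
     */
    static boolean sameTermsAndWeights(Map<String, Float> a, Map<String, Float> b) {
        if (a.size() != b.size()) {
            return false;
        }
        for (Map.Entry<String, Float> e : a.entrySet()) {
            Float other = b.get(e.getKey());
            // A term missing from b, or present with a different weight, breaks equality.
            if (other == null || Float.compare(other, e.getValue()) != 0) {
                return false;
            }
        }
        return true;
    }
}
```

The comparison is exact. Whether an implementing class tolerates small floating-point differences is not stated here.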

getSimilarity

float getSimilarity(DocumentVector vector)
Computes the similarity between this document vector and the supplied vector. Similarity is measured by the angle between the two vectors: the smaller the angle, the more similar the vectors. The measurement returned is the cosine of the angle between the vectors.

Parameters:
vector - the vector representing the document to compare this vector to
Returns:
the cosine of the angle between the two vectors
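
For readers unfamiliar with the measure, the cosine of the angle between two vectors is their dot product divided by the product of their lengths. The sketch below computes it for two sparse term-weight maps. It illustrates the standard formula only and is not the Minion implementation; the class CosineSketch and its method names are hypothetical, and returning 0 for an empty vector is an assumption of this example.

```java
import java.util.Map;

// Hypothetical illustration of cosine similarity, not part of the Minion API.
final class CosineSketch {

    /** Cosine of the angle between two sparse vectors stored as term -> weight maps. */
    static float cosine(Map<String, Float> a, Map<String, Float> b) {
        double dot = 0.0;
        // Only terms present in both vectors contribute to the dot product.
        for (Map.Entry<String, Float> e : a.entrySet()) {
            Float w = b.get(e.getKey());
            if (w != null) {
                dot += (double) e.getValue() * w;
            }
        }
        double normA = norm(a);
        double normB = norm(b);
        // Assumption for this sketch: a zero-length vector is similar to nothing.
        if (normA == 0.0 || normB == 0.0) {
            return 0.0f;
        }
        return (float) (dot / (normA * normB));
    }

    /** Euclidean length of a sparse vector. */
    private static double norm(Map<String, Float> v) {
        double sum = 0.0;
        for (float w : v.values()) {
            sum += (double) w * w;
        }
        return Math.sqrt(sum);
    }
}
```

With non-negative term weights the result falls between 0 (no shared terms) and 1 (same direction).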

findSimilar

ResultSet findSimilar()
Finds documents that are similar to this one.

Returns:
documents similar to the one this vector represents

findSimilar

ResultSet findSimilar(java.lang.String sortOrder)
Finds documents that are similar to this one.

Parameters:
sortOrder - a string describing the order in which to sort the results
Returns:
documents similar to the one this vector represents

findSimilar

ResultSet findSimilar(java.lang.String sortOrder,
                      double skimPercent)
Finds documents that are similar to this one.

Parameters:
sortOrder - a string describing the order in which to sort the results
skimPercent - a number between 0 and 1 giving the fraction of the features to use when performing findSimilar
Returns:
documents similar to the one this vector represents
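
The description does not say which features are kept when skimPercent is less than 1. One plausible reading, shown below purely as an assumption, is that the highest-weighted fraction of the terms is retained. The class SkimSketch and the method skim are hypothetical names, and the rounding rule (round up) is also an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; the selection rule is an assumption, not documented Minion behavior.
final class SkimSketch {

    /** Keeps the top skimPercent fraction of terms by weight, rounding the count up. */
    static Map<String, Float> skim(Map<String, Float> weights, double skimPercent) {
        if (skimPercent < 0.0 || skimPercent > 1.0) {
            throw new IllegalArgumentException("skimPercent must be between 0 and 1");
        }
        int keep = (int) Math.ceil(skimPercent * weights.size());
        List<Map.Entry<String, Float>> entries = new ArrayList<>(weights.entrySet());
        // Sort by weight, highest first.
        entries.sort((x, y) -> Float.compare(y.getValue(), x.getValue()));
        Map<String, Float> kept = new LinkedHashMap<>();
        for (Map.Entry<String, Float> e : entries.subList(0, keep)) {
            kept.put(e.getKey(), e.getValue());
        }
        return kept;
    }
}
```

Under this reading, a smaller skimPercent trades some accuracy for less work, since fewer terms take part in the similarity computation.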

getKey

java.lang.String getKey()
Gets the key for the document associated with this vector.

Returns:
the key for this document.

getTerms

java.util.Set<java.lang.String> getTerms()
Gets the set of terms in the document represented by this vector.

Returns:
a set of the terms in the document.

getTopWeightedTerms

java.util.Map<java.lang.String,java.lang.Float> getTopWeightedTerms(int nTerms)
Gets the n terms that have the highest document weight in this document vector. The results are expressed as a HashMap from String term names to Float term weights. The term weights are normalized.

Parameters:
nTerms - the number of terms to return
Returns:
a HashMap from Strings to Floats with at most nTerms terms (fewer if the document contains fewer than nTerms terms)
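
Selecting the n highest-weighted terms can be sketched as follows. This is an illustration on a plain map, not the Minion implementation: the class TopTermsSketch and the method topWeighted are hypothetical names, the sketch does not normalize the weights, and returning the terms in decreasing weight order is an assumption that the description above does not state.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of top-n selection, not part of the Minion API.
final class TopTermsSketch {

    /** Returns at most nTerms entries with the highest weights, highest first. */
    static Map<String, Float> topWeighted(Map<String, Float> weights, int nTerms) {
        List<Map.Entry<String, Float>> entries = new ArrayList<>(weights.entrySet());
        // Sort by weight, highest first.
        entries.sort((x, y) -> Float.compare(y.getValue(), x.getValue()));
        // LinkedHashMap preserves insertion order, so iteration follows the sorted order.
        Map<String, Float> top = new LinkedHashMap<>();
        for (Map.Entry<String, Float> e : entries) {
            if (top.size() >= nTerms) {
                break;
            }
            top.put(e.getKey(), e.getValue());
        }
        return top;
    }
}
```

If the input holds fewer than nTerms terms, the result simply contains all of them, matching the "at most nTerms" wording above.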

getSimilarityTerms

java.util.Map<java.lang.String,java.lang.Float> getSimilarityTerms(DocumentVector vector)
Gets a map from term names to weights, where each weight represents how much the term contributed to the similarity of the two documents. Only terms that occur in both documents are returned, as all other terms contribute nothing to the similarity. The keys are ordered by the weights they map to, highest first. That is, the first string returned from an iterator over the key set will be the term with the highest weight.

Parameters:
vector - the document vector to compare this one to
Returns:
a map of String names to Float weights, ordered by decreasing weight
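
A per-term breakdown of this kind can be sketched on plain maps. The sketch assumes that a term's contribution is the product of its weights in the two vectors, which is how each term enters the dot product of a cosine similarity; the interface description does not confirm that Minion computes it this way. The class SimilarityTermsSketch and the method contributions are hypothetical names.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; the contribution formula is an assumption of this sketch.
final class SimilarityTermsSketch {

    /** Maps each shared term to the product of its two weights, highest contribution first. */
    static Map<String, Float> contributions(Map<String, Float> a, Map<String, Float> b) {
        List<Map.Entry<String, Float>> shared = new ArrayList<>();
        for (Map.Entry<String, Float> e : a.entrySet()) {
            Float w = b.get(e.getKey());
            // Terms missing from either vector contribute nothing and are skipped.
            if (w != null) {
                shared.add(new AbstractMap.SimpleEntry<String, Float>(e.getKey(), e.getValue() * w));
            }
        }
        // Sort by contribution, highest first.
        shared.sort((x, y) -> Float.compare(y.getValue(), x.getValue()));
        // LinkedHashMap keeps the sorted order when the key set is iterated.
        Map<String, Float> out = new LinkedHashMap<>();
        for (Map.Entry<String, Float> e : shared) {
            out.put(e.getKey(), e.getValue());
        }
        return out;
    }
}
```

A LinkedHashMap is used here because a plain HashMap gives no iteration-order guarantee, so it could not by itself return the highest-weighted term first.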