com.sun.labs.minion.retrieval
Class WeightingComponents

java.lang.Object
  extended by com.sun.labs.minion.retrieval.WeightingComponents

public class WeightingComponents
extends java.lang.Object

A class that holds all of the components necessary to implement any number of weighting functions. The names and descriptions here are (mostly) taken from the Zobel and Moffat paper "Exploring the Similarity Space".

The components that this class contains comprise statistics at two levels of description. First, there are the collection-level statistics that are calculated across all of the partitions contained in an index. Second, there are the document-level statistics that are set per term or document being processed, depending on the context.

For example, in typical query processing scenarios we will create a set of weighting components from the collection statistics at the start of query evaluation. As each term in the query is processed, we will set the document-level term statistics using the setTerm(java.lang.String) method. As we process each document in the postings list associated with a term, we will set the document-level statistics directly.
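The flow described above can be sketched as follows. This is a minimal illustration, not the Minion implementation: ComponentsSketch and QueryFlowSketch are hypothetical stand-ins, the postings are plain maps rather than a PostingsIterator, and the TF-IDF style formula in weight() is chosen only for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical stand-in for WeightingComponents; not the Minion class. */
final class ComponentsSketch {
    int N;    // number of documents in the collection (collection-level)
    int ft;   // number of documents containing the current term
    int fdt;  // frequency of the current term in the current document

    // Assumed lookup from term to document frequency, standing in for CollectionStats.
    private final Map<String, Integer> docFreqs;

    ComponentsSketch(int numDocs, Map<String, Integer> docFreqs) {
        this.N = numDocs;
        this.docFreqs = docFreqs;
    }

    /** Plays the role of setTerm(String): loads the per-term statistics. */
    ComponentsSketch setTerm(String term) {
        Integer df = docFreqs.get(term);
        ft = (df == null) ? 0 : df;
        return this;
    }

    /** Plays the role of setDocument: the caller supplies fdt directly. */
    ComponentsSketch setDocument(int termFreqInDoc) {
        fdt = termFreqInDoc;
        return this;
    }

    /** Illustrative weight only; any weighting function could read the same fields. */
    double weight() {
        if (ft == 0 || fdt == 0) {
            return 0.0;
        }
        return (1.0 + Math.log(fdt)) * Math.log(1.0 + (double) N / ft);
    }
}

final class QueryFlowSketch {
    /** Scores documents given postings of the form term -> (document id -> fdt). */
    static Map<Integer, Double> score(ComponentsSketch wc, String[] query,
                                      Map<String, Map<Integer, Integer>> postings) {
        Map<Integer, Double> scores = new LinkedHashMap<>();
        for (String term : query) {
            wc.setTerm(term); // set once per query term
            Map<Integer, Integer> plist = postings.get(term);
            if (plist == null) {
                continue; // term has no postings
            }
            for (Map.Entry<Integer, Integer> p : plist.entrySet()) {
                wc.setDocument(p.getValue()); // set once per posting
                scores.merge(p.getKey(), wc.weight(), Double::sum);
            }
        }
        return scores;
    }
}
```

The point of the sketch is the ordering: collection-level values are fixed when the components are created, per-term values change once per query term, and per-document values change once per posting.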

Note that this class is provided as a convenience and is intended merely as a container into which a number of statistics can be placed. No checking is done on the validity of the statistics placed into it. The use of inappropriate statistics may lead to strange results when calculating term weights.


Field Summary
 float avgDocLen
          The average document length, in words.
 CollectionStats cs
          A set of collection statistics.
 float dvl
          The length of the document vector for the current document.
 int fdt
          The frequency of term t in document d.
 int ft
          The total number of documents containing term t.
 long Ft
          The total number of occurrences of term t in the whole collection.
 long ld
          The total number of words in document d.
protected static java.lang.String logTag
           
 int maxfdt
          The maximum term frequency in the collection.
 int maxft
          The maximum document frequency in the collection.
 int n
          The number of distinct terms in the collection.
 int N
          The total number of documents in the collection.
 int nd
          The number of distinct terms in document d.
 long nTokens
          The number of tokens in the collection, i.e., the sum of the lengths of all the documents.
 TermStatsImpl ts
          The statistics that we were given or that we retrieved for the last call to setTerm.
 float wt
          A collection-level term weight.
 
Constructor Summary
WeightingComponents()
          Creates a set of weighting components.
WeightingComponents(CollectionStats s)
          Initializes a set of weighting components from a set of collection statistics.
 
Method Summary
 TermStatsImpl getTermStats()
           
 TermStatsImpl getTermStats(java.lang.String term)
           
 WeightingComponents setCollection(CollectionStats s)
          Initializes the collection-level statistics.
 WeightingComponents setDocument(DocKeyEntry key)
          Initializes any document-level statistics that can be determined from a document key.
 WeightingComponents setDocument(DocKeyEntry key, java.lang.String field)
           
 WeightingComponents setDocument(PostingsIterator pi)
          Initializes any per-document statistics that can be obtained from a postings iterator.
 WeightingComponents setTerm(java.lang.String name)
          Initializes any document-level statistics that can be determined from a term.
 WeightingComponents setTerm(TermStatsImpl s)
          Initializes any document-level statistics that can be determined from a set of term statistics.
 void setTermStats(java.lang.String term, TermStatsImpl ts)
           
 java.lang.String toString()
           
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Field Detail

cs

public CollectionStats cs
A set of collection statistics. If this value is set by the constructor or by the setCollection method, then the weighting components can handle term statistics lookups on their own.


ts

public TermStatsImpl ts
The statistics that we were given or that we retrieved for the last call to setTerm.


N

public int N
The total number of documents in the collection. Collection-level statistic.


n

public int n
The number of distinct terms in the collection. This is very likely an overestimate, as many terms will be shared in the various partitions' main dictionaries. Collection-level statistic.
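To see why such a count overestimates, assume (this is an assumption; the text above does not say how n is computed) that it is obtained by adding up the sizes of the partitions' dictionaries. The hypothetical helper below compares that sum with the exact count taken over the union of the dictionaries.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical illustration; not part of Minion. */
final class DistinctTermEstimate {
    /** Adds up dictionary sizes, so a shared term is counted once per partition. */
    static int summed(List<Set<String>> partitions) {
        int total = 0;
        for (Set<String> dict : partitions) {
            total += dict.size();
        }
        return total;
    }

    /** Exact number of distinct terms: the size of the union of all dictionaries. */
    static int exact(List<Set<String>> partitions) {
        Set<String> union = new HashSet<>();
        for (Set<String> dict : partitions) {
            union.addAll(dict);
        }
        return union.size();
    }
}
```

For two partitions with dictionaries {a, b, c} and {b, c, d}, summed gives 6 while exact gives 4; the gap is the number of repeated occurrences of shared terms.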


nTokens

public long nTokens
The number of tokens in the collection, i.e., the sum of the lengths of all the documents. Collection-level statistic.


fdt

public int fdt
The frequency of term t in document d. Document-level statistic.


Ft

public long Ft
The total number of occurrences of term t in the whole collection. Document-level statistic.


ft

public int ft
The total number of documents containing term t. Document-level statistic.


maxfdt

public int maxfdt
The maximum term frequency in the collection. For all terms t in the collection and all documents d in the partition, this is the maximum value of f_{d,t}, the frequency of term t in document d. Collection-level statistic.


maxft

public int maxft
The maximum document frequency in the collection. This is given by the term that has the largest number of documents associated with it, across all dictionaries in the collection. It is most likely an underestimate, because it probably does not account for the same term occurring in more than one partition. Collection-level statistic.
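The underestimate can be illustrated with a small sketch. It assumes that each document lives in exactly one partition, so a term's true document frequency is the sum of its per-partition frequencies. MaxDocFreqEstimate and its methods are hypothetical names introduced for this example.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration; not part of Minion. */
final class MaxDocFreqEstimate {
    /** Largest document frequency seen in any single partition dictionary. */
    static int perPartitionMax(List<Map<String, Integer>> partitions) {
        int max = 0;
        for (Map<String, Integer> dict : partitions) {
            for (int df : dict.values()) {
                max = Math.max(max, df);
            }
        }
        return max;
    }

    /** Largest document frequency after summing each term across partitions. */
    static int mergedMax(List<Map<String, Integer>> partitions) {
        Map<String, Integer> merged = new HashMap<>();
        for (Map<String, Integer> dict : partitions) {
            for (Map.Entry<String, Integer> e : dict.entrySet()) {
                merged.merge(e.getKey(), e.getValue(), Integer::sum);
            }
        }
        int max = 0;
        for (int df : merged.values()) {
            max = Math.max(max, df);
        }
        return max;
    }
}
```

With partitions {the: 5, cat: 2} and {the: 4, dog: 3}, perPartitionMax is 5 but mergedMax is 9, because "the" occurs in both partitions.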


nd

public int nd
The number of distinct terms in document d. Document-level statistic.


ld

public long ld
The total number of words in document d. Document-level statistic.


dvl

public float dvl
The length of the document vector for the current document.
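As a point of reference, a vector length is commonly computed as the Euclidean norm of the document's term weights. The sketch below assumes that interpretation; the text above does not state which norm or which weights are used, and DocumentVectorLength is a hypothetical helper.

```java
/** Hypothetical illustration; assumes the Euclidean norm of the term weights. */
final class DocumentVectorLength {
    /** Returns sqrt(sum of w^2) over the given per-term weights. */
    static float length(float[] weights) {
        double sumSq = 0.0;
        for (float w : weights) {
            sumSq += (double) w * w; // accumulate in double to limit rounding error
        }
        return (float) Math.sqrt(sumSq);
    }
}
```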


avgDocLen

public float avgDocLen
The average document length, in words. Collection-level statistic.
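Given the definitions of nTokens (the sum of all document lengths) and N (the number of documents), an average length follows by division. The helper below is a hypothetical sketch of that arithmetic; whether this class derives avgDocLen in exactly this way is not stated here.

```java
/** Hypothetical illustration of the arithmetic; not part of Minion. */
final class AverageDocLength {
    /** Returns nTokens / N as a float, or 0 for an empty collection. */
    static float avgDocLen(long nTokens, int N) {
        if (N <= 0) {
            return 0f; // avoid dividing by zero when there are no documents
        }
        return (float) nTokens / N; // cast before dividing to avoid integer division
    }
}
```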


wt

public float wt
A collection-level term weight.

See Also:
WeightingFunction.initTerm(WeightingComponents)

logTag

protected static java.lang.String logTag
Constructor Detail

WeightingComponents

public WeightingComponents()
Creates a set of weighting components.


WeightingComponents

public WeightingComponents(CollectionStats s)
Initializes a set of weighting components from a set of collection statistics. This will initialize all of the collection-level statistics, but will leave the document-level statistics at their default values.

See Also:
setTerm(java.lang.String)
Method Detail

setCollection

public WeightingComponents setCollection(CollectionStats s)
Initializes the collection-level statistics.

Returns:
this set of weighting components.

setTerm

public WeightingComponents setTerm(java.lang.String name)
Initializes any document-level statistics that can be determined from a term. This method requires that a set of collection statistics was provided at instantiation time or via the setCollection method. If there are no such statistics, a warning is issued and the components in this object are not modified.

Parameters:
name - the name of the term whose statistics we need.
Returns:
this set of weighting components.
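The guard described for setTerm(java.lang.String) can be sketched as follows. GuardedComponentsSketch is a hypothetical stand-in, not the Minion class: collection statistics are modelled as a plain map from term to {ft, Ft}, java.util.logging is used for the warning, and the handling of an unknown term (zero counts) is an assumption of the sketch.

```java
import java.util.Map;
import java.util.logging.Logger;

/** Hypothetical stand-in showing the "warn and leave unchanged" behaviour. */
final class GuardedComponentsSketch {
    private static final Logger LOG =
            Logger.getLogger(GuardedComponentsSketch.class.getName());

    // Stand-in for collection statistics: term -> {ft, Ft}. Null means none were set.
    private Map<String, long[]> collection;

    int ft;   // number of documents containing the current term
    long Ft;  // occurrences of the current term in the whole collection

    GuardedComponentsSketch setCollection(Map<String, long[]> stats) {
        this.collection = stats;
        return this;
    }

    GuardedComponentsSketch setTerm(String name) {
        if (collection == null) {
            LOG.warning("No collection statistics set; ignoring setTerm(" + name + ")");
            return this; // fields are deliberately left untouched
        }
        long[] s = collection.get(name);
        // Assumption of this sketch: an unknown term yields zero counts.
        ft = (s == null) ? 0 : (int) s[0];
        Ft = (s == null) ? 0L : s[1];
        return this;
    }
}
```

Because the method returns the same object in both branches, a caller that chains calls will not notice the missing statistics unless it watches the log, which is why the precondition matters.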

setTerm

public WeightingComponents setTerm(TermStatsImpl s)
Initializes any document-level statistics that can be determined from a set of term statistics.

Parameters:
s - a set of statistics for a term.
Returns:
this set of weighting components.

getTermStats

public TermStatsImpl getTermStats()

getTermStats

public TermStatsImpl getTermStats(java.lang.String term)

setTermStats

public void setTermStats(java.lang.String term,
                         TermStatsImpl ts)

setDocument

public WeightingComponents setDocument(DocKeyEntry key)
Initializes any document-level statistics that can be determined from a document key.

Parameters:
key - a document key entry from a dictionary.
Returns:
this set of weighting components.

setDocument

public WeightingComponents setDocument(DocKeyEntry key,
                                       java.lang.String field)

setDocument

public WeightingComponents setDocument(PostingsIterator pi)
Initializes any per-document statistics that can be obtained from a postings iterator.

Parameters:
pi - a postings iterator that is being processed.
Returns:
this set of weighting components.

toString

public java.lang.String toString()
Overrides:
toString in class java.lang.Object