com.sun.labs.minion.indexer.postings
Class DocumentVectorPostings

java.lang.Object
  extended by com.sun.labs.minion.indexer.postings.IDPostings
      extended by com.sun.labs.minion.indexer.postings.IDFreqPostings
          extended by com.sun.labs.minion.indexer.postings.DocumentVectorPostings
All Implemented Interfaces:
MergeablePostings, Postings
Direct Known Subclasses:
ClusterPostings

public class DocumentVectorPostings
extends IDFreqPostings
implements MergeablePostings

A class to hold postings for the document vectors. For these postings, the IDs that we store are the IDs of the terms that occurred in the document. Along with the IDs, we store the frequency of occurrence of each term.

During indexing, we encounter term IDs in a (seemingly) random order, so we store the IDs and frequencies in an array of integers. At dump time, we use the remap method to map the IDs in the postings to the renumbered IDs from the main dictionary, and only then do we encode the data onto the buffer.
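
The store-then-remap approach described above can be illustrated with a minimal sketch. This is not the Minion implementation: the class RemapSketch, its methods, and the parallel-array layout are hypothetical, and the gap (delta) step is only an assumption about what encoding onto the buffer might involve.

```java
import java.util.Arrays;

/** Hypothetical sketch; names and layout are assumptions, not Minion code. */
final class RemapSketch {

    /**
     * Maps each term ID (held in arrival order) through idMap and returns
     * {newID, freq} pairs sorted by the new ID.
     */
    static int[][] remap(int[] ids, int[] freqs, int[] idMap) {
        int[][] pairs = new int[ids.length][];
        for (int i = 0; i < ids.length; i++) {
            pairs[i] = new int[] { idMap[ids[i]], freqs[i] };
        }
        // Sorting is what makes the IDs suitable for sequential encoding.
        Arrays.sort(pairs, (a, b) -> Integer.compare(a[0], b[0]));
        return pairs;
    }

    /** Assumed encoding step: store each sorted ID as the gap from the previous one. */
    static int[] gaps(int[][] sortedPairs) {
        int[] out = new int[sortedPairs.length];
        int prev = 0;
        for (int i = 0; i < sortedPairs.length; i++) {
            out[i] = sortedPairs[i][0] - prev;
            prev = sortedPairs[i][0];
        }
        return out;
    }

    public static void main(String[] args) {
        int[] ids = {3, 1, 2};          // term IDs in the order they were seen
        int[] freqs = {5, 2, 7};        // frequency of each term
        int[] idMap = {0, 30, 10, 20};  // old ID -> renumbered ID (index 0 unused)
        int[][] remapped = remap(ids, freqs, idMap);
        System.out.println(Arrays.deepToString(remapped));
        System.out.println(Arrays.toString(gaps(remapped)));
    }
}
```

Deferring the sort until dump time keeps add cheap during indexing, at the cost of one sort per document when the partition is written.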

In addition to the usual functionality, these postings calculate document vector lengths at postings dump and merge time, so that the lengths are readily available from document dictionary entries.
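
As a rough illustration of a document vector length, the sketch below computes a Euclidean norm over term weights. The source does not state which norm or weighting is used, so both the Euclidean length and the raw term frequency weighting here are assumptions, and the class VectorLengthSketch is hypothetical.

```java
/** Hypothetical sketch; the norm and the weighting are assumptions. */
final class VectorLengthSketch {

    /** Assumed weighting: the raw term frequency is used directly as the weight. */
    static double[] rawTfWeights(int[] freqs) {
        double[] w = new double[freqs.length];
        for (int i = 0; i < freqs.length; i++) {
            w[i] = freqs[i];
        }
        return w;
    }

    /** Euclidean length of a weight vector: the square root of the sum of squares. */
    static double length(double[] weights) {
        double sumSq = 0.0;
        for (double w : weights) {
            sumSq += w * w;
        }
        return Math.sqrt(sumSq);
    }

    public static void main(String[] args) {
        int[] freqs = {3, 4}; // two terms, occurring 3 and 4 times
        System.out.println(length(rawTfWeights(freqs)));
    }
}
```

Computing the length once, while the frequencies are already being walked for encoding, avoids a second pass over the postings at query time.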


Nested Class Summary
 
Nested classes/interfaces inherited from class com.sun.labs.minion.indexer.postings.IDFreqPostings
IDFreqPostings.IDFreqIterator
 
Nested classes/interfaces inherited from class com.sun.labs.minion.indexer.postings.IDPostings
IDPostings.IDIterator
 
Field Summary
protected  java.util.Map<java.lang.Object,com.sun.labs.minion.indexer.postings.DocumentVectorPostings.EntryFreq> entries
          Storage for the entries making up this set of postings.
protected static java.lang.String logTag
           
 
Fields inherited from class com.sun.labs.minion.indexer.postings.IDFreqPostings
freq, freqs, maxfdt, to
 
Fields inherited from class com.sun.labs.minion.indexer.postings.IDPostings
curr, dataStart, ids, lastID, nIDs, nSkips, post, prevID, skipID, skipPos, skipSize
 
Constructor Summary
DocumentVectorPostings()
          Creates a set of postings suitable for use during indexing.
DocumentVectorPostings(ReadableBuffer b)
          Creates a set of postings suitable for use during querying.
 
Method Summary
 void add(Occurrence o)
          Adds an occurrence to the postings.
 void finish()
          Finishes off the encoding; for these postings, this does nothing.
 WeightedFeature[] getWeightedFeatures(int docID, int fieldID, Dictionary dict, WeightingFunction wf, WeightingComponents wc)
          Gets the entries in this set of postings as an array of weighted features.
 void merge(MergeablePostings mp, int[] map)
          Merges another set of postings with this set of postings.
 void remap(int[] idMap)
          Remaps the IDs in the postings, using the provided ID map.
 int size()
          Estimates the size of the postings associated with this document.
 
Methods inherited from class com.sun.labs.minion.indexer.postings.IDFreqPostings
encode, getMaxFDT, getTotalOccurrences, iterator, recodeID
 
Methods inherited from class com.sun.labs.minion.indexer.postings.IDPostings
addSkip, append, append, getBuffers, getLastID, getN, setSkipSize, skip
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

entries

protected java.util.Map<java.lang.Object,com.sun.labs.minion.indexer.postings.DocumentVectorPostings.EntryFreq> entries
Storage for the entries making up this set of postings.


logTag

protected static java.lang.String logTag
Constructor Detail

DocumentVectorPostings

public DocumentVectorPostings()
Creates a set of postings suitable for use during indexing.


DocumentVectorPostings

public DocumentVectorPostings(ReadableBuffer b)
Creates a set of postings suitable for use during querying.

Parameters:
b - a buffer containing the encoded postings.
Method Detail

add

public void add(Occurrence o)
Adds an occurrence to the postings. We're just keeping a set of the entries added to a document.

Specified by:
add in interface Postings
Overrides:
add in class IDFreqPostings
Parameters:
o - the occurrence to add.

merge

public void merge(MergeablePostings mp,
                  int[] map)
Description copied from interface: MergeablePostings
Merges another set of postings with this set of postings.

Specified by:
merge in interface MergeablePostings
Overrides:
merge in class IDFreqPostings
Parameters:
mp - the postings to merge into these postings.
map - a map from IDs in the postings to IDs in the merged space.
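
The role of the map parameter can be sketched as follows. This is a hypothetical illustration, not the Minion merge logic: the class MergeSketch, the use of a TreeMap from ID to frequency, and the choice to sum frequencies when two IDs collide are all assumptions.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch; the data structures and collision rule are assumptions. */
final class MergeSketch {

    /**
     * Folds the entries of other into target. Each ID in other is first
     * translated into the merged ID space through map.
     */
    static void merge(TreeMap<Integer, Integer> target,
                      Map<Integer, Integer> other,
                      int[] map) {
        for (Map.Entry<Integer, Integer> e : other.entrySet()) {
            int mergedId = map[e.getKey()];
            // Assumed rule: frequencies for the same merged ID are added.
            target.merge(mergedId, e.getValue(), Integer::sum);
        }
    }

    public static void main(String[] args) {
        TreeMap<Integer, Integer> target = new TreeMap<>();
        target.put(10, 1);
        Map<Integer, Integer> other = new TreeMap<>();
        other.put(1, 2);
        other.put(2, 3);
        int[] map = {0, 10, 20}; // old ID -> merged ID (index 0 unused)
        merge(target, other, map);
        System.out.println(target);
    }
}
```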

size

public int size()
Estimates the size of the postings associated with this document.

Specified by:
size in interface Postings
Overrides:
size in class IDPostings

finish

public void finish()
Finishes off the encoding; for these postings, this does nothing.

Specified by:
finish in interface Postings
Overrides:
finish in class IDFreqPostings

remap

public void remap(int[] idMap)
Remaps the IDs in the postings, using the provided ID map. This encodes the postings onto a buffer and as a side effect calculates any necessary document statistics.

Specified by:
remap in interface Postings
Overrides:
remap in class IDPostings
Parameters:
idMap - a map from old IDs to new IDs.

getWeightedFeatures

public WeightedFeature[] getWeightedFeatures(int docID,
                                             int fieldID,
                                             Dictionary dict,
                                             WeightingFunction wf,
                                             WeightingComponents wc)
Gets the entries in this set of postings as an array of weighted features. We will attempt to do this as efficiently as possible. In particular, if the number of terms in the document is a substantial proportion of the number of terms in the dictionary for this partition, we will attempt to iterate through the dictionary in such a way as to minimize the number of dictionary lookups required.

Parameters:
docID - the id of this document, if it is in an already dumped partition.
fieldID - the id of the field from which the postings were drawn.
dict - a dictionary that we can use to fetch term names when all we have is IDs.
wf - a weighting function to use to weight the entries in the document vector.
wc - a set of weighting components to use in the weighting function.
Returns:
an array of weighted features corresponding to the terms in this document. Note that the getEntry method for these features will return the dictionary entry for the term from the partition holding the document. This is a convenience to avoid multiple dictionary lookups in this partition.
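
The lookup strategy described for this method can be sketched as a choice between per-ID lookups and a single ordered pass over the dictionary. Everything here is hypothetical: the class FeatureLookupSketch, the TreeMap standing in for the partition dictionary, and the 0.25 threshold are assumptions made for illustration, and the sketch assumes the document's term IDs are sorted and distinct.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch; the dictionary type and threshold are assumptions. */
final class FeatureLookupSketch {

    /** Assumed cut-off for "a substantial proportion" of the dictionary. */
    static final double DENSE_THRESHOLD = 0.25;

    /** Returns the names of the document's terms, in ID order. */
    static List<String> termNames(int[] sortedTermIds, TreeMap<Integer, String> dict) {
        List<String> names = new ArrayList<>(sortedTermIds.length);
        if (dict.isEmpty() || sortedTermIds.length == 0) {
            return names;
        }
        double proportion = (double) sortedTermIds.length / dict.size();
        if (proportion >= DENSE_THRESHOLD) {
            // Dense case: one ordered pass over the dictionary, stepping
            // through the document's IDs in parallel, with no point lookups.
            int i = 0;
            for (Map.Entry<Integer, String> e : dict.entrySet()) {
                int key = e.getKey();
                while (i < sortedTermIds.length && sortedTermIds[i] < key) {
                    i++; // this ID is not in the dictionary; skip it
                }
                if (i == sortedTermIds.length) {
                    break;
                }
                if (sortedTermIds[i] == key) {
                    names.add(e.getValue());
                    i++;
                }
            }
        } else {
            // Sparse case: one dictionary lookup per term ID.
            for (int id : sortedTermIds) {
                String name = dict.get(id);
                if (name != null) {
                    names.add(name);
                }
            }
        }
        return names;
    }

    public static void main(String[] args) {
        TreeMap<Integer, String> dict = new TreeMap<>();
        dict.put(1, "apple");
        dict.put(2, "banana");
        dict.put(3, "cherry");
        dict.put(4, "date");
        System.out.println(termNames(new int[] {2, 4}, dict));
    }
}
```

Both branches return the same names; only the access pattern differs, which is the point of the trade-off the method description mentions.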