com.sun.labs.minion.indexer.partition
Class DocumentVectorLengths

java.lang.Object
  extended by com.sun.labs.minion.indexer.partition.DocumentVectorLengths
Direct Known Subclasses:
CachedDocumentVectorLengths

public class DocumentVectorLengths
extends java.lang.Object

A class that holds the document vector lengths for a partition. It can be used at indexing or query time to build the document vector lengths and dump them to disk. The stored document vector lengths are used for document vector length normalization during querying and classification operations.

The lengths are represented using a file-backed buffer.
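As a rough illustration of what a file-backed store of lengths can look like, the sketch below writes one 4-byte float per document and reads a single value back by seeking to its slot. The on-disk layout, the class name FileBackedLengthsSketch and its methods are assumptions made for this example; the actual format used by this class is not documented here.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;

public class FileBackedLengthsSketch {

    /** Writes one 4-byte float per document, in document ID order (assumed layout). */
    static void writeLengths(File f, float[] lengths) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(f, "rw")) {
            raf.setLength(0); // start from an empty file
            for (float len : lengths) {
                raf.writeFloat(len);
            }
        }
    }

    /** Reads the length for a 1-based document ID by seeking to its 0-based slot. */
    static float readLength(File f, int docID) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(f, "r")) {
            raf.seek((long) (docID - 1) * Float.BYTES);
            return raf.readFloat();
        }
    }

    public static void main(String[] args) throws IOException {
        File f = File.createTempFile("veclens", ".vl");
        f.deleteOnExit();
        writeLengths(f, new float[] {1.5f, 2.0f, 0.25f});
        System.out.println(readLength(f, 2));
    }
}
```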


Field Summary
protected static int BUFF_SIZE
          A standard buffer size to use, in bytes.
protected  ReadableBuffer[] fieldLens
          Buffers containing vector lengths for the vectored fields.
protected static java.lang.String logTag
           
protected  DiskPartition part
          The partition whose values we're storing.
protected  java.io.RandomAccessFile raf
          The random access file that we'll use to back our buffer.
protected  ReadableBuffer vecLens
          A buffer containing the vector lengths for the whole document.
protected  java.io.File vlFile
          The file that (will) contain the document vector lengths.
 
Constructor Summary
DocumentVectorLengths(DiskPartition part, boolean adjustStats)
          Creates a set of vector lengths for a given partition.
DocumentVectorLengths(DiskPartition part, int buffSize, boolean adjustStats)
          Creates a set of vector lengths for a given partition.
 
Method Summary
 void calculateLengths(DiskPartition p, TermStatsDictionary gts, boolean adjustStats)
          Calculates a set of document vector lengths from a partition using a global set of term statistics.
 void close()
          Closes the file associated with the document lengths.
 float getVectorLength(int docID)
          Gets the length of a document associated with this partition.
 float getVectorLength(int docID, int fieldID)
          Gets the length of a document associated with this partition.
 void normalize(int[] docs, float[] scores, int p, float qw, int fieldID)
          Normalizes a set of document scores all in one go, using a local buffer copy to avoid synchronization and churn in the buffer.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

part

protected DiskPartition part
The partition whose values we're storing.


vlFile

protected java.io.File vlFile
The file that (will) contain the document vector lengths.


raf

protected java.io.RandomAccessFile raf
The random access file that we'll use to back our buffer.


vecLens

protected ReadableBuffer vecLens
A buffer containing the vector lengths for the whole document.


fieldLens

protected ReadableBuffer[] fieldLens
Buffers containing vector lengths for the vectored fields.


BUFF_SIZE

protected static int BUFF_SIZE
A standard buffer size to use, in bytes.


logTag

protected static java.lang.String logTag

Constructor Detail

DocumentVectorLengths

public DocumentVectorLengths(DiskPartition part,
                             boolean adjustStats)
                      throws java.io.IOException
Creates a set of vector lengths for a given partition. If the file of vector lengths already exists, it is opened for use. If the file doesn't exist, the vector lengths are calculated using multiple threads.

Parameters:
part - the partition whose document vector lengths we will calculate.
adjustStats - if true and the vector lengths have to be calculated, the global term stats will be modified.
Throws:
java.io.IOException - if there is any error reading or writing the vector lengths.

DocumentVectorLengths

public DocumentVectorLengths(DiskPartition part,
                             int buffSize,
                             boolean adjustStats)
                      throws java.io.IOException
Creates a set of vector lengths for a given partition. If the file of vector lengths already exists, it is opened for use. If the file doesn't exist, then the vector lengths will be calculated and then stored to the file.

Parameters:
part - the partition whose vector lengths we're storing.
buffSize - the size of the buffer to use when storing the lengths.
adjustStats - if true and the vector lengths have to be calculated, the global term stats will be modified.
Throws:
java.io.IOException - if there is any error reading or writing the vector lengths.
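
The open-if-present, otherwise calculate-and-store behaviour described for the constructors can be sketched roughly as follows. This is a minimal sketch under stated assumptions: OpenOrComputeSketch, openOrCompute and the Supplier standing in for the length calculation are hypothetical, and the file layout (a plain sequence of floats) is assumed rather than taken from this class.

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.function.Supplier;

public class OpenOrComputeSketch {

    /**
     * Returns the lengths stored in vlFile if it exists; otherwise runs the
     * (hypothetical) calculator and stores its result in vlFile.
     */
    static float[] openOrCompute(File vlFile, Supplier<float[]> calculator) throws IOException {
        if (vlFile.exists()) {
            int n = (int) (vlFile.length() / Float.BYTES);
            float[] lens = new float[n];
            try (DataInputStream in = new DataInputStream(new FileInputStream(vlFile))) {
                for (int i = 0; i < n; i++) {
                    lens[i] = in.readFloat();
                }
            }
            return lens;
        }
        float[] lens = calculator.get();
        try (DataOutputStream out = new DataOutputStream(new FileOutputStream(vlFile))) {
            for (float len : lens) {
                out.writeFloat(len);
            }
        }
        return lens;
    }

    public static void main(String[] args) throws IOException {
        File f = File.createTempFile("veclens", ".vl");
        f.delete(); // make sure the first call has to calculate
        float[] first = openOrCompute(f, () -> new float[] {1f, 2f});
        float[] second = openOrCompute(f, () -> new float[] {9f}); // ignored: file exists
        System.out.println(first.length + " " + second.length);
        f.delete();
    }
}
```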

Method Detail

calculateLengths

public void calculateLengths(DiskPartition p,
                             TermStatsDictionary gts,
                             boolean adjustStats)
                      throws FileLockException,
                             java.io.IOException
Calculates a set of document vector lengths from a partition using a global set of term statistics. The global term stats may be re-written as a side effect.

Parameters:
p - the partition for which we're calculating document vector lengths
gts - the dictionary of global term stats.
adjustStats - if true, the global term stats will be modified to include the statistics from the terms in the partition. This will be the case when computing vector lengths for a new partition, but not when computing vector lengths for a merged partition, since in that case the global term stats will already include data from the partitions that were merged. If this parameter is false, the global stats will not be rewritten.
Throws:
FileLockException - if we can't lock the vector length file
java.io.IOException - if there is any error writing the vector lengths
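
The effect of adjustStats can be illustrated with a small self-contained sketch. Everything in it is an assumption made for illustration: the maps of document frequencies stand in for TermStatsDictionary, the tf * log(N / df) weighting is a generic choice rather than the weighting this class uses, and the names CalculateLengthsSketch, adjust and vectorLength are hypothetical. The only point taken from the description above is that the global stats are updated when adjustStats is true and left alone otherwise.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CalculateLengthsSketch {

    /** Adds the partition's document frequencies into the global counts. */
    static void adjust(Map<String, Integer> globalDf, Map<String, Integer> partitionDf) {
        for (Map.Entry<String, Integer> e : partitionDf.entrySet()) {
            globalDf.merge(e.getKey(), e.getValue(), Integer::sum);
        }
    }

    /** Euclidean length of one document under an assumed tf * log(N / df) weighting. */
    static float vectorLength(Map<String, Integer> termFreqs,
                              Map<String, Integer> globalDf, int totalDocs) {
        double sumSq = 0.0;
        for (Map.Entry<String, Integer> e : termFreqs.entrySet()) {
            int df = globalDf.getOrDefault(e.getKey(), 1);
            double w = e.getValue() * Math.log((double) totalDocs / df);
            sumSq += w * w;
        }
        return (float) Math.sqrt(sumSq);
    }

    /** Calculates one length per document, updating the global stats only if asked to. */
    static float[] calculateLengths(List<Map<String, Integer>> docs,
                                    Map<String, Integer> partitionDf,
                                    Map<String, Integer> globalDf,
                                    int totalDocs, boolean adjustStats) {
        if (adjustStats) {
            adjust(globalDf, partitionDf); // new partition: fold its stats in
        }
        float[] lens = new float[docs.size()];
        for (int i = 0; i < lens.length; i++) {
            lens[i] = vectorLength(docs.get(i), globalDf, totalDocs);
        }
        return lens;
    }

    public static void main(String[] args) {
        Map<String, Integer> global = new HashMap<>();
        global.put("a", 1);
        Map<String, Integer> part = new HashMap<>();
        part.put("a", 2);
        Map<String, Integer> doc = new HashMap<>();
        doc.put("a", 1);
        calculateLengths(Collections.singletonList(doc), part, global, 4, true);
        System.out.println(global.get("a")); // prints 3
    }
}
```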

getVectorLength

public float getVectorLength(int docID)
Gets the length of a document associated with this partition. This will be used at query and classification time. Note that the buffer uses 0-based indexing, so one is subtracted from the document ID before the lookup.

Parameters:
docID - the ID of the document whose vector length we wish to retrieve.
Returns:
the vector length of the document with the given ID

normalize

public void normalize(int[] docs,
                      float[] scores,
                      int p,
                      float qw,
                      int fieldID)
Normalizes a set of document scores all in one go, using a local buffer copy to avoid synchronization and churn in the buffer. This will modify the scores array.

Parameters:
docs - the document IDs to normalize
scores - the document scores
p - the number of document IDs and scores in the array
qw - the query weight to use for normalization
fieldID - the ID of the field that the scores were computed from and that should be used for normalization.
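
A batch normalization of this kind might look like the sketch below. It assumes that each score is divided by the product of the document's vector length and the query weight, and it uses a cloned array in place of the local buffer copy; both are assumptions for illustration, as are the names NormalizeSketch and the lengths parameter. Field selection is left out to keep the sketch short.

```java
public class NormalizeSketch {

    /**
     * Divides the first p scores by (document length * query weight).
     * Document IDs are 1-based; the lengths array is 0-based.
     */
    static void normalize(float[] lengths, int[] docs, float[] scores, int p, float qw) {
        // Work from a private copy so the shared lengths are read once,
        // without synchronizing on them for every document.
        float[] local = lengths.clone();
        for (int i = 0; i < p; i++) {
            scores[i] /= (local[docs[i] - 1] * qw);
        }
    }

    public static void main(String[] args) {
        float[] lengths = {2f, 4f};
        int[] docs = {1, 2};
        float[] scores = {4f, 8f};
        normalize(lengths, docs, scores, 2, 2f);
        System.out.println(scores[0] + " " + scores[1]); // prints 1.0 1.0
    }
}
```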

getVectorLength

public float getVectorLength(int docID,
                             int fieldID)
Gets the length of a document associated with this partition. This will be used at query and classification time. Note that the buffer uses 0-based indexing, so one is subtracted from the document ID before the lookup.

Parameters:
docID - the ID of the document whose vector length we wish to retrieve.
fieldID - the ID of the field for which we're looking for the length. A field ID of -1 is interpreted as a request for the length using all vectored fields. If this field was not vectored, then a length of 1 is returned, so that dividing weights by document lengths won't cause problems.
Returns:
the length of the vector for the given ID and vectored field
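
The lookup rules described for fieldID can be sketched with plain arrays. FieldLengthSketch and its fields are hypothetical stand-ins for the buffers held by this class; a null entry marks a field that was not vectored.

```java
public class FieldLengthSketch {

    private final float[] docLens;     // lengths over all vectored fields
    private final float[][] fieldLens; // per-field lengths; null if not vectored

    FieldLengthSketch(float[] docLens, float[][] fieldLens) {
        this.docLens = docLens;
        this.fieldLens = fieldLens;
    }

    /** Looks up a length for a 1-based document ID and a field ID. */
    float getVectorLength(int docID, int fieldID) {
        if (fieldID == -1) {
            return docLens[docID - 1]; // length using all vectored fields
        }
        if (fieldID < 0 || fieldID >= fieldLens.length || fieldLens[fieldID] == null) {
            return 1f; // not vectored: dividing by this leaves weights unchanged
        }
        return fieldLens[fieldID][docID - 1];
    }

    public static void main(String[] args) {
        FieldLengthSketch s = new FieldLengthSketch(
                new float[] {3f, 5f},
                new float[][] {{1.5f, 2.5f}, null});
        System.out.println(s.getVectorLength(2, -1)); // prints 5.0
        System.out.println(s.getVectorLength(2, 0));  // prints 2.5
        System.out.println(s.getVectorLength(2, 1));  // prints 1.0
    }
}
```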

close

public void close()
           throws java.io.IOException
Closes the file associated with the document lengths.

Throws:
java.io.IOException - if there is any error closing the file