com.sun.labs.minion.indexer.partition
Class DiskPartition

java.lang.Object
  extended by com.sun.labs.minion.indexer.partition.Partition
      extended by com.sun.labs.minion.indexer.partition.DiskPartition
All Implemented Interfaces:
Closeable, com.sun.labs.util.props.Component, com.sun.labs.util.props.Configurable, java.lang.Comparable<Partition>
Direct Known Subclasses:
ClassifierDiskPartition, ClusterDiskPartition, InvFileDiskPartition

public class DiskPartition
extends Partition
implements Closeable

A partition of the index which is resident on the disk and suitable for querying.

A disk partition consists of four things:

See Also:
DiskDictionary, DocumentVectorLengths

Field Summary
protected static int BUFF_SIZE
          Buffer size for merging.
protected  DelMap deletions
          The deletion map for this partition.
protected  java.io.File delFile
          The deleted documents file.
protected  FileLock delFileLock
          A lock for the deleted documents file.
protected  DiskDictionary docDict
          The document dictionary.
protected  java.io.RandomAccessFile docDictFile
          The stream for the document dictionary.
protected  java.io.RandomAccessFile docPostFile
          The postings stream for the document dictionary.
protected  DictionaryFactory documentDictFactory
          A factory for the document dictionary.
protected  DocumentVectorLengths dvl
          The lengths of the document vectors for this partition.
protected  boolean ignored
          Whether this partition was ignored during a merge, due to it being empty.
protected static java.lang.String logTag
          The tag for this module.
protected  DiskDictionary mainDict
          The main dictionary.
protected  java.io.File[] mainFiles
          The files containing the main data.
protected static float MATCH_CUT_OFF
          The limit for variant entries relationship to a stemmed entry.
protected static int MIN_LEN
          Minimum length of a stem.
protected  java.io.File removedFile
          A File indicating that this partition is no longer active.
protected  TermCache termCache
          A cache of uncompressed postings data.
 
Fields inherited from class com.sun.labs.minion.indexer.partition.Partition
DICT_OFFSETS_SIZE, docDictFactory, entryClass, entryName, indexConfig, mainDictFactory, mainDictFile, mainPostFiles, manager, maxID, nEntries, partNumber, PROP_DOC_DICT_FACTORY, PROP_INDEX_CONFIG, PROP_MAIN_DICT_FACTORY, PROP_PARTITION_MANAGER, stats
 
Constructor Summary
DiskPartition(int partNumber, PartitionManager manager, DictionaryFactory mainDictFactory, DictionaryFactory documentDictFactory)
          Opens a partition with a given number.
DiskPartition(int partNumber, PartitionManager manager, DictionaryFactory mainDictFactory, DictionaryFactory documentDictFactory, boolean cacheVectorLengths, int termCacheSize)
          Opens a partition with a given number.
 
Method Summary
 boolean close()
          Close the files associated with this partition.
 boolean close(long currTime)
          Close the files associated with this partition, if enough time has passed.
 void createRemoveFile()
           
 void delete()
          Deletes the files associated with this partition.
 boolean deleteDocument(int docID)
          Deletes a document specified by the given ID.
 boolean deleteDocument(java.lang.String key)
          Deletes a document specified by the given key, if it occurs in this partition.
 boolean docsAreMerged()
          Returns true if documents in this partition type can be merged, that is, if the postings of two same-named documents in different partitions will be combined.
 float getAverageDocumentLength()
          Get the average document length in this partition.
 long getCloseTime()
           
 ReadableBuffer getDeletedDocumentsMap()
          Gets the map of deleted documents for this partition.
 DelMap getDelMap()
           
protected  int[] getDocIDMap(ReadableBuffer del)
          Returns a map from the document IDs in this partition to IDs in a partition that has no deleted documents.
 java.util.Iterator getDocumentIterator()
          Gets an iterator for the document keys in this partition.
protected  java.util.Iterator getDocumentIterator(int begin, int end)
          Gets an iterator for some of the document keys in this partition.
 int getDocumentLength(int docID)
          Gets the length of a document (in words) that's in this partition.
 DocKeyEntry getDocumentTerm(int docID)
          Gets the entry from the document dictionary corresponding to a given document ID.
 DocKeyEntry getDocumentTerm(java.lang.String key)
          Gets the entry from the document dictionary corresponding to a given document key.
 float getDocumentVectorLength(int docID)
          Gets the length of a document vector for a given document.
 float getDocumentVectorLength(int docID, int fieldID)
          Gets the length of a document vector for a given document.
 float getDocumentVectorLength(int docID, java.lang.String field)
          Gets the length of a document vector for a given document.
 DocumentVectorLengths getDVL()
          Gets the document vector lengths associated with this partition.
protected  java.nio.ByteBuffer[] getInputBuffers(int size)
          Gets an array of buffers to use for buffering postings during merges.
 DiskDictionary getMainDictionary()
          Returns the main dictionary, to be used by subclasses.
 DictionaryIterator getMainDictionaryIterator()
          Gets an iterator for the entries in the main dictionary.
 java.util.Iterator getMainDictionaryIterator(java.lang.String start, java.lang.String end)
          Gets an iterator for the entries in the main dictionary.
 java.util.Iterator getMainIterator()
          Gets an iterator for the entries in the main dictionary.
 int getMaxDocumentID()
          Get the maximum document ID.
 int getMaxTermID()
          Gets the maximum term ID from the main dictionary.
 int getNDocs()
          Gets the number of documents in this partition.
 int getNEntries()
          Gets the total number of distinct terms in this partition.
 long getNTokens()
          Gets the total number of tokens indexed in this partition.
 QueryEntry getTerm(int id)
          Gets the entry from the main dictionary that has a given ID.
 QueryEntry getTerm(java.lang.String name)
          Gets the entry in the main dictionary associated with a given name.
 QueryEntry getTerm(java.lang.String name, boolean caseSensitive)
          Gets the term associated with a given name.
 QueryEntry getTerm(java.lang.String name, boolean caseSensitive, DiskDictionary.LookupState lus)
          Gets the term associated with a given name.
 TermCache getTermCache()
          Gets the term cache for this partition, if there is one.
protected  void initAll()
          Initializes the main dictionary and the document dictionary.
protected  void initDocDict()
          Initializes the document dictionary, if necessary.
protected  void initDVL(boolean adjustStats)
          Initializes the document vector lengths.
protected  void initMainDict()
          Initializes the main dictionary, if necessary.
protected  void initMainFiles()
          Initializes the files used for the main dictionary and the associated postings.
 boolean isDeleted(int docID)
          Tells us whether a given document ID has been deleted.
 boolean isIndexed(java.lang.String key)
          Checks to see whether a given document is indexed.
 DiskPartition merge(java.util.List<DiskPartition> partitions, java.util.List<DelMap> delMaps, boolean calculateDVL)
          Merges a number of DiskPartitions into a single partition.
 DiskPartition merge(java.util.List<DiskPartition> partitions, java.util.List<DelMap> delMaps, boolean calculateDVL, int depth)
          Merges a number of DiskPartitions into a single partition.
protected  void mergeCustom(int newPartNumber, DiskPartition[] sortedParts, int[][] idMaps, int newMaxDocID, int[] docIDStart, int[] nUndel, int[][] docIDMaps)
          Provides a place to merge data that is specific to a subclass of disk partition.
 void normalize(int[] docs, float[] scores, int p, float qw, int field)
           
protected static void reap(PartitionManager m, int n)
          Reaps the given partition.
 void setCloseTime(long closeTime)
           
protected  void syncDeletedMap()
          Synchronizes the deletion map in memory with the one on disk.
 java.lang.String toString()
           
protected  boolean updatePartition(java.util.Set<java.lang.Object> keys)
          Updates the partition by deleting any documents whose keys are in the given set.
 
Methods inherited from class com.sun.labs.minion.indexer.partition.Partition
compareTo, getAllFiles, getAllFiles, getDocFiles, getDocFiles, getIndexConfig, getMainFiles, getMainFiles, getManager, getName, getNumPostingsChannels, getPartitionNumber, getQueryConfig, getStats, newProperties
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Field Detail

mainFiles

protected java.io.File[] mainFiles
The files containing the main data.


documentDictFactory

protected DictionaryFactory documentDictFactory
A factory for the document dictionary.


mainDict

protected DiskDictionary mainDict
The main dictionary.


docDict

protected DiskDictionary docDict
The document dictionary.


docDictFile

protected java.io.RandomAccessFile docDictFile
The stream for the document dictionary.


docPostFile

protected java.io.RandomAccessFile docPostFile
The postings stream for the document dictionary.


delFile

protected java.io.File delFile
The deleted documents file.


delFileLock

protected FileLock delFileLock
A lock for the deleted documents file.


removedFile

protected java.io.File removedFile
A File indicating that this partition is no longer active.


deletions

protected DelMap deletions
The deletion map for this partition.


dvl

protected DocumentVectorLengths dvl
The lengths of the document vectors for this partition.


termCache

protected TermCache termCache
A cache of uncompressed postings data.


ignored

protected boolean ignored
Whether this partition was ignored during a merge, due to it being empty.


logTag

protected static java.lang.String logTag
The tag for this module.


BUFF_SIZE

protected static int BUFF_SIZE
Buffer size for merging.


MIN_LEN

protected static int MIN_LEN
Minimum length of a stem.


MATCH_CUT_OFF

protected static float MATCH_CUT_OFF
The limit for variant entries relationship to a stemmed entry.

Constructor Detail

DiskPartition

public DiskPartition(int partNumber,
                     PartitionManager manager,
                     DictionaryFactory mainDictFactory,
                     DictionaryFactory documentDictFactory)
              throws java.io.IOException
Opens a partition with a given number.

Parameters:
partNumber - the number of this partition.
manager - the manager for this partition.
mainDictFactory - the dictionary factory that we will use to create the main dictionary
documentDictFactory - the dictionary factory that we will use to create the document dictionary
Throws:
java.io.IOException - If there is an error opening or reading any of the files making up a partition.
See Also:
Partition, Dictionary

DiskPartition

public DiskPartition(int partNumber,
                     PartitionManager manager,
                     DictionaryFactory mainDictFactory,
                     DictionaryFactory documentDictFactory,
                     boolean cacheVectorLengths,
                     int termCacheSize)
              throws java.io.IOException
Opens a partition with a given number.

Parameters:
partNumber - the number of this partition.
manager - the manager for this partition.
mainDictFactory - the dictionary factory that we will use to create the main dictionary
documentDictFactory - the dictionary factory that we will use to create the document dictionary
cacheVectorLengths - if true, document vector and field vector lengths will be cached in memory for faster access during normalization.
termCacheSize - the size of the term cache for this partition.
Throws:
java.io.IOException - If there is an error opening or reading any of the files making up a partition.
See Also:
Partition, Dictionary
Method Detail

initAll

protected void initAll()
                throws java.io.IOException
Initializes the main dictionary and the document dictionary.

Throws:
java.io.IOException - if there is any error initializing the dictionaries.

initMainFiles

protected void initMainFiles()
                      throws java.io.IOException
Initializes the files used for the main dictionary and the associated postings.

Throws:
java.io.IOException - if there is any error opening the files

initMainDict

protected void initMainDict()
Initializes the main dictionary, if necessary.


initDocDict

protected void initDocDict()
Initializes the document dictionary, if necessary.


initDVL

protected void initDVL(boolean adjustStats)
Initializes the document vector lengths.

Parameters:
adjustStats - if it's necessary to compute the document vector lengths, should we adjust the term statistics while we're at it?

getDVL

public DocumentVectorLengths getDVL()
Gets the document vector lengths associated with this partition.

Returns:
the document vector lengths associated with this partition

getDocumentIterator

public java.util.Iterator getDocumentIterator()
Gets an iterator for the document keys in this partition. All documents, including those that have been deleted, will be returned.

Returns:
an iterator for the entries in the document dictionary, which have the document keys as their names.

getDocumentIterator

protected java.util.Iterator getDocumentIterator(int begin,
                                                 int end)
Gets an iterator for some of the document keys in this partition. All documents in the given range, including those that have been deleted, will be returned.

Parameters:
begin - the ID (inclusive) of the document at which we wish to begin iteration
end - the ID (exclusive) of the document at which we wish to end iteration
Returns:
an iterator for the entries in the document dictionary, which have the document keys as their names.

getMainDictionaryIterator

public DictionaryIterator getMainDictionaryIterator()
Gets an iterator for the entries in the main dictionary.

Returns:
an iterator for the entries in the main dictionary.

getMainDictionaryIterator

public java.util.Iterator getMainDictionaryIterator(java.lang.String start,
                                                    java.lang.String end)
Gets an iterator for the entries in the main dictionary.

Parameters:
start - the name of the entry (inclusive) at which to start the iteration
end - the name of the entry (exclusive) at which to stop the iteration
Returns:
an iterator that will return entries from the main dictionary between the provided start and end names
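
The inclusive start and exclusive end bounds can be illustrated with a sorted map from the standard library. This is only a sketch of the range semantics: RangeIterationSketch and entriesBetween are hypothetical names, and a TreeMap stands in for the on-disk main dictionary, whose actual ordering and implementation are not described here.

```java
import java.util.Iterator;
import java.util.TreeMap;

// Hypothetical stand-in: a TreeMap plays the role of a dictionary sorted by entry name.
class RangeIterationSketch {

    // Iterates over the names in [start, end): start is inclusive, end is exclusive.
    static Iterator<String> entriesBetween(TreeMap<String, Integer> dict,
                                           String start, String end) {
        return dict.subMap(start, true, end, false).keySet().iterator();
    }

    public static void main(String[] args) {
        TreeMap<String, Integer> dict = new TreeMap<>();
        dict.put("apple", 1);
        dict.put("banana", 2);
        dict.put("cherry", 3);
        dict.put("date", 4);
        Iterator<String> it = entriesBetween(dict, "banana", "date");
        while (it.hasNext()) {
            System.out.println(it.next()); // banana, then cherry; "date" is excluded
        }
    }
}
```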

getDocumentTerm

public DocKeyEntry getDocumentTerm(java.lang.String key)
Gets the entry from the document dictionary corresponding to a given document key.

Parameters:
key - the document key
Returns:
the entry in the document dictionary for this key, or null if this key does not occur in the document dictionary or if the document existed in this partition but was deleted.

getDocumentTerm

public DocKeyEntry getDocumentTerm(int docID)
Gets the entry from the document dictionary corresponding to a given document ID.

Parameters:
docID - the document ID
Returns:
the entry in the document dictionary for this ID, or null if this ID does not occur in the document dictionary. Note that this may return the entry for a document that has been deleted!

getDocumentVectorLength

public float getDocumentVectorLength(int docID)
Gets the length of a document vector for a given document. Note that this may cause all of the document vector lengths for this partition to be calculated!

Parameters:
docID - the ID of the document for whose vector we want the length
Returns:
the length of the document vector. If there are any errors getting the length, a value of 1 is returned.

getDocumentVectorLength

public float getDocumentVectorLength(int docID,
                                     java.lang.String field)
Gets the length of a document vector for a given document. Note that this may cause all of the document vector lengths for this partition to be calculated!

Parameters:
docID - the ID of the document for whose vector we want the length
field - the vectored field for which we want the document vector length. If this value is null, the length for all vectored fields is returned. If this value is the empty string, the length for the default body field is returned. If this value does not name a vectored field, a default value of 1 will be returned.
Returns:
the length of the document vector. If there are any errors getting the length, a value of 1 is returned.

getDocumentVectorLength

public float getDocumentVectorLength(int docID,
                                     int fieldID)
Gets the length of a document vector for a given document. Note that this may cause all of the document vector lengths for this partition to be calculated!

Parameters:
docID - the ID of the document for whose vector we want the length
fieldID - the ID of the field for which we want the length. If this value is less than 0, the length for all vectored fields is returned. If this value is 0, the length for the default body field is returned. Otherwise, the length for the corresponding field is returned.
Returns:
the length of the document vector. If there are any errors getting the length, a value of 1 is returned.
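
The fieldID convention described above (negative for all vectored fields, 0 for the default body field, any other value for a specific field, and 1 as the fallback on error) can be sketched as follows. VectorLengthLookupSketch, lengthFor and the array layout are assumptions made for illustration; they are not how DocumentVectorLengths actually stores its data.

```java
// Hypothetical in-memory stand-in for stored document vector lengths.
class VectorLengthLookupSketch {
    private final float[] allFields;  // allFields[docID]: length over all vectored fields
    private final float[][] perField; // perField[fieldID][docID]; fieldID 0 is the default body field

    VectorLengthLookupSketch(float[] allFields, float[][] perField) {
        this.allFields = allFields;
        this.perField = perField;
    }

    float lengthFor(int docID, int fieldID) {
        try {
            if (fieldID < 0) {
                return allFields[docID];     // negative ID: all vectored fields
            }
            return perField[fieldID][docID]; // 0: default body field, otherwise the given field
        } catch (ArrayIndexOutOfBoundsException e) {
            return 1f;                       // documented fallback when the length cannot be read
        }
    }

    public static void main(String[] args) {
        float[] all = {0f, 5f, 13f};
        float[][] per = {{0f, 3f, 12f}, {0f, 4f, 5f}};
        VectorLengthLookupSketch lengths = new VectorLengthLookupSketch(all, per);
        System.out.println(lengths.lengthFor(1, -1)); // all vectored fields for document 1
        System.out.println(lengths.lengthFor(2, 0));  // default body field for document 2
        System.out.println(lengths.lengthFor(2, 7));  // unknown field: falls back to 1
    }
}
```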

normalize

public void normalize(int[] docs,
                      float[] scores,
                      int p,
                      float qw,
                      int field)

syncDeletedMap

protected void syncDeletedMap()
Synchronizes the deletion map in memory with the one on disk.


close

public boolean close()
Close the files associated with this partition.

Returns:
true if the files were successfully closed.

close

public boolean close(long currTime)
Close the files associated with this partition, if enough time has passed. We normally want to delay the close of the dictionaries in order to make sure that any queries in flight have completed.

Specified by:
close in interface Closeable
Parameters:
currTime - the current time
Returns:
true if the partition was closed, false otherwise.
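
A delayed close of this kind might look like the sketch below. It assumes that "enough time has passed" means the current time has reached the value given to setCloseTime; that rule, the millisecond unit, and the class name DelayedCloseSketch are assumptions, not taken from the DiskPartition source.

```java
// Hypothetical resource whose close is deferred until a deadline passes.
class DelayedCloseSketch {
    private long closeTime; // assumed: earliest time (in ms) at which closing is allowed
    private boolean closed;

    void setCloseTime(long closeTime) {
        this.closeTime = closeTime;
    }

    long getCloseTime() {
        return closeTime;
    }

    // Closes only once currTime has reached closeTime; returns whether the resource is closed.
    boolean close(long currTime) {
        if (!closed && currTime >= closeTime) {
            closed = true; // a real implementation would release its files here
        }
        return closed;
    }

    public static void main(String[] args) {
        DelayedCloseSketch part = new DelayedCloseSketch();
        part.setCloseTime(1000L);
        System.out.println(part.close(999L));  // too early, so queries in flight can finish
        System.out.println(part.close(1000L)); // deadline reached
    }
}
```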

delete

public void delete()
Deletes the files associated with this partition.


reap

protected static void reap(PartitionManager m,
                           int n)
Reaps the given partition. If the postings file cannot be removed, then we return control immediately.

Parameters:
m - The manager associated with the partition.
n - The partition number to reap.

getTerm

public QueryEntry getTerm(java.lang.String name)
Gets the entry in the main dictionary associated with a given name. This is a case-insensitive lookup.

Parameters:
name - The name of the term, as a string.
Returns:
The term associated with that name.
See Also:
getTerm(String,boolean)

getTerm

public QueryEntry getTerm(int id)
Gets the entry from the main dictionary that has a given ID.

Parameters:
id - The ID of the term that we want to get.
Returns:
the entry associated with that ID, or null if the ID is not in the main dictionary.

getTermCache

public TermCache getTermCache()
Gets the term cache for this partition, if there is one.

Returns:
the term cache, or null if there is none.

getTerm

public QueryEntry getTerm(java.lang.String name,
                          boolean caseSensitive)
Gets the term associated with a given name.

Parameters:
name - The name of the term.
caseSensitive - If true then the term should be looked up in the case that it is given.
Returns:
the entry from the main dictionary associated with the given name.

getTerm

public QueryEntry getTerm(java.lang.String name,
                          boolean caseSensitive,
                          DiskDictionary.LookupState lus)
Gets the term associated with a given name.

Parameters:
name - The name of the term.
caseSensitive - If true then the term should be looked up in the case that it is given.
lus - a lookup state to use for the dictionary lookup
Returns:
the entry from the main dictionary associated with the given name.

isIndexed

public boolean isIndexed(java.lang.String key)
Checks to see whether a given document is indexed. A document is indexed if the provided key appears in the index and the associated document ID has not been deleted.

Parameters:
key - the key for the document that we want to check
Returns:
true if this key occurs in this partition and the document has not been deleted.
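
The two-part test described above (the key is present and its document ID is not marked deleted) can be sketched with a map and a bit set. IsIndexedSketch and its members are hypothetical stand-ins for the document dictionary and the deletion map.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in: a map for the document dictionary, a BitSet for the deletion map.
class IsIndexedSketch {
    private final Map<String, Integer> keyToDocID = new HashMap<>();
    private final BitSet deleted = new BitSet();

    void add(String key, int docID) {
        keyToDocID.put(key, docID);
    }

    // Marks the document as deleted; the key itself stays in the dictionary.
    boolean deleteDocument(String key) {
        Integer id = keyToDocID.get(key);
        if (id == null || deleted.get(id)) {
            return false;
        }
        deleted.set(id);
        return true;
    }

    // Indexed means: the key is known AND its document ID has not been deleted.
    boolean isIndexed(String key) {
        Integer id = keyToDocID.get(key);
        return id != null && !deleted.get(id);
    }

    public static void main(String[] args) {
        IsIndexedSketch part = new IsIndexedSketch();
        part.add("doc-a", 1);
        System.out.println(part.isIndexed("doc-a")); // present and not deleted
        part.deleteDocument("doc-a");
        System.out.println(part.isIndexed("doc-a")); // still in the dictionary, but deleted
    }
}
```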

deleteDocument

public boolean deleteDocument(int docID)
Deletes a document specified by the given ID.

Parameters:
docID - The ID of the document to delete.
Returns:
true if the document is in this partition and was deleted, false otherwise.

deleteDocument

public boolean deleteDocument(java.lang.String key)
Deletes a document specified by the given key, if it occurs in this partition.

Parameters:
key - The document key to be deleted.
Returns:
true if the document occurs in this partition and was deleted, false otherwise.

isDeleted

public boolean isDeleted(int docID)
Tells us whether a given document ID has been deleted.

Parameters:
docID - the ID of the document that we want to check
Returns:
true if the document has been deleted, false otherwise.

updatePartition

protected boolean updatePartition(java.util.Set<java.lang.Object> keys)
Updates the partition by deleting any documents whose keys are in the given set. Available only to package mates.

Parameters:
keys - a set of keys to delete. The string representation of the elements of the set will be the keys to delete.
Returns:
true if any documents were deleted, false otherwise.
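
Because the set holds arbitrary objects, each element is turned into a key through its string representation. The sketch below shows that conversion; UpdatePartitionSketch, deleteKeys and the map and bit set it operates on are hypothetical and only approximate the behavior described above.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

class UpdatePartitionSketch {

    // Deletes every document whose key equals the toString() of an element of keys.
    // Returns true if at least one document was newly marked as deleted.
    static boolean deleteKeys(Set<Object> keys, Map<String, Integer> keyToDocID, BitSet deleted) {
        boolean anyDeleted = false;
        for (Object k : keys) {
            Integer id = keyToDocID.get(k.toString());
            if (id != null && !deleted.get(id)) {
                deleted.set(id);
                anyDeleted = true;
            }
        }
        return anyDeleted;
    }

    public static void main(String[] args) {
        Map<String, Integer> keyToDocID = new HashMap<>();
        keyToDocID.put("42", 1);
        keyToDocID.put("doc-b", 2);
        BitSet deleted = new BitSet();

        Set<Object> keys = new LinkedHashSet<>();
        keys.add(42);      // an Integer; its string form "42" is the key
        keys.add("doc-b");
        System.out.println(deleteKeys(keys, keyToDocID, deleted)); // both documents get deleted
        System.out.println(deleteKeys(keys, keyToDocID, deleted)); // nothing left to delete
    }
}
```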

getNDocs

public int getNDocs()
Gets the number of documents in this partition. This excludes deleted documents.

Specified by:
getNDocs in class Partition
Returns:
the number of documents in this partition, not including deleted documents

getMaxDocumentID

public int getMaxDocumentID()
Get the maximum document ID. Note that this value can be larger than the number of documents in the partition, due to the presence of deleted documents.

Returns:
the maximum ID assigned to a document in this partition.
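
The gap between the maximum document ID and the document count comes from deleted documents. As a minimal sketch, assuming that IDs 1 through the maximum are all assigned and that the deletion bitmap only has bits set in that range, the live count is the maximum ID minus the number of deleted IDs. DocCountSketch and liveDocs are hypothetical names.

```java
import java.util.BitSet;

class DocCountSketch {

    // Assumes document IDs run from 1 to maxDocID with no unassigned IDs.
    static int liveDocs(int maxDocID, BitSet deleted) {
        return maxDocID - deleted.cardinality();
    }

    public static void main(String[] args) {
        BitSet deleted = new BitSet();
        deleted.set(2);
        deleted.set(4);
        // Five IDs were assigned and two were deleted, so three documents remain.
        System.out.println(liveDocs(5, deleted));
    }
}
```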

getMaxTermID

public int getMaxTermID()
Gets the maximum term ID from the main dictionary.

Returns:
the maximum term ID in the dictionary

getDocumentLength

public int getDocumentLength(int docID)
Gets the length of a document (in words) that's in this partition.

Parameters:
docID - the ID of the document for which we want the length
Returns:
the length of the document
See Also:
for a way to get the length of the vector associated with this document

getAverageDocumentLength

public float getAverageDocumentLength()
Get the average document length in this partition.

Returns:
the average length (in words) of the documents in this partition

getNTokens

public long getNTokens()
Gets the total number of tokens indexed in this partition.

Returns:
the total number of tokens

getNEntries

public int getNEntries()
Gets the total number of distinct terms in this partition.

Returns:
the number of distinct terms in the partition

getDeletedDocumentsMap

public ReadableBuffer getDeletedDocumentsMap()
Gets the map of deleted documents for this partition.

Returns:
the bitmap of deleted documents.

getDelMap

public DelMap getDelMap()

getDocIDMap

protected int[] getDocIDMap(ReadableBuffer del)
Returns a map from the document IDs in this partition to IDs in a partition that has no deleted documents.

Parameters:
del - a buffer of deleted documents
Returns:
An array of int containing the mapping, where deleted documents map to < 0, or null if there are no deleted documents. The 0th element of the returned array contains the number of undeleted documents.
See Also:
DelMap.getDelMap()
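
The shape of the returned array can be sketched as follows. The sketch assumes that document IDs start at 1, that surviving documents are renumbered densely from 1 in their original order, and that deleted documents map to -1; a BitSet stands in for the ReadableBuffer of deletions. DocIdMapSketch and docIdMap are hypothetical names.

```java
import java.util.Arrays;
import java.util.BitSet;

class DocIdMapSketch {

    // Maps old document IDs (1..maxDocID) to new IDs in a partition without deletions.
    // Element 0 holds the number of undeleted documents; null means nothing was deleted.
    static int[] docIdMap(BitSet deleted, int maxDocID) {
        if (deleted.isEmpty()) {
            return null;
        }
        int[] map = new int[maxDocID + 1];
        int next = 0;
        for (int id = 1; id <= maxDocID; id++) {
            map[id] = deleted.get(id) ? -1 : ++next;
        }
        map[0] = next;
        return map;
    }

    public static void main(String[] args) {
        BitSet deleted = new BitSet();
        deleted.set(2);
        deleted.set(4);
        // Documents 1, 3 and 5 survive and become 1, 2 and 3.
        System.out.println(Arrays.toString(docIdMap(deleted, 5)));
    }
}
```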

getMainIterator

public java.util.Iterator getMainIterator()
Gets an iterator for the entries in the main dictionary.

Returns:
an iterator for the entries in the main dictionary.

getInputBuffers

protected java.nio.ByteBuffer[] getInputBuffers(int size)
Gets an array of buffers to use for buffering postings during merges.

Parameters:
size - The size of the input buffers to use.
Returns:
An array of buffers large enough to handle any dictionary merge.

getMainDictionary

public DiskDictionary getMainDictionary()
Returns the main dictionary, to be used by subclasses.

Returns:
the main dictionary

merge

public DiskPartition merge(java.util.List<DiskPartition> partitions,
                           java.util.List<DelMap> delMaps,
                           boolean calculateDVL)
                    throws java.lang.Exception
Merges a number of DiskPartitions into a single partition.

Parameters:
partitions - the partitions to merge
delMaps - the state of the deletion maps for the partitions to merge before the merge started. We need these to be the same as the ones at the place where the merge was called for (see PartitionManager.Merger), otherwise we might get some skew in the maps between when they are recorded there and recorded here!
calculateDVL - if true, then calculate the document vector lengths for the documents in the merged partition after the merge is finished.
Returns:
the newly-merged partition.
Throws:
java.lang.Exception - If there is any error during the merge.
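
The skew that the delMaps parameter guards against can be shown with a snapshot: if the maps are copied when the merge is requested, deletions that arrive afterwards do not change what the merge sees. This sketch uses BitSet copies as a stand-in for DelMap; DelMapSnapshotSketch and snapshot are hypothetical names, and how PartitionManager.Merger actually records the maps is not shown here.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

class DelMapSnapshotSketch {

    // Copies each deletion map so that later changes to the live maps are not visible.
    static List<BitSet> snapshot(List<BitSet> live) {
        List<BitSet> copies = new ArrayList<>(live.size());
        for (BitSet b : live) {
            copies.add((BitSet) b.clone());
        }
        return copies;
    }

    public static void main(String[] args) {
        BitSet liveMap = new BitSet();
        liveMap.set(2);
        List<BitSet> live = new ArrayList<>();
        live.add(liveMap);

        List<BitSet> atMergeStart = snapshot(live);
        liveMap.set(5); // a deletion that arrives while the merge is running

        System.out.println(atMergeStart.get(0).get(5)); // the snapshot does not see it
        System.out.println(liveMap.get(5));             // the live map does
    }
}
```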

merge

public DiskPartition merge(java.util.List<DiskPartition> partitions,
                           java.util.List<DelMap> delMaps,
                           boolean calculateDVL,
                           int depth)
                    throws java.lang.Exception
Merges a number of DiskPartitions into a single partition.

Parameters:
partitions - the partitions to merge
delMaps - the state of the deletion maps for the partitions to merge before the merge started. We need these to be the same as the ones at the place where the merge was called for (see PartitionManager.Merger), otherwise we might get some skew in the maps between when they are recorded there and recorded here!
calculateDVL - if true, then calculate the document vector lengths for the documents in the merged partition after the merge is finished.
Returns:
the newly-merged partition.
Throws:
java.lang.Exception - If there is any error during the merge.

mergeCustom

protected void mergeCustom(int newPartNumber,
                           DiskPartition[] sortedParts,
                           int[][] idMaps,
                           int newMaxDocID,
                           int[] docIDStart,
                           int[] nUndel,
                           int[][] docIDMaps)
                    throws java.lang.Exception
Provides a place to merge data that is specific to a subclass of disk partition. This method will be called after the disk partition data is merged, but inside the try block for the whole merge.

Parameters:
newPartNumber - the number of the new partition
sortedParts - the sorted list of partitions
idMaps - a set of maps from old entry ids in the main dictionary to new entry ids in the merged dictionary
newMaxDocID - the new maximum document id
docIDStart - the starting doc ids
nUndel - the number of undeleted documents in each partition
docIDMaps - doc id maps (see merge)
Throws:
java.lang.Exception
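
The hook described above follows the template-method pattern: the base class runs the shared merge and then calls an overridable method inside the same try block, so a failure in subclass code fails the whole merge. The sketch below uses hypothetical classes (BaseMergeSketch, SubclassMergeSketch) with a parameterless hook; the real mergeCustom takes the arguments listed above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical base class: merges shared data, then calls the subclass hook.
class BaseMergeSketch {
    final List<String> log = new ArrayList<>();

    final boolean merge() {
        try {
            log.add("core data merged");
            mergeCustom(); // runs inside the try block, after the core merge
            return true;
        } catch (Exception e) {
            log.add("merge failed: " + e.getMessage());
            return false;
        }
    }

    // Does nothing by default; subclasses override this to merge their own data.
    protected void mergeCustom() throws Exception {
    }
}

// Hypothetical subclass with extra data to merge.
class SubclassMergeSketch extends BaseMergeSketch {
    @Override
    protected void mergeCustom() {
        log.add("subclass data merged");
    }

    public static void main(String[] args) {
        SubclassMergeSketch part = new SubclassMergeSketch();
        System.out.println(part.merge());
        System.out.println(part.log);
    }
}
```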

docsAreMerged

public boolean docsAreMerged()
Returns true if documents in this partition type can be merged, that is, if the postings of two same-named documents in different partitions will be combined.

Returns:
false by default; subclasses may override this.

toString

public java.lang.String toString()
Overrides:
toString in class java.lang.Object

setCloseTime

public void setCloseTime(long closeTime)
Specified by:
setCloseTime in interface Closeable

getCloseTime

public long getCloseTime()
Specified by:
getCloseTime in interface Closeable

createRemoveFile

public void createRemoveFile()
Specified by:
createRemoveFile in interface Closeable