com.sun.labs.minion.indexer.partition
Class PartitionManager

java.lang.Object
  extended by com.sun.labs.minion.indexer.partition.PartitionManager
All Implemented Interfaces:
com.sun.labs.util.props.Component, com.sun.labs.util.props.Configurable
Direct Known Subclasses:
ClassifierManager, ClusterManager

public class PartitionManager
extends java.lang.Object
implements com.sun.labs.util.props.Configurable

For any particular collection that we will index, there can be multiple entry types and for each entry type there can be multiple partitions. A PartitionManager is used to manage all of the partitions in a collection that have the same entry type.

The static getManager method can be used to retrieve the partition manager for a particular entry type.

The PartitionManager maintains a static HashMap that maps index directories and entry names to the actual manager instance for entries of that type. The key for the hash is <indexDir>/<EntryType>, which will allow us to open multiple collections in the same VM.
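
The keying scheme described above can be sketched as follows. This is only an illustration and not Minion's code: the class ManagerRegistry and its register and lookup methods are hypothetical, and the sketch uses an instance map rather than a static one. Only the <indexDir>/<EntryType> key format comes from the description.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical registry illustrating the <indexDir>/<EntryType> keying scheme.
final class ManagerRegistry<M> {
    private final Map<String, M> managers = new HashMap<>();

    // Builds the map key from the index directory and the entry type.
    static String key(String indexDir, String entryType) {
        return indexDir + "/" + entryType;
    }

    // Registers a manager and returns the one it replaced, or null.
    synchronized M register(String indexDir, String entryType, M manager) {
        return managers.put(key(indexDir, entryType), manager);
    }

    // Returns the manager for the directory and entry type, or null.
    synchronized M lookup(String indexDir, String entryType) {
        return managers.get(key(indexDir, entryType));
    }
}
```

Because the index directory is part of the key, two collections with the same entry type in different directories map to different managers.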

The PartitionManager for a particular entry type provides access to the set of DiskPartitions that contain entries of that type.

The PartitionManager is also responsible for providing two kinds of data for partition use. First, it hands out the numbers that are used for the partitions. These numbers are local to the PartitionManager for a given entry type.

A partition manager can be given a name via the setName method. This name can be used to find the corresponding partition manager object. The name defaults to the name of the index directory.


Nested Class Summary
protected  class PartitionManager.ExtFilter
          A class that implements SimpleFilter so that we can find partition files with various extensions.
protected  class PartitionManager.HouseKeeper
          An inner class that does housekeeping duties during querying.
 class PartitionManager.Merger
          A threadable class used for merging a list of partitions.
 
Field Summary
protected  java.io.File activeFile
          The file containing the list of active partitions.
protected  FileLock activeLock
          A lock for our active file.
protected  java.util.List<DiskPartition> activeParts
          The list of partitions that we're managing.
protected  SearchEngineImpl engine
          The search engine that is using us.
protected  java.util.List<java.lang.String> fieldsToLoad
          A list of field names that should be loaded into main memory for faster processing.
protected  IndexConfig indexConfig
          The index configuration for the index we'll be managing.
protected  java.lang.String indexDir
          The directory where the index is.
protected  java.io.File indexDirFile
          The File containing the directory where the index is held.
protected  PartitionManager.HouseKeeper keeper
          A house keeper class.
protected  java.lang.Thread keeperThread
          A thread running the housekeeping duties.
protected  java.util.Date lastPurgeTime
          The last time that a purge was called.
protected  java.io.File lockDirFile
          The directory where locks will be put.
protected  java.lang.String logTag
          The tag for this module.
protected  java.util.List<DiskPartition> mergedParts
          A list of partitions that have been merged.
protected  FileLock mergeLock
          A lock for the collection so that only one merge may be ongoing at any time.
protected  int mergeRate
          The rate of partition merges: a larger value means fewer merges, faster indexing, more partitions, and slower queries.
protected  int mergeSpace
          The amount of space (in bytes) that we're willing to devote to buffering postings entries during merges.
protected  java.lang.Thread mergeThread
          A thread to be used during merge operations.
protected  MetaFile metaFile
          The MetaFile containing the number for the next partition to write and the field name maps.
protected  java.lang.String name
          The configuration name for this partition manager.
static java.lang.String PROP_ACTIVE_CHECK_INTERVAL
           
static java.lang.String PROP_ASYNC_MERGES
           
static java.lang.String PROP_CALCULATE_DVL
           
static java.lang.String PROP_INDEX_CONFIG
           
static java.lang.String PROP_LOCK_DIR
           
static java.lang.String PROP_MAX_MERGE_SIZE
           
static java.lang.String PROP_MERGE_RATE
           
static java.lang.String PROP_OPEN_PARTITION_HIGH_WATER_MARK
          A property for the maximum number of open partitions that we'll allow.
static java.lang.String PROP_OPEN_PARTITION_LOW_WATER_MARK
          A property for the "low water" number of open partitions that we'll allow.
static java.lang.String PROP_PART_CLOSE_DELAY
           
static java.lang.String PROP_PART_REAP_DELAY
           
static java.lang.String PROP_PARTITION_FACTORY
           
static java.lang.String PROP_REAP_DOES_NOTHING
           
static java.lang.String PROP_STARTING_DATA
          A configuration property that can be used to name an index directory whose contents should be copied into the current directory when it is created.
static java.lang.String PROP_TERMSTATS_DICT_FACTORY
           
protected  java.util.Timer queryTimer
          A timer that can be used during querying to time tasks.
protected  int randID
          A random number that we can use to tell the difference between various partition managers.
protected  java.lang.String subDir
          The subdirectory of the main index directory where we will put our partitions.
protected  java.util.List<Closeable> thingsToClose
          The list of parts to close.
 
Constructor Summary
PartitionManager()
          Instantiates a PartitionManager; configuration is supplied through newProperties.
 
Method Summary
 void addIndexListener(IndexListener il)
           
protected  void addNewPartition(DiskPartition dp, java.util.Set<java.lang.Object> keys)
           
protected  void addNewPartition(int partNumber, java.util.Set<java.lang.Object> keys)
          Adds a new partition to this manager.
 void checkHK()
          Checks to make sure that the housekeeper is still alive.
 void deleteDocument(java.lang.String key)
          Deletes a single document from whichever partition it is in.
 void deleteDocuments(java.util.List<java.lang.String> keys)
          Deletes a set of documents from whichever partitions they are in.
protected  void deleteKeys(java.util.Set<java.lang.Object> keys, java.util.List parts)
          Delete the given keys from the given list of partitions.
 java.util.List<DiskPartition> getActivePartitions()
          Returns a list of the currently active partitions.
 java.util.List getAllFieldValues(java.lang.String field, java.lang.String key)
          Gets all of the field values associated with a given field in a given document.
 boolean getCalculateDVL()
           
 double getDistance(int d1, int d2, java.lang.String name)
           
 double getDistance(java.lang.String k1, java.lang.String k2, java.lang.String name)
          Gets the distance between two documents, based on the values stored in a given feature vector saved field.
 DocKeyEntry getDocumentTerm(java.lang.String key)
          Gets a term from a document dictionary corresponding to the given key.
 DocumentVector getDocumentVector(java.lang.String key)
          Gets a document vector for the given document key.
 DocumentVector getDocumentVector(java.lang.String key, java.lang.String field)
          Gets a document vector for the given document key.
 DocumentVector getDocumentVector(java.lang.String key, WeightedField[] fields)
          Gets a composite document vector for the given document key.
 SearchEngine getEngine()
          Gets the search engine associated with this PartitionManager instance
 FieldInfo getFieldInfo(java.lang.String name)
          Gets the information for a named field.
 FieldIterator getFieldIterator(java.lang.String field)
          Gets an iterator for all the values in a field.
 FieldIterator getFieldIterator(java.lang.String field, boolean ignoreCase)
          Gets an iterator for all the values in a field.
 java.util.List<java.lang.String> getFieldNames()
           
 java.lang.Object getFieldValue(java.lang.String field, java.lang.String key)
          Gets a single field value associated with a given field in a given document.
 IndexConfig getIndexConfig()
          Gets the index configuration for this manager.
 java.lang.String getIndexDir()
          Get the directory where the index is.
 java.util.Date getLastPurgeTime()
           
 java.lang.String getLockDir()
           
 java.util.SortedSet<FieldValue> getMatching(java.lang.String field, java.lang.String pattern)
          Gets the values for the given field that match the given pattern.
 PartitionManager.Merger getMerger()
          Gets an instance of the merger class that can be used to merge any partitions that require it.
 PartitionManager.Merger getMerger(java.util.List<DiskPartition> l)
          Gets an instance of the merger class in order to merge a list of partitions.
 PartitionManager.Merger getMerger(java.util.List<DiskPartition> l, FileLock localMergeLock)
          Gets an instance of the merger class in order to merge a list of partitions.
 PartitionManager.Merger getMergerFromNumbers(java.util.List<java.lang.Integer> l)
          Gets a merger that will merge the partitions represented by the given list of partition numbers.
 MetaFile getMetaFile()
          Get the meta file for this index.
 int getNActive()
          Returns the number of active partitions being managed.
 java.lang.String getName()
          Gets the name of the index.
 int getNDocs()
          Gets the total number of documents managed.
protected  int getNextPartitionNumber()
          Gets the next number to use for a partition.
 int getNFields()
           
 int getNTerms()
          Gets the total number of terms indexed.
 long getNTokens()
          Gets the total number of tokens indexed.
 int getPartCloseDelay()
           
protected  java.util.List<DiskPartition> getPartitions(java.util.List<java.lang.Integer> partNums)
          Gets a list of partitions from the corresponding partition numbers.
protected  java.util.List<java.lang.Integer> getPartNumbers(java.util.List<DiskPartition> parts)
          Gets a list of the partition numbers for the given partitions.
 QueryConfig getQueryConfig()
          Gets the query configuration for this manager.
 java.util.Timer getQueryTimer()
          Get a timer that can be used during querying to time tasks.
 int getRandID()
           
 ResultSet getSimilar(java.lang.String key, java.lang.String name)
          Gets a set of results ordered by similarity to the given document, calculated by computing the Euclidean distance based on the feature vector stored in the given field.
 TermStatsImpl getTermStats(java.lang.String name)
          Gets the term statistics for a term
 TermStatsDictionary getTermStatsDict()
          Gets the term statistics dictionary for this index.
 java.util.List<FieldFrequency> getTopFieldValues(java.lang.String field, int n, boolean ignoreCase)
          Gets a list of the top n most frequent field values for a given named field.
 boolean hasFieldedVectors()
          Indicates whether this index uses fielded document vectors.
protected  void init()
          Initializes the PartitionManager.
 boolean isCasedIndex()
          Indicates whether this index uses a cased main dictionary.
 boolean isIndexed(java.lang.String key)
          Checks to see if a document is in the index.
protected  java.io.File makeActiveFile()
          Makes a File for the active file.
 java.io.File makeDeletedDocsFile(int partNumber)
          Makes a File for the file containing the bitmap of deleted documents.
static java.io.File makeDeletedDocsFile(java.lang.String iD, int partNumber)
          Makes a File for the file containing the bitmap of deleted documents.
 java.io.File makeDictionaryFile(int partNumber, java.lang.String type)
          Makes a File for a dictionary.
static java.io.File makeDictionaryFile(java.lang.String iD, int partNumber, java.lang.String type)
          Makes a File for a dictionary.
protected  java.io.File makeMetaFile()
          Makes a File for the meta file.
 java.io.File makePostingsFile(int partNumber, java.lang.String type)
          Makes a File for a postings file.
 java.io.File makePostingsFile(int partNumber, java.lang.String type, int number)
          Makes a File for a postings file.
static java.io.File makePostingsFile(java.lang.String iD, int partNumber, java.lang.String type, int number)
          Makes a File for a postings file.
 java.io.File makeRemovedPartitionFile(int partNumber)
          Makes a File that we'll use to indicate that this partition has been merged away.
static java.io.File makeRemovedPartitionFile(java.lang.String iD, int partNumber)
          Makes a File that we'll use to indicate that this partition has been merged away.
 java.io.File makeTaxonomyFile(int partNumber)
           
static java.io.File makeTaxonomyFile(java.lang.String iD, int partNumber)
           
protected  java.io.File makeTermStatsFile(int tsn)
          Makes a File for the global term stats.
 java.io.File makeVectorLengthFile(int partNumber)
          Makes a File for the file containing the lengths of document vectors.
static java.io.File makeVectorLengthFile(java.lang.String iD, int partNumber)
          Makes a File for the file containing the lengths of document vectors.
protected  DiskPartition merge(java.util.List<DiskPartition> parts)
          Merges together the partitions in the provided list.
 DiskPartition mergeAll()
          Merges all partitions from the active list into a new partition.
 java.util.List<DiskPartition> mergeGeometric()
          This is a geometric merge heuristic controlled by the mergeRate.
protected  DiskPartition mergeInPieces(java.util.List<DiskPartition> parts)
          Breaks a list of partitions into blocks of mergeBlockSize, and merges those.
protected  DiskPartition newDiskPartition(java.lang.Integer partNum, PartitionManager m)
          Instantiates a disk partition of the correct type for this manager
 void newProperties(com.sun.labs.util.props.PropertySheet ps)
           
 void noMoreMerges()
           
 void purge()
          Purges the collection.
protected  java.util.List<java.lang.Integer> readActiveFile()
          Reads the numbers of the active partitions from the active file.
protected  DiskPartition realMerge(java.util.List<DiskPartition> diskParts, boolean calculateDVL)
          Merges a list of partitions.
 void reap()
          Reaps deleted partitions from the collection.
protected  void reapPartition(int partNumber)
          A method to reap a single partition.
 void recalculateTermStats()
          Regenerates the term stats for the currently active partitions.
static void recover(java.lang.String iD)
          Recovers an index directory.
 void removeIndexListener(IndexListener il)
           
 void setEngine(SearchEngineImpl engine)
          Sets the search engine associated with this partition manager.
 void setLockDir(java.lang.String lockDir)
           
 void setMergeRate(int rate)
          Sets the rate of partition merges during indexing.
 void setPartCloseDelay(int partCloseDelay)
           
 void shutdown()
          Shuts down the manager.
protected  void startHK()
          Starts the housekeeping thread if it's null or dead.
 java.util.List<DiskPartition> updateActiveParts(boolean addNew)
          Reads the active file and adds any new partitions to our active list.
protected  void updateTermStats()
           
protected  void writeActiveFile(java.util.List<DiskPartition> parts)
          Writes a list of partition numbers to the active file.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

engine

protected SearchEngineImpl engine
The search engine that is using us.


indexConfig

protected IndexConfig indexConfig
The index configuration for the index we'll be managing.


queryTimer

protected java.util.Timer queryTimer
A timer that can be used during querying to time tasks.


name

protected java.lang.String name
The configuration name for this partition manager.


lastPurgeTime

protected java.util.Date lastPurgeTime
The last time that a purge was called. If new partitions start dumping before this time, they shouldn't be added to the active list.


lockDirFile

protected java.io.File lockDirFile
The directory where locks will be put.


activeFile

protected java.io.File activeFile
The file containing the list of active partitions.


activeLock

protected FileLock activeLock
A lock for our active file. We'll use this as a global lock.


mergeLock

protected FileLock mergeLock
A lock for the collection so that only one merge may be ongoing at any time.


activeParts

protected java.util.List<DiskPartition> activeParts
The list of partitions that we're managing.


thingsToClose

protected java.util.List<Closeable> thingsToClose
The list of parts to close.


mergedParts

protected java.util.List<DiskPartition> mergedParts
A list of partitions that have been merged.


fieldsToLoad

protected java.util.List<java.lang.String> fieldsToLoad
A list of field names that should be loaded into main memory for faster processing.


mergeThread

protected java.lang.Thread mergeThread
A thread to be used during merge operations.


keeper

protected PartitionManager.HouseKeeper keeper
A house keeper class.


keeperThread

protected java.lang.Thread keeperThread
A thread running the housekeeping duties.


metaFile

protected MetaFile metaFile
The MetaFile containing the number for the next partition to write and the field name maps.


indexDir

protected java.lang.String indexDir
The directory where the index is. Shared by all PartitionManagers in a given directory.


indexDirFile

protected java.io.File indexDirFile
The File containing the directory where the index is held.


mergeSpace

protected int mergeSpace
The amount of space (in bytes) that we're willing to devote to buffering postings entries during merges. The default is 5MB.


mergeRate

protected int mergeRate
The rate of partition merges: a larger value means fewer merges, faster indexing, more partitions, and slower queries.


randID

protected int randID
A random number that we can use to tell the difference between various partition managers.


logTag

protected java.lang.String logTag
The tag for this module.


subDir

protected java.lang.String subDir
The subdirectory of the main index directory where we will put our partitions.


PROP_PARTITION_FACTORY

@ConfigComponent(type=DiskPartitionFactory.class)
public static final java.lang.String PROP_PARTITION_FACTORY
See Also:
Constant Field Values

PROP_INDEX_CONFIG

@ConfigComponent(type=IndexConfig.class)
public static final java.lang.String PROP_INDEX_CONFIG
See Also:
Constant Field Values

PROP_MERGE_RATE

@ConfigInteger(defaultValue=5)
public static final java.lang.String PROP_MERGE_RATE
See Also:
Constant Field Values

PROP_OPEN_PARTITION_HIGH_WATER_MARK

@ConfigInteger(defaultValue=40)
public static final java.lang.String PROP_OPEN_PARTITION_HIGH_WATER_MARK
A property for the maximum number of open partitions that we'll allow. Once this number of open partitions is crossed, no new partitions will be allowed to be dumped until the number of partitions can be decreased. Generally speaking, this is an exceptional condition and it will cause indexing to slow down substantially. We need to monitor this because having too many open partitions can lead to running out of file handles.

See Also:
Constant Field Values

PROP_OPEN_PARTITION_LOW_WATER_MARK

@ConfigInteger(defaultValue=20)
public static final java.lang.String PROP_OPEN_PARTITION_LOW_WATER_MARK
A property for the "low water" number of open partitions that we'll allow. Once the high-water mark for open partitions has been reached, no more partitions may be added to the index until the number of open partitions can be decreased below this value.
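
The interaction between the two water marks can be sketched as a small hysteresis gate. This is an illustration under assumptions, not the manager's implementation: OpenPartitionGate and its methods are hypothetical names, and the sketch treats "reaching" the high-water mark as the open count becoming equal to it. The annotated defaults are 20 for the low-water mark and 40 for the high-water mark.

```java
// Hypothetical gate illustrating high/low water-mark hysteresis.
final class OpenPartitionGate {
    private final int lowWater;
    private final int highWater;
    private int open;          // current number of open partitions
    private boolean blocked;   // true once the high-water mark has been reached

    OpenPartitionGate(int lowWater, int highWater) {
        if (lowWater < 0 || lowWater >= highWater) {
            throw new IllegalArgumentException("need 0 <= lowWater < highWater");
        }
        this.lowWater = lowWater;
        this.highWater = highWater;
    }

    // Tries to open one more partition; returns false while the gate is blocked.
    synchronized boolean tryOpen() {
        if (blocked) {
            return false;
        }
        open++;
        if (open >= highWater) {
            blocked = true;
        }
        return true;
    }

    // Records that a partition was closed, for example after a merge.
    synchronized void close() {
        if (open > 0) {
            open--;
        }
        // The gate reopens only once the count falls below the low-water mark.
        if (blocked && open < lowWater) {
            blocked = false;
        }
    }

    synchronized int openCount() {
        return open;
    }

    synchronized boolean isBlocked() {
        return blocked;
    }
}
```

Using two thresholds rather than one keeps the gate from flapping open and shut when the count hovers around a single limit.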

See Also:
Constant Field Values

PROP_MAX_MERGE_SIZE

@ConfigInteger(defaultValue=20)
public static final java.lang.String PROP_MAX_MERGE_SIZE
See Also:
Constant Field Values

PROP_ACTIVE_CHECK_INTERVAL

@ConfigInteger(defaultValue=1000)
public static final java.lang.String PROP_ACTIVE_CHECK_INTERVAL
See Also:
Constant Field Values

PROP_PART_CLOSE_DELAY

@ConfigInteger(defaultValue=15000)
public static final java.lang.String PROP_PART_CLOSE_DELAY
See Also:
Constant Field Values

PROP_ASYNC_MERGES

@ConfigBoolean(defaultValue=true)
public static final java.lang.String PROP_ASYNC_MERGES
See Also:
Constant Field Values

PROP_PART_REAP_DELAY

@ConfigInteger(defaultValue=15000)
public static final java.lang.String PROP_PART_REAP_DELAY
See Also:
Constant Field Values

PROP_CALCULATE_DVL

@ConfigBoolean(defaultValue=true)
public static final java.lang.String PROP_CALCULATE_DVL
See Also:
Constant Field Values

PROP_LOCK_DIR

@ConfigString
public static final java.lang.String PROP_LOCK_DIR
See Also:
Constant Field Values

PROP_TERMSTATS_DICT_FACTORY

@ConfigComponent(type=TermStatsFactory.class,
                 mandatory=false)
public static final java.lang.String PROP_TERMSTATS_DICT_FACTORY
See Also:
Constant Field Values

PROP_REAP_DOES_NOTHING

@ConfigBoolean(defaultValue=false)
public static final java.lang.String PROP_REAP_DOES_NOTHING
See Also:
Constant Field Values

PROP_STARTING_DATA

@ConfigString(defaultValue="",
              mandatory=false)
public static final java.lang.String PROP_STARTING_DATA
A configuration property that can be used to name an index directory whose contents should be copied into the current directory when it is created. All data in that directory will be copied, whether it's index data or not.
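
Copying every file from a starting directory, index data or not, can be sketched with the standard java.nio.file API. This is an illustration only: StartingDataCopy and copyAll are hypothetical names, and how PartitionManager actually performs the copy is not documented here.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

// Hypothetical helper that copies a whole directory tree into a target directory.
final class StartingDataCopy {
    static void copyAll(Path source, Path target) throws IOException {
        // Files.walk visits a directory before its contents, so parents exist first.
        try (Stream<Path> paths = Files.walk(source)) {
            paths.forEach(p -> {
                Path dest = target.resolve(source.relativize(p));
                try {
                    if (Files.isDirectory(p)) {
                        Files.createDirectories(dest);
                    } else {
                        // Fails if dest already exists; the target is assumed to be new.
                        Files.copy(p, dest);
                    }
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
        }
    }
}
```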

See Also:
Constant Field Values
Constructor Detail

PartitionManager

public PartitionManager()
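Instantiates a PartitionManager. The constructor takes no arguments; as a Configurable component, the manager receives its configuration through newProperties(PropertySheet).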

Method Detail

addIndexListener

public void addIndexListener(IndexListener il)

removeIndexListener

public void removeIndexListener(IndexListener il)

init

protected void init()
Initializes the PartitionManager. Allows the directory to be parameterized so that a subclass can use a different directory.


startHK

protected void startHK()
Starts the housekeeping thread if it's null or dead.


readActiveFile

protected java.util.List<java.lang.Integer> readActiveFile()
                                                    throws java.io.IOException,
                                                           FileLockException
Reads the numbers of the active partitions from the active file.

Returns:
a list of the numbers of the active partitions.
Throws:
java.io.IOException - if there are any errors reading the active file.
FileLockException - if there is any error locking or unlocking the active file.

writeActiveFile

protected void writeActiveFile(java.util.List<DiskPartition> parts)
                        throws java.io.IOException,
                               FileLockException
Writes a list of partition numbers to the active file. Locks the active file if necessary.

Parameters:
parts - The list of partitions or partition numbers to write to the active file.
Throws:
java.io.IOException - if there is an error writing the file.
FileLockException - if there is any error locking the active file.

updateActiveParts

public java.util.List<DiskPartition> updateActiveParts(boolean addNew)
                                                throws java.lang.Exception
Reads the active file and adds any new partitions to our active list.

Parameters:
addNew - Whether newly activated partitions should be added to the active list immediately.
Returns:
The list of newly opened partitions.
Throws:
java.lang.Exception - if anything goes wrong.

addNewPartition

protected void addNewPartition(int partNumber,
                               java.util.Set<java.lang.Object> keys)
Adds a new partition to this manager. We will remove the given keys from older partitions.

Parameters:
partNumber - The number of the partition.
keys - A set of document keys for the documents indexed into the new partition. They will be removed from the old partitions.

addNewPartition

protected void addNewPartition(DiskPartition dp,
                               java.util.Set<java.lang.Object> keys)

getNActive

public int getNActive()
Returns the number of active partitions being managed.

Returns:
the number of active partitions

getActivePartitions

public java.util.List<DiskPartition> getActivePartitions()
Returns a list of the currently active partitions. Note that these partitions may be closed if you hang onto this list for a long time!

Returns:
the active partitions

newDiskPartition

protected DiskPartition newDiskPartition(java.lang.Integer partNum,
                                         PartitionManager m)
                                  throws java.io.IOException
Instantiates a disk partition of the correct type for this manager

Parameters:
partNum - the partition number
m - the manager
Returns:
the new partition
Throws:
java.io.IOException - if there is any error opening the partition

getQueryTimer

public java.util.Timer getQueryTimer()
Get a timer that can be used during querying to time tasks.

Returns:
the query timer

getIndexDir

public java.lang.String getIndexDir()
Get the directory where the index is. Shared by all PartitionManagers in a given directory.

Returns:
the index dir

setMergeRate

public void setMergeRate(int rate)
Sets the rate of partition merges during indexing.

Parameters:
rate - Controls the rate of merging. Must be >= 2. Lower values lead to fewer partitions, faster searches, more merges, and slower indexing.
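
The trade-off controlled by the rate can be seen in a small simulation. The policy below is a generic level-based geometric merge and is an assumption for illustration; it is not taken from mergeGeometric, whose exact heuristic is not described here. GeometricMergeSim and simulate are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical simulation of a level-based geometric merge policy.
final class GeometricMergeSim {
    // Simulates dumping `dumps` partitions and returns
    // {partitions remaining, merges performed}.
    static int[] simulate(int dumps, int rate) {
        if (rate < 2) {
            throw new IllegalArgumentException("rate must be >= 2");
        }
        // levels.get(i) is the number of partitions currently at level i.
        List<Integer> levels = new ArrayList<>();
        int merges = 0;
        for (int d = 0; d < dumps; d++) {
            int level = 0;
            while (true) {
                if (levels.size() <= level) {
                    levels.add(0);
                }
                levels.set(level, levels.get(level) + 1);
                if (levels.get(level) < rate) {
                    break;
                }
                // `rate` partitions at this level merge into one at the next level.
                levels.set(level, 0);
                merges++;
                level++;
            }
        }
        int parts = 0;
        for (int count : levels) {
            parts += count;
        }
        return new int[] {parts, merges};
    }
}
```

Under this policy, eight dumps at rate 2 leave one partition after seven merges, while the same dumps at rate 4 leave two partitions after two merges, which matches the direction of the trade-off stated for the parameter.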

isIndexed

public boolean isIndexed(java.lang.String key)
Checks to see if a document is in the index.

Parameters:
key - the key for the document that we wish to check.
Returns:
true if the document is in the index. A document is considered to be in the index if a document with the given key appears in the index and has not been deleted.
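
The membership rule (the key is present and the document has not been deleted) can be sketched with an in-memory stand-in for a partition. TinyPartition is hypothetical and greatly simplified; the use of a BitSet for deletions is an assumption loosely modeled on the deleted-documents bitmap mentioned elsewhere on this page.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory stand-in for a partition with a deletion bitmap.
final class TinyPartition {
    private final Map<String, Integer> docIds = new HashMap<>();
    private final BitSet deleted = new BitSet();

    // Assigns the next document ID to a key that has not been seen before.
    void add(String key) {
        docIds.putIfAbsent(key, docIds.size());
    }

    // Marks the document as deleted without removing its dictionary entry.
    void delete(String key) {
        Integer id = docIds.get(key);
        if (id != null) {
            deleted.set(id);
        }
    }

    // True if the key is present here and its document is not deleted.
    boolean contains(String key) {
        Integer id = docIds.get(key);
        return id != null && !deleted.get(id);
    }

    // A document is in the index if any partition holds a live copy of it.
    static boolean isIndexed(List<TinyPartition> parts, String key) {
        for (TinyPartition p : parts) {
            if (p.contains(key)) {
                return true;
            }
        }
        return false;
    }
}
```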

deleteDocument

public void deleteDocument(java.lang.String key)
Deletes a single document from whichever partition it is in.

Parameters:
key - The document key for the document to be deleted.

deleteDocuments

public void deleteDocuments(java.util.List<java.lang.String> keys)
Deletes a set of documents from whichever partitions they are in.

Parameters:
keys - The list of keys of the documents to be deleted.

deleteKeys

protected void deleteKeys(java.util.Set<java.lang.Object> keys,
                          java.util.List parts)
Delete the given keys from the given list of partitions. This will force the deletion bitmaps to be flushed to the disk.

Parameters:
keys - The set of document keys to delete.
parts - The partitions to delete the keys from.

getDocumentTerm

public DocKeyEntry getDocumentTerm(java.lang.String key)
Gets a term from a document dictionary corresponding to the given key. Terms associated with documents that have been deleted are ignored.

Parameters:
key - the key for which we want a term from the document dictionary
Returns:
the entry, or null if this key does not appear in the index or if the associated document has been deleted

getDocumentVector

public DocumentVector getDocumentVector(java.lang.String key)
Gets a document vector for the given document key.

Parameters:
key - the key of the document for which we want a vector
Returns:
the document vector for the document with the given key, or null if this key does not appear in the index or if the associated document has been deleted.

getDocumentVector

public DocumentVector getDocumentVector(java.lang.String key,
                                        java.lang.String field)
Gets a document vector for the given document key.

Parameters:
key - the key of the document for which we want a vector
field - the field for which we want a document vector. If this parameter is null, then a vector containing the terms from all vectored fields in the document is returned. If this value is the empty string, then a vector for the contents of the document that are not in any field is returned. If this value is the name of a field that was not vectored during indexing, an empty vector will be returned.
Returns:
the document vector for the document with the given key, or null if this key does not appear in the index or if the associated document has been deleted.

getDocumentVector

public DocumentVector getDocumentVector(java.lang.String key,
                                        WeightedField[] fields)
Gets a composite document vector for the given document key.

Parameters:
key - the key of the document for which we want a vector
fields - the fields for which we want a document vector.
Returns:
the document vector for the document with the given key, or null if this key does not appear in the index or if the associated document has been deleted.

getSimilar

public ResultSet getSimilar(java.lang.String key,
                            java.lang.String name)
Gets a set of results ordered by similarity to the given document, calculated by computing the Euclidean distance based on the feature vector stored in the given field.

Parameters:
key - the key of the document to which we'll compute similarity.
name - the name of the field containing the feature vectors that we'll use in the similarity computation.
Returns:
a result set containing the distance between the given document and all of the documents. The scores assigned to the documents are the distance scores, and so the returned set will be sorted in increasing order of the document score. It is up to the application to handle the scores in whatever way it deems appropriate.
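
Because the scores are distances, smaller means more similar. The sketch below shows one way an application might handle such scores: order keys by increasing distance and, if a similarity-style score is wanted, map each distance to 1 / (1 + distance). DistanceRanking and both of its methods are hypothetical, and the conversion is one choice among many, not something the library prescribes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical application-side handling of distance scores.
final class DistanceRanking {
    // Returns the document keys ordered by increasing distance (closest first).
    static List<String> closestFirst(Map<String, Double> distances) {
        List<Map.Entry<String, Double>> entries = new ArrayList<>(distances.entrySet());
        entries.sort(Map.Entry.comparingByValue());
        List<String> keys = new ArrayList<>();
        for (Map.Entry<String, Double> e : entries) {
            keys.add(e.getKey());
        }
        return keys;
    }

    // Maps a non-negative distance to a score in [0, 1], where 1 means identical.
    static double toSimilarity(double distance) {
        return 1.0 / (1.0 + distance);
    }
}
```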

getDistance

public double getDistance(int d1,
                          int d2,
                          java.lang.String name)

getDistance

public double getDistance(java.lang.String k1,
                          java.lang.String k2,
                          java.lang.String name)
Gets the distance between two documents, based on the values stored in a given feature vector saved field.

Parameters:
k1 - the first key
k2 - the second key
name - the name of the feature vector field for which we want the distance
Returns:
the Euclidean distance between the two documents' feature vectors. If the field value is not defined for either of the two documents, Double.POSITIVE_INFINITY is returned.
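
The computation described in the return value can be sketched as follows. FeatureDistance is a hypothetical helper that works on plain double arrays; a null array stands in for a document whose field value is not defined. Rejecting vectors of different lengths is an assumption of the sketch.

```java
// Hypothetical helper computing the Euclidean distance between feature vectors.
final class FeatureDistance {
    static double euclidean(double[] a, double[] b) {
        // An undefined feature vector yields an infinite distance.
        if (a == null || b == null) {
            return Double.POSITIVE_INFINITY;
        }
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have the same length");
        }
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }
}
```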

getMatching

public java.util.SortedSet<FieldValue> getMatching(java.lang.String field,
                                                   java.lang.String pattern)
Gets the values for the given field that match the given pattern.

Parameters:
field - the saved, string field against whose values we will match. If the named field is not saved or is not a string field, then the empty set will be returned.
pattern - the pattern for which we'll find matching field values.
Returns:
a sorted set of field values. This set will be ordered by the proportion of the field value that is covered by the given pattern.
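
Ordering by the proportion of a value that the pattern covers can be illustrated with plain substring matching. That matching rule is an assumption made for the sketch; the pattern syntax getMatching accepts is not specified here. CoverageMatch and its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical ranking of values by how much of each value the pattern covers.
final class CoverageMatch {
    // Fraction of the value covered by the pattern, or 0 if there is no match.
    static double coverage(String value, String pattern) {
        if (value.isEmpty() || pattern.isEmpty() || !value.contains(pattern)) {
            return 0.0;
        }
        return (double) pattern.length() / value.length();
    }

    // Returns the matching values, highest coverage first, ties in natural order.
    static List<String> matching(List<String> values, String pattern) {
        List<String> hits = new ArrayList<>();
        for (String v : values) {
            if (coverage(v, pattern) > 0.0) {
                hits.add(v);
            }
        }
        hits.sort((a, b) -> {
            int c = Double.compare(coverage(b, pattern), coverage(a, pattern));
            return c != 0 ? c : a.compareTo(b);
        });
        return hits;
    }
}
```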

getFieldIterator

public FieldIterator getFieldIterator(java.lang.String field)
Gets an iterator for all the values in a field. The values are returned by the iterator in the order defined by the field type.

Parameters:
field - The name of the field whose values we need an iterator for.
Returns:
An iterator for the given field. If the field is not a saved field, then an iterator that will return no values will be returned.

getFieldIterator

public FieldIterator getFieldIterator(java.lang.String field,
                                      boolean ignoreCase)
Gets an iterator for all the values in a field. The values are returned by the iterator in the order defined by the field type.

Parameters:
field - The name of the field whose values we need an iterator for.
ignoreCase - whether the iterator should ignore case when returning results
Returns:
An iterator for the given field. If the field is not a saved field, then an iterator that will return no values will be returned.

getAllFieldValues

public java.util.List getAllFieldValues(java.lang.String field,
                                        java.lang.String key)
Gets all of the field values associated with a given field in a given document.

Parameters:
field - The name of the field for which we want the values.
key - The key of the document whose values we want.
Returns:
A List containing values of the appropriate type. If the named field is not a saved field, or if the given document key is not in the index, then an empty list is returned.

getTopFieldValues

public java.util.List<FieldFrequency> getTopFieldValues(java.lang.String field,
                                                        int n,
                                                        boolean ignoreCase)
Gets a list of the top n most frequent field values for a given named field. If n is < 1, all field values are returned, in order of their frequency from most to least frequent.

Parameters:
field - the name of the field to rank
n - the number of field values to return
Returns:
a List containing field values of the appropriate type for the field, ordered by frequency
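
The ranking described above can be sketched with ordinary collections. This is an illustration only: TopValuesSketch and top are hypothetical names, the input is a flat list of strings rather than an index, and folding case with toLowerCase is an assumption about what ignoreCase means.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class TopValuesSketch {
    private TopValuesSketch() {}

    /**
     * Returns (value, frequency) pairs ordered from most to least frequent.
     * If n is less than 1, all values are returned.
     */
    static List<Map.Entry<String, Integer>> top(List<String> values, int n, boolean ignoreCase) {
        // LinkedHashMap keeps first-seen order, so ties stay deterministic.
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String v : values) {
            String key = ignoreCase ? v.toLowerCase(Locale.ROOT) : v;
            counts.merge(key, 1, Integer::sum);
        }
        List<Map.Entry<String, Integer>> ranked = new ArrayList<>(counts.entrySet());
        // List.sort is stable; sort by descending frequency.
        ranked.sort(Map.Entry.<String, Integer>comparingByValue().reversed());
        if (n < 1 || n >= ranked.size()) {
            return ranked;
        }
        return new ArrayList<>(ranked.subList(0, n));
    }
}
```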

getNFields

public int getNFields()

getFieldValue

public java.lang.Object getFieldValue(java.lang.String field,
                                      java.lang.String key)
Gets a single field value associated with a given field in a given document.

Parameters:
field - The name of the field for which we want the values.
key - The key of the document whose values we want.
Returns:
An Object of the appropriate type for the named field. If the named field is not a saved field, or if the given document key is not in the index, then null is returned.

Note that if there are multiple values for the given field, there is no guarantee which of the values will be returned by this method.

See Also:
getAllFieldValues(java.lang.String, java.lang.String)

reap

public void reap()
Reaps deleted partitions from the collection. Walks the partition files and deletes any partition whose removed file has existed for long enough.
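
The selection step can be pictured as a simple age check. The sketch below is an assumption-laden illustration: ReapSketch and reapable are hypothetical names, and the map of partition number to removal timestamp stands in for whatever the implementation reads from the removed files on disk.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class ReapSketch {
    private ReapSketch() {}

    /**
     * Picks the partition numbers whose "removed" marker is at least
     * delayMillis old, in ascending partition-number order.
     */
    static List<Integer> reapable(Map<Integer, Long> removedAtMillis, long nowMillis, long delayMillis) {
        List<Integer> result = new ArrayList<>();
        // TreeMap copy gives a stable, ascending iteration order.
        for (Map.Entry<Integer, Long> e : new TreeMap<>(removedAtMillis).entrySet()) {
            if (nowMillis - e.getValue() >= delayMillis) {
                result.add(e.getKey());
            }
        }
        return result;
    }
}
```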


reapPartition

protected void reapPartition(int partNumber)
A method to reap a single partition. This can be overridden in a subclass so that the reap method will work for the super and subclass.

Parameters:
partNumber - the number of the partition to reap.

purge

public void purge()
Purges the collection. Closes and removes all partitions from the active list and writes the list.


getIndexConfig

public IndexConfig getIndexConfig()
Gets the index configuration for this manager.

Returns:
the configuration for this index

getQueryConfig

public QueryConfig getQueryConfig()
Gets the query configuration for this manager.

Returns:
the query configuration for this index

getNDocs

public int getNDocs()
Gets the total number of documents managed.

Returns:
the total number of documents

getNTokens

public long getNTokens()
Gets the total number of tokens indexed.

Returns:
the number of tokens represented by the index.

getNTerms

public int getNTerms()
Gets the total number of terms indexed.

Returns:
the number of unique terms in the indexed material

checkHK

public void checkHK()
Checks to make sure that the housekeeper is still alive.


shutdown

public void shutdown()
              throws java.io.IOException
Shuts down the manager. Mostly this consists of writing the list of active partitions. This requires the file to be locked.

Throws:
java.io.IOException - if there is an error writing the active file of partitions or closing one of the partitions.

recover

public static void recover(java.lang.String iD)
                    throws java.io.IOException
Recovers an index directory. This method should not be called when other processes are modifying the index, or really bad stuff will happen.

Parameters:
iD - The index directory to recover.
Throws:
java.io.IOException - if there are any errors recovering the directory

makeActiveFile

protected java.io.File makeActiveFile()
Makes a File for the active file.

Returns:
a file for the active file for this index

makeTermStatsFile

protected java.io.File makeTermStatsFile(int tsn)
Makes a File for the global term stats.

Parameters:
tsn - the number of the term statistics file to make
Returns:
a file for the global term statistics for this index

makeMetaFile

protected java.io.File makeMetaFile()
Makes a File for the meta file.

Returns:
a file for the meta file for this index

makeDictionaryFile

public static java.io.File makeDictionaryFile(java.lang.String iD,
                                              int partNumber,
                                              java.lang.String type)
Makes a File for a dictionary.

Parameters:
iD - The index directory
partNumber - The number of the partition for which we're making a dictionary File.
type - The dictionary type.
Returns:
A File initialized with an appropriate path name.

makeDictionaryFile

public java.io.File makeDictionaryFile(int partNumber,
                                       java.lang.String type)
Makes a File for a dictionary.

Parameters:
partNumber - The number of the partition for which we're making a dictionary File.
type - the dictionary type
Returns:
A File initialized with an appropriate path name.

makePostingsFile

public static java.io.File makePostingsFile(java.lang.String iD,
                                            int partNumber,
                                            java.lang.String type,
                                            int number)
Makes a File for a postings file.

Parameters:
iD - The index directory
partNumber - The number of the partition for which we're making a postings file
type - The type of postings file
number - The number of the postings file. If this is less than 0, it will be ignored.
Returns:
A File initialized with an appropriate path name.

makePostingsFile

public java.io.File makePostingsFile(int partNumber,
                                     java.lang.String type,
                                     int number)
Makes a File for a postings file.

Parameters:
partNumber - The number of the partition for which we're making a postings file
type - The type of postings file
number - The number of the postings file. If this is less than 0, it will be ignored.
Returns:
A File initialized with an appropriate path name.

makePostingsFile

public java.io.File makePostingsFile(int partNumber,
                                     java.lang.String type)
Makes a File for a postings file.

Parameters:
partNumber - The number of the partition for which we're making a postings file
type - The type of postings file
Returns:
A File initialized with an appropriate path name.

makeDeletedDocsFile

public static java.io.File makeDeletedDocsFile(java.lang.String iD,
                                               int partNumber)
Makes a File for the file containing the bitmap of deleted documents.

Parameters:
iD - The index directory
partNumber - The number of the partition for which we're making a deleted documents File.
Returns:
A File initialized with an appropriate path name.

makeDeletedDocsFile

public java.io.File makeDeletedDocsFile(int partNumber)
Makes a File for the file containing the bitmap of deleted documents.

Parameters:
partNumber - The number of the partition for which we're making a deleted documents File.
Returns:
A File initialized with an appropriate path name.

makeVectorLengthFile

public java.io.File makeVectorLengthFile(int partNumber)
Makes a File for the file containing the lengths of document vectors.

Parameters:
partNumber - The number of the partition for which we're making a vector length File.
Returns:
A File initialized with an appropriate path name.

makeVectorLengthFile

public static java.io.File makeVectorLengthFile(java.lang.String iD,
                                                int partNumber)
Makes a File for the file containing the lengths of document vectors.

Parameters:
iD - The index directory
partNumber - The number of the partition for which we're making a vector length File.
Returns:
A File initialized with an appropriate path name.

makeRemovedPartitionFile

public static java.io.File makeRemovedPartitionFile(java.lang.String iD,
                                                    int partNumber)
Makes a File that we'll use to indicate that this partition has been merged away.

Parameters:
iD - The index directory
partNumber - The number of the partition for which we're making a removed File.
Returns:
A File initialized with an appropriate path name.

makeRemovedPartitionFile

public java.io.File makeRemovedPartitionFile(int partNumber)
Makes a File that we'll use to indicate that this partition has been merged away.

Parameters:
partNumber - The number of the partition for which we're making a removed File.
Returns:
A File initialized with an appropriate path name.

makeTaxonomyFile

public static java.io.File makeTaxonomyFile(java.lang.String iD,
                                            int partNumber)

makeTaxonomyFile

public java.io.File makeTaxonomyFile(int partNumber)

getEngine

public SearchEngine getEngine()
Gets the search engine associated with this PartitionManager instance.

Returns:
the search engine

setEngine

public void setEngine(SearchEngineImpl engine)
Sets the search engine associated with this partition manager.

Parameters:
engine - the engine associated with this manager

noMoreMerges

public void noMoreMerges()

mergeAll

public DiskPartition mergeAll()
Merges all partitions from the active list into a new partition. The merged partitions are removed from the active list, and the new partition is then placed there.

Returns:
the merged partition. This may be an existing partition!

mergeGeometric

public java.util.List<DiskPartition> mergeGeometric()
This is a geometric merge heuristic controlled by the mergeRate. It determines which partitions on the active list can be merged.

Returns:
The list of partitions that should be merged, or null if none should be merged.
See Also:
setMergeRate(int)

merge

protected DiskPartition merge(java.util.List<DiskPartition> parts)
Merges together the partitions in the provided list. Merging is done in blocks of mergeBlockSize partitions, ordered by the partition number. We do this in order to avoid problems with running out of file handles for the files making up the partitions.

Parameters:
parts - a list of partitions to merge
Returns:
the merged partition

mergeInPieces

protected DiskPartition mergeInPieces(java.util.List<DiskPartition> parts)
Breaks a list of partitions into blocks of mergeBlockSize, and merges those. This method will work recursively if there are enough blocks to justify it.

Parameters:
parts - the partitions that we want to merge
Returns:
the merged partition
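
The block-wise recursion described above can be sketched as follows. This is not the library's implementation: BlockMergeSketch is a hypothetical class, each "partition" is modelled as a list of integers, and the stand-in realMerge simply concatenates its inputs. The point is only the control flow, where at most blockSize inputs are merged at a time (which is what bounds the number of open files in the real case).

```java
import java.util.ArrayList;
import java.util.List;

final class BlockMergeSketch {
    private BlockMergeSketch() {}

    /** Base case: merge a small list of partitions directly (here, by concatenation). */
    static List<Integer> realMerge(List<List<Integer>> parts) {
        List<Integer> merged = new ArrayList<>();
        for (List<Integer> p : parts) {
            merged.addAll(p);
        }
        return merged;
    }

    /** Merges in blocks of blockSize, recursing until one block remains. */
    static List<Integer> mergeInPieces(List<List<Integer>> parts, int blockSize) {
        if (blockSize < 2) {
            // A block size of 1 would never reduce the number of partitions.
            throw new IllegalArgumentException("blockSize must be at least 2");
        }
        if (parts.size() <= blockSize) {
            return realMerge(parts);
        }
        List<List<Integer>> next = new ArrayList<>();
        for (int i = 0; i < parts.size(); i += blockSize) {
            int end = Math.min(i + blockSize, parts.size());
            next.add(realMerge(parts.subList(i, end)));
        }
        // The intermediate results are themselves merged block by block.
        return mergeInPieces(next, blockSize);
    }
}
```

With seven single-element partitions and a block size of 3, the first pass produces three intermediate partitions, which then fit in a single final merge.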

realMerge

protected DiskPartition realMerge(java.util.List<DiskPartition> diskParts,
                                  boolean calculateDVL)
Merges a list of partitions. This is the base case for the recursive merge.

Parameters:
diskParts - the actual partitions to merge
calculateDVL - if true, calculate the document vector lengths for the documents in the merged partition
Returns:
the partition resulting from the merge

getNextPartitionNumber

protected int getNextPartitionNumber()
Gets the next number to use for a partition. This requires locking the partition number file for our entry type. Errors during the reading, writing, or locking of this file will cause a random partition number between 500,000 and 1,000,000 to be generated.

Returns:
the partition number for the next partition to write.
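
One way to picture the lock-then-increment behaviour, including the random fallback, is the sketch below. It is a sketch under stated assumptions: PartitionNumberSketch and nextNumber are hypothetical names, the counter is stored as a single binary int (the real file format is not documented here), and java.nio file locking stands in for whatever locking class the library uses.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.channels.FileLock;
import java.nio.channels.OverlappingFileLockException;
import java.nio.file.Path;
import java.util.concurrent.ThreadLocalRandom;

final class PartitionNumberSketch {
    private PartitionNumberSketch() {}

    /**
     * Reads, increments, and rewrites a counter while holding an exclusive
     * file lock. On any read, write, or lock error, falls back to a random
     * number in [500000, 1000000), mirroring the documented fallback range.
     */
    static int nextNumber(Path counterFile) {
        try (RandomAccessFile raf = new RandomAccessFile(counterFile.toFile(), "rw");
             FileLock lock = raf.getChannel().lock()) {
            // An empty or freshly created file counts as "no partitions yet".
            int current = raf.length() >= Integer.BYTES ? raf.readInt() : 0;
            int next = current + 1;
            raf.seek(0);
            raf.writeInt(next);
            return next;
        } catch (IOException | OverlappingFileLockException e) {
            return ThreadLocalRandom.current().nextInt(500_000, 1_000_000);
        }
    }
}
```

The try-with-resources statement releases the lock before closing the file, so the counter is never left locked if an exception is thrown midway.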

getPartNumbers

protected java.util.List<java.lang.Integer> getPartNumbers(java.util.List<DiskPartition> parts)
Gets a list of the partition numbers for the given partitions. This can be used when we want to act on collections of partition numbers, rather than the partitions themselves.

Parameters:
parts - the partitions for which we want the numbers
Returns:
a list of the active partition numbers.

getPartitions

protected java.util.List<DiskPartition> getPartitions(java.util.List<java.lang.Integer> partNums)
Gets a list of partitions from the corresponding partition numbers. Partitions will be loaded as necessary, and numbers for partitions that do not exist will be logged and ignored.

Parameters:
partNums - the partition numbers for the partitions that we want.
Returns:
a list of the partitions corresponding to the numbers given.
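
The "log and ignore" handling of unknown partition numbers can be sketched like this. PartitionLookupSketch and lookup are hypothetical, and a plain Map stands in for loading partitions from disk, so this shows only the skipping behaviour, not the loading.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.logging.Logger;

final class PartitionLookupSketch {
    private static final Logger LOG =
            Logger.getLogger(PartitionLookupSketch.class.getName());

    private PartitionLookupSketch() {}

    /** Resolves numbers to partitions, logging and skipping unknown numbers. */
    static <P> List<P> lookup(List<Integer> partNums, Map<Integer, P> known) {
        List<P> found = new ArrayList<>();
        for (Integer n : partNums) {
            P p = known.get(n);
            if (p == null) {
                LOG.warning("No partition with number " + n + "; ignoring");
                continue;
            }
            found.add(p);
        }
        return found;
    }
}
```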

getMetaFile

public MetaFile getMetaFile()
Get the meta file for this index.

Returns:
the meta file for this index.

getFieldNames

public java.util.List<java.lang.String> getFieldNames()

getFieldInfo

public FieldInfo getFieldInfo(java.lang.String name)
Gets the information for a named field.

Parameters:
name - the name of the field for which we want information
Returns:
the field information for the named field, or null if there is no field with the given name

getName

public java.lang.String getName()
Gets the name of the index.

Returns:
the configuration name of the index

getLastPurgeTime

public java.util.Date getLastPurgeTime()

getMerger

public PartitionManager.Merger getMerger()
Gets an instance of the merger class that can be used to merge any partitions that require it. If no merge is currently required, then null is returned.

Returns:
An instance of Merger that can be used to merge these partitions, or null if no merge is currently possible.

getMerger

public PartitionManager.Merger getMerger(java.util.List<DiskPartition> l)
Gets an instance of the merger class in order to merge a list of partitions.

Parameters:
l - A list of partitions to merge, such as that returned by mergeGeometric.
Returns:
An instance of Merger that can be used to merge these partitions, or null if no merge is currently possible.

getMerger

public PartitionManager.Merger getMerger(java.util.List<DiskPartition> l,
                                         FileLock localMergeLock)
Gets an instance of the merger class in order to merge a list of partitions.

Parameters:
l - A list of partitions to merge, such as that returned by mergeGeometric.
localMergeLock - the lock to use for running the merge.
Returns:
An instance of Merger that can be used to merge these partitions, or null if no merge is currently possible.

getMergerFromNumbers

public PartitionManager.Merger getMergerFromNumbers(java.util.List<java.lang.Integer> l)
Gets a merger that will merge the partitions represented by the given list of partition numbers.

Parameters:
l - a list of the numbers of some partitions that we would like to merge
Returns:
a merger for the partitions corresponding to the numbers in the list

getRandID

public int getRandID()

newProperties

public void newProperties(com.sun.labs.util.props.PropertySheet ps)
                   throws com.sun.labs.util.props.PropertyException
Specified by:
newProperties in interface com.sun.labs.util.props.Configurable
Throws:
com.sun.labs.util.props.PropertyException

isCasedIndex

public boolean isCasedIndex()
Indicates whether this index uses a cased main dictionary.

Returns:
true if the index stores cased information, false otherwise.

hasFieldedVectors

public boolean hasFieldedVectors()
Indicates whether this index uses fielded document vectors.

Returns:
true if the document vectors for this index contain field information, false otherwise.

getPartCloseDelay

public int getPartCloseDelay()

setPartCloseDelay

public void setPartCloseDelay(int partCloseDelay)

getCalculateDVL

public boolean getCalculateDVL()

getLockDir

public java.lang.String getLockDir()

setLockDir

public void setLockDir(java.lang.String lockDir)

getTermStatsDict

public TermStatsDictionary getTermStatsDict()
Gets the term statistics dictionary for this index.

Returns:
the term statistics dictionary for this index

getTermStats

public TermStatsImpl getTermStats(java.lang.String name)
Gets the term statistics for a term.

Parameters:
name - the name of the term for which we want term statistics
Returns:
the statistics associated with the given name, or an empty set of term statistics if there are none for the given name

recalculateTermStats

public void recalculateTermStats()
                          throws java.io.IOException,
                                 FileLockException
Regenerates the term stats for the currently active partitions. This can be used after modifications have been made to an index manually.

Throws:
java.io.IOException - if there is any error writing the new term stats.
FileLockException - if there is an error locking the meta file to get the number for the next term stats dictionary.

updateTermStats

protected void updateTermStats()
                        throws java.io.IOException,
                               FileLockException
Throws:
java.io.IOException
FileLockException