com.sun.labs.minion.indexer.partition
Class InvFileDiskPartition

java.lang.Object
  extended by com.sun.labs.minion.indexer.partition.Partition
      extended by com.sun.labs.minion.indexer.partition.DiskPartition
          extended by com.sun.labs.minion.indexer.partition.InvFileDiskPartition
All Implemented Interfaces:
Closeable, com.sun.labs.util.props.Component, com.sun.labs.util.props.Configurable, java.lang.Comparable<Partition>

public class InvFileDiskPartition
extends DiskPartition

A disk partition that holds data that is specific to the implementation of an inverted file. It extends the disk partition to add bigrams and a field store to the main and document dictionaries already present in the superclass.


Field Summary
protected  DiskBiGramDictionary bigramDict
          Bigrams from the main dictionary.
protected  DictionaryFactory bigramDictFactory
          A factory for bigram dictionaries that will be used by the main dictionary and by the field store.
protected  java.io.RandomAccessFile bigramDictFile
          The stream for the bigram dictionaries.
protected  long bigramDictOffset
          The offset of the bigrams in the main dictionary.
protected  java.io.RandomAccessFile bigramPostFile
          The stream for the bigram postings.
protected  java.io.RandomAccessFile fieldDictFile
          The stream for the field store dictionaries.
protected  java.io.RandomAccessFile fieldPostFile
          The stream for the field store postings.
protected  DiskFieldStore fields
          The field store.
protected  DictionaryFactory fieldStoreDictFactory
          A factory for the dictionaries that the field store will use for saved field values.
protected static java.lang.String logTag
           
protected  DiskDictionary ngrams
          The ngram dictionary.
protected  DiskTaxonomy taxonomy
          A disk taxonomy, if one exists.
 
Fields inherited from class com.sun.labs.minion.indexer.partition.DiskPartition
BUFF_SIZE, deletions, delFile, delFileLock, docDict, docDictFile, docPostFile, documentDictFactory, dvl, ignored, mainDict, mainFiles, MATCH_CUT_OFF, MIN_LEN, removedFile, termCache
 
Fields inherited from class com.sun.labs.minion.indexer.partition.Partition
DICT_OFFSETS_SIZE, docDictFactory, entryClass, entryName, indexConfig, mainDictFactory, mainDictFile, mainPostFiles, manager, maxID, nEntries, partNumber, PROP_DOC_DICT_FACTORY, PROP_INDEX_CONFIG, PROP_MAIN_DICT_FACTORY, PROP_PARTITION_MANAGER, stats
 
Constructor Summary
InvFileDiskPartition(int partNumber, PartitionManager manager, DictionaryFactory mainDictFactory, DictionaryFactory documentDictFactory, DictionaryFactory fieldStoreDictFactory, DictionaryFactory bigramDictFactory, boolean cacheVectorLengths, int termCacheSize)
          Opens a partition with a given number.
 
Method Summary
 boolean close(long currTime)
          Close the files associated with this partition.
 double[] euclideanDistance(double[] vec, java.lang.String field)
          Computes the Euclidean distance between the given vector and all documents.
 void export(java.io.PrintWriter o)
          Exports the data in this partition to an XML file format.
protected  java.io.File[] getAllFiles()
          Gets all the files associated with a partition, including those specific to the inverted file.
protected static java.io.File[] getAllFiles(PartitionManager manager, int partNumber)
          Gets all the files associated with a partition, including those specific to the inverted file.
protected  java.io.File[] getBigramFiles()
          Gets the files associated with the bigram postings for a partition.
 int getFieldCount()
          Gets the number of defined fields.
protected  java.io.File[] getFieldFiles()
          Gets the files associated with the field store for a partition.
 DictionaryIterator getFieldIterator(java.lang.String name)
          Gets an iterator for all of the values in a field.
 DictionaryIterator getFieldIterator(java.lang.String name, boolean caseSensitive, java.lang.Object lowerBound, boolean includeLower, java.lang.Object upperBound, boolean includeUpper)
          Gets an iterator for the values in a given range in a field.
 PostingsIterator getFieldPostings(java.lang.String name, java.lang.Object value, boolean caseSensitive)
          Gets the postings associated with a particular field value.
 int getFieldSize(java.lang.String name)
           
 DiskFieldStore getFieldStore()
          Gets the field store associated with this partition.
 QueryEntry[] getMatching(java.lang.String pat, boolean caseSensitive, int maxEntries, long timeLimit)
          Gets the entries matching the given pattern.
 DictionaryIterator getMatchingIterator(java.lang.String name, java.lang.String val, boolean caseSensitive)
          Gets an iterator for the character saved field values that match a given wildcard pattern.
 java.lang.Object getSavedFieldData(FieldInfo fi, int docID, boolean all)
           
 java.util.List getSavedFieldData(java.lang.String name, int docID)
          Gets all of the data saved in a given field.
 java.lang.Object getSavedFieldData(java.lang.String name, int docID, boolean all)
          Gets some or all of the data saved in a given field.
 java.util.List getSavedFieldData(java.lang.String name, java.lang.String key)
          Gets all of the data saved in a given field, in a given document.
 java.lang.Object getSavedFieldData(java.lang.String name, java.lang.String key, boolean all)
          Gets some or all of the data saved in a given field, in a given document.
 java.util.Map<java.lang.String,java.util.List> getSavedFields(int docID)
          Gets a map of all the saved fields in a document.
 QueryEntry[] getSpellingVariants(java.lang.String pat, boolean caseSensitive, int maxEntries, long timeLimit)
          Gets the spelling variants of a term.
 QueryEntry[] getStemMatches(java.lang.String term, boolean caseSensitive, int minLen, float matchCutOff, int maxEntries, long timeLimit)
          Gets the entries that match the stem of the given term.
 QueryEntry[] getStemMatches(java.lang.String term, boolean caseSensitive, int maxEntries, long timeLimit)
          Gets the entries that match the stem of the given term.
 QueryEntry[] getSubstring(java.lang.String pat, boolean caseSensitive, int maxEntries, long timeLimit)
          Gets the entries containing the given substring.
 DictionaryIterator getSubstringIterator(java.lang.String name, java.lang.String val, boolean caseSensitive, boolean starts, boolean ends)
          Gets an iterator for the character saved field values that contain a given substring.
 java.util.Set getSubsumed(java.lang.String name)
          Gets the entries subsumed by a given name.
 DiskTaxonomy getTaxonomy()
           
protected  void initAll()
          Initializes everything all at once.
protected  void initBigramDict()
          Initializes the bigram dictionary, if necessary.
protected  void initFields()
          Initializes the field store, if necessary.
protected  void initTaxonomy()
          Initializes the taxonomy, should one be necessary.
protected  void mergeCustom(int newPartNumber, DiskPartition[] sortedParts, int[][] idMaps, int newMaxDocID, int[] docIDStart, int[] nUndel, int[][] docIDMaps)
          Provides a place to merge data that is specific to a subclass of disk partition.
protected static void reap(PartitionManager m, int n)
          Reaps the given partition.
 
Methods inherited from class com.sun.labs.minion.indexer.partition.DiskPartition
close, createRemoveFile, delete, deleteDocument, deleteDocument, docsAreMerged, getAverageDocumentLength, getCloseTime, getDeletedDocumentsMap, getDelMap, getDocIDMap, getDocumentIterator, getDocumentIterator, getDocumentLength, getDocumentTerm, getDocumentTerm, getDocumentVectorLength, getDocumentVectorLength, getDocumentVectorLength, getDVL, getInputBuffers, getMainDictionary, getMainDictionaryIterator, getMainDictionaryIterator, getMainIterator, getMaxDocumentID, getMaxTermID, getNDocs, getNEntries, getNTokens, getTerm, getTerm, getTerm, getTerm, getTermCache, initDocDict, initDVL, initMainDict, initMainFiles, isDeleted, isIndexed, merge, merge, normalize, setCloseTime, syncDeletedMap, toString, updatePartition
 
Methods inherited from class com.sun.labs.minion.indexer.partition.Partition
compareTo, getDocFiles, getDocFiles, getIndexConfig, getMainFiles, getMainFiles, getManager, getName, getNumPostingsChannels, getPartitionNumber, getQueryConfig, getStats, newProperties
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Field Detail

fieldStoreDictFactory

protected DictionaryFactory fieldStoreDictFactory
A factory for the dictionaries that the field store will use for saved field values.


bigramDictFactory

protected DictionaryFactory bigramDictFactory
A factory for bigram dictionaries that will be used by the main dictionary and by the field store.


bigramDict

protected DiskBiGramDictionary bigramDict
Bigrams from the main dictionary.


taxonomy

protected DiskTaxonomy taxonomy
A disk taxonomy, if one exists.


bigramDictOffset

protected long bigramDictOffset
The offset of the bigrams in the main dictionary.


fields

protected DiskFieldStore fields
The field store.


ngrams

protected DiskDictionary ngrams
The ngram dictionary.


bigramDictFile

protected java.io.RandomAccessFile bigramDictFile
The stream for the bigram dictionaries.


bigramPostFile

protected java.io.RandomAccessFile bigramPostFile
The stream for the bigram postings.


fieldDictFile

protected java.io.RandomAccessFile fieldDictFile
The stream for the field store dictionaries.


fieldPostFile

protected java.io.RandomAccessFile fieldPostFile
The stream for the field store postings.


logTag

protected static java.lang.String logTag

Constructor Detail

InvFileDiskPartition

public InvFileDiskPartition(int partNumber,
                            PartitionManager manager,
                            DictionaryFactory mainDictFactory,
                            DictionaryFactory documentDictFactory,
                            DictionaryFactory fieldStoreDictFactory,
                            DictionaryFactory bigramDictFactory,
                            boolean cacheVectorLengths,
                            int termCacheSize)
                     throws java.io.IOException
Opens a partition with a given number.

Parameters:
partNumber - the number of this partition.
manager - the manager for this partition.
mainDictFactory - a factory that will be used to generate the main dictionary for this partition
documentDictFactory - a factory that will be used to generate the document dictionary for this partition
fieldStoreDictFactory - a factory that will be used to generate the dictionaries in the field store
bigramDictFactory - a factory that will be used to generate the bigram dictionaries needed for this partition
Throws:
java.io.IOException - If there is an error opening or reading any of the files making up a partition.
See Also:
Partition, Dictionary

Method Detail

initAll

protected void initAll()
                throws java.io.IOException
Initializes everything all at once.

Overrides:
initAll in class DiskPartition
Throws:
java.io.IOException - if there was an error reading the files

initBigramDict

protected void initBigramDict()
Initializes the bigram dictionary, if necessary.


initFields

protected void initFields()
Initializes the field store, if necessary.


initTaxonomy

protected void initTaxonomy()
Initializes the taxonomy, should one be necessary. A taxonomy is initialized if the manager's indexConfig reports that one is necessary.


getSavedFieldData

public java.lang.Object getSavedFieldData(java.lang.String name,
                                          java.lang.String key,
                                          boolean all)
Gets some or all of the data saved in a given field, in a given document.

Parameters:
name - The name of the field.
key - The document key of the document for which we want data.
all - If true, all field values will be returned as a list. If false only the first value will be returned.
Returns:
If all is true, a list of the values saved in the given field in the given document; if all is false, only the first such value. Returns null if the given key is not in this partition.

getSavedFieldData

public java.lang.Object getSavedFieldData(java.lang.String name,
                                          int docID,
                                          boolean all)
Gets some or all of the data saved in a given field.

Parameters:
name - The name of the field.
docID - The document ID for which we want the saved data.
all - If true, return all known values for the field in the given document. If false return only one value.
Returns:
If all is true, a List of field values; otherwise, a single field value of the appropriate type.

If the given name is not the name of a saved field, or the document ID is invalid, then if all is true, an empty list will be returned. If all is false, null will be returned.
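
The return contract above can be illustrated with a small stand-alone sketch. The class SavedFieldSketch and its in-memory maps are hypothetical and are not part of Minion; the sketch also assumes that the single value returned when all is false is the first one saved, which the documentation states only for the key-based overload.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical in-memory stand-in for a field store; not part of Minion. */
class SavedFieldSketch {
    // field name -> (document ID -> values saved for that document)
    private final Map<String, Map<Integer, List<Object>>> saved = new HashMap<>();

    /** Records one more value for the named field in the given document. */
    void save(String name, int docID, Object value) {
        saved.computeIfAbsent(name, k -> new HashMap<>())
             .computeIfAbsent(docID, k -> new ArrayList<>())
             .add(value);
    }

    /**
     * Mirrors the documented contract: a List when all is true, otherwise a
     * single value. Unknown fields or document IDs give an empty list
     * (all == true) or null (all == false).
     */
    Object getSavedFieldData(String name, int docID, boolean all) {
        Map<Integer, List<Object>> byDoc = saved.get(name);
        List<Object> values = (byDoc == null) ? null : byDoc.get(docID);
        if (values == null || values.isEmpty()) {
            return all ? Collections.emptyList() : null;
        }
        // Assumption: "one value" means the first value that was saved.
        return all ? Collections.unmodifiableList(values) : values.get(0);
    }
}
```

Because the declared return type is Object, a caller has to cast the result to a List when all is true and to the field's value type when all is false.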


getSavedFieldData

public java.util.List getSavedFieldData(java.lang.String name,
                                        int docID)
Gets all of the data saved in a given field.

Parameters:
name - The name of the field.
docID - The document ID for which we want the saved data.
Returns:
a List of field values of the appropriate type. If the given name is not the name of a saved field, or the document ID is invalid, then an empty list is returned.

getSavedFieldData

public java.util.List getSavedFieldData(java.lang.String name,
                                        java.lang.String key)
Gets all of the data saved in a given field, in a given document.

Parameters:
name - The name of the field.
key - The document key of the document for which we want data.
Returns:
The field values for the given document, as a List.

If the given name is not the name of a saved field, or the document ID is invalid, then an empty list will be returned.


getSavedFieldData

public java.lang.Object getSavedFieldData(FieldInfo fi,
                                          int docID,
                                          boolean all)

getSavedFields

public java.util.Map<java.lang.String,java.util.List> getSavedFields(int docID)
Gets a map of all the saved fields in a document.


getFieldIterator

public DictionaryIterator getFieldIterator(java.lang.String name)
Gets an iterator for all of the values in a field.

Parameters:
name - The name of the field we need an iterator for.
Returns:
An iterator for the values in the field.

getFieldIterator

public DictionaryIterator getFieldIterator(java.lang.String name,
                                           boolean caseSensitive,
                                           java.lang.Object lowerBound,
                                           boolean includeLower,
                                           java.lang.Object upperBound,
                                           boolean includeUpper)
Gets an iterator for the values in a given range in a field.

Parameters:
name - The name of the field we need an iterator for.
caseSensitive - If true, case should be taken into account when iterating through the values. This value will only be observed for character fields!
lowerBound - The lower bound on the iterator. If null, only the upper bound is considered and the iteration will commence with the first term in the dictionary.
includeLower - If true, then the lower bound will be included in the entries returned by the iterator, if it occurs in the dictionary.
upperBound - The upper bound on the iterator. If null, only the lower bound is considered and the iteration will end at the last term in the dictionary.
includeUpper - If true, then the upper bound will be included in the entries returned by the iterator, if it occurs in the dictionary.
Returns:
An iterator for the dictionary entries contained in the range, or null if there is no such range or the named field is not a saved field.
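
The bound handling described by the parameters above can be sketched with the standard java.util.NavigableSet API. RangeIteratorSketch is a hypothetical helper, not Minion code; it assumes the values have a natural ordering, treats two null bounds as a full iteration, and returns null when the lower bound sorts after the upper bound, which is one reading of "no such range".

```java
import java.util.Iterator;
import java.util.NavigableSet;

/** Hypothetical helper illustrating open and closed range bounds. */
class RangeIteratorSketch {
    /**
     * Returns an iterator over the values between the two bounds.
     * A null bound leaves that end of the range open.
     */
    static <T extends Comparable<T>> Iterator<T> range(
            NavigableSet<T> values,
            T lowerBound, boolean includeLower,
            T upperBound, boolean includeUpper) {
        if (lowerBound == null && upperBound == null) {
            // Assumption: no bounds at all means iterate over everything.
            return values.iterator();
        }
        if (lowerBound == null) {
            // Start at the first value and stop at the upper bound.
            return values.headSet(upperBound, includeUpper).iterator();
        }
        if (upperBound == null) {
            // Start at the lower bound and run to the last value.
            return values.tailSet(lowerBound, includeLower).iterator();
        }
        if (lowerBound.compareTo(upperBound) > 0) {
            return null; // inverted bounds: there is no such range
        }
        return values.subSet(lowerBound, includeLower,
                             upperBound, includeUpper).iterator();
    }
}
```

For a set holding 1, 3, 5, 7 and 9, calling range(values, 3, true, 7, false) iterates over 3 and 5 only.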

getMatchingIterator

public DictionaryIterator getMatchingIterator(java.lang.String name,
                                              java.lang.String val,
                                              boolean caseSensitive)
Gets an iterator for the character saved field values that match a given wildcard pattern.

Parameters:
name - The name of the field whose values we wish to match against.
val - The wildcard value against which we will match.
caseSensitive - If true, then case will be taken into account during the match.

getSubstringIterator

public DictionaryIterator getSubstringIterator(java.lang.String name,
                                               java.lang.String val,
                                               boolean caseSensitive,
                                               boolean starts,
                                               boolean ends)
Gets an iterator for the character saved field values that contain a given substring.

Parameters:
name - The name of the field whose values we wish to match against.
val - The substring against which we will match.
caseSensitive - If true, then case will be taken into account during the match.

getFieldPostings

public PostingsIterator getFieldPostings(java.lang.String name,
                                         java.lang.Object value,
                                         boolean caseSensitive)
Gets the postings associated with a particular field value.

Parameters:
name - The name of the field for which we want postings.
value - The value from the field for which we want postings.
caseSensitive - If true, case should be taken into account when iterating through the values. This value will only be observed for character fields!
Returns:
The postings associated with that value, or null if there is no such value in the field.

getFieldCount

public int getFieldCount()
Gets the number of defined fields.


getFieldStore

public DiskFieldStore getFieldStore()
Gets the field store associated with this partition.


getFieldSize

public int getFieldSize(java.lang.String name)

getSubsumed

public java.util.Set getSubsumed(java.lang.String name)
Gets the entries subsumed by a given name.

Parameters:
name - the name for which we want subsumed entries.
Returns:
the subsumed entries, or null if this name is not in the main dictionary.

euclideanDistance

public double[] euclideanDistance(double[] vec,
                                  java.lang.String field)
Computes the Euclidean distance between the given vector and all documents. The distance is based on the features stored in the saved field with the given name.
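
As a rough illustration of the computation, the sketch below measures the distance from one vector to each of a set of per-document feature vectors. EuclideanSketch and the docFeatures array are hypothetical; the real method reads its features from the named saved field, and how it does so is not documented here.

```java
/** Hypothetical stand-alone version of a one-against-all distance computation. */
class EuclideanSketch {
    /**
     * Returns the Euclidean distance from vec to each row of docFeatures.
     * docFeatures[i] is assumed to hold the feature vector of document i.
     */
    static double[] euclideanDistance(double[] vec, double[][] docFeatures) {
        double[] dist = new double[docFeatures.length];
        for (int i = 0; i < docFeatures.length; i++) {
            double[] doc = docFeatures[i];
            if (doc.length != vec.length) {
                throw new IllegalArgumentException(
                        "dimension mismatch at document " + i);
            }
            // Sum the squared differences, then take the square root.
            double sum = 0.0;
            for (int j = 0; j < vec.length; j++) {
                double d = vec[j] - doc[j];
                sum += d * d;
            }
            dist[i] = Math.sqrt(sum);
        }
        return dist;
    }
}
```

The result has one entry per document, in the same order as the rows of docFeatures.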


mergeCustom

protected void mergeCustom(int newPartNumber,
                           DiskPartition[] sortedParts,
                           int[][] idMaps,
                           int newMaxDocID,
                           int[] docIDStart,
                           int[] nUndel,
                           int[][] docIDMaps)
                    throws java.lang.Exception
Description copied from class: DiskPartition
Provides a place to merge data that is specific to a subclass of disk partition. This method will be called after the disk partition data is merged, but inside the try block for the whole merge.

Overrides:
mergeCustom in class DiskPartition
Parameters:
newPartNumber - the number of the new partition
sortedParts - the sorted list of partitions
idMaps - a set of maps from old entry ids in the main dictionary to new entry ids in the merged dictionary
newMaxDocID - the new maximum document id
docIDStart - the starting doc ids
nUndel - the number of undeleted documents in each partition
docIDMaps - doc id maps (see merge)
Throws:
java.lang.Exception
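
A subclass that stores main dictionary entry ids in its own data would need to translate them through idMaps during the merge. The sketch below shows that translation only. IdMapSketch is hypothetical, and it assumes that idMaps[p][oldId] holds the new id for entry oldId of partition p, an indexing convention this page does not confirm.

```java
/** Hypothetical helper that rewrites entry ids through a merge id map. */
class IdMapSketch {
    /**
     * Translates ids from one source partition into ids in the merged
     * dictionary. idMap is assumed to be indexed by the old id.
     */
    static int[] remap(int[] oldIds, int[] idMap) {
        int[] newIds = new int[oldIds.length];
        for (int i = 0; i < oldIds.length; i++) {
            newIds[i] = idMap[oldIds[i]];
        }
        return newIds;
    }
}
```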

close

public boolean close(long currTime)
Close the files associated with this partition.

Specified by:
close in interface Closeable
Overrides:
close in class DiskPartition
Parameters:
currTime - the current time
Returns:
true if the partition was closed, false otherwise.

getMatching

public QueryEntry[] getMatching(java.lang.String pat,
                                boolean caseSensitive,
                                int maxEntries,
                                long timeLimit)
Gets the entries matching the given pattern.

Parameters:
pat - The pattern to match entries against.
caseSensitive - If true, then do the lookup in a case sensitive fashion.
maxEntries - The maximum number of entries to return. If zero or negative, return all possible entries.
timeLimit - The maximum amount of time (in milliseconds) to spend trying to find matches. If zero or negative, no time limit is imposed.
Returns:
An array containing the matching entries, or null if there are no such entries, or an array of length zero if the operation timed out before any entries could be matched.
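
The three possible results (a populated array, null, or an empty array after a timeout) and the handling of maxEntries and timeLimit can be sketched as follows. MatchingSketch is hypothetical: it scans a plain list of strings rather than a dictionary, and it assumes a wildcard syntax in which * matches any run of characters and ? matches a single character, which this page does not specify.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical sketch of the documented maxEntries/timeLimit contract. */
class MatchingSketch {
    /** Converts a wildcard pattern using * and ? (an assumed syntax) to a regex. */
    static Pattern globToRegex(String pat, boolean caseSensitive) {
        StringBuilder sb = new StringBuilder();
        for (char c : pat.toCharArray()) {
            if (c == '*') {
                sb.append(".*");
            } else if (c == '?') {
                sb.append('.');
            } else {
                sb.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(sb.toString(),
                caseSensitive ? 0 : Pattern.CASE_INSENSITIVE);
    }

    static String[] getMatching(List<String> terms, String pat,
            boolean caseSensitive, int maxEntries, long timeLimit) {
        Pattern p = globToRegex(pat, caseSensitive);
        long deadline = System.currentTimeMillis() + timeLimit;
        List<String> hits = new ArrayList<>();
        boolean timedOut = false;
        for (String term : terms) {
            // A zero or negative timeLimit means no limit is imposed.
            if (timeLimit > 0 && System.currentTimeMillis() > deadline) {
                timedOut = true;
                break;
            }
            if (p.matcher(term).matches()) {
                hits.add(term);
                // A zero or negative maxEntries means return everything.
                if (maxEntries > 0 && hits.size() >= maxEntries) {
                    break;
                }
            }
        }
        if (hits.isEmpty()) {
            // Empty array: timed out before any match. Null: nothing matches.
            return timedOut ? new String[0] : null;
        }
        return hits.toArray(new String[0]);
    }
}
```

Callers therefore need to check for null before checking the array length, since the two empty outcomes mean different things.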

getSpellingVariants

public QueryEntry[] getSpellingVariants(java.lang.String pat,
                                        boolean caseSensitive,
                                        int maxEntries,
                                        long timeLimit)
Gets the spelling variants of a term.

Parameters:
pat - The pattern to match entries against.
caseSensitive - If true, then do the lookup in a case sensitive fashion.
maxEntries - The maximum number of entries to return. If zero or negative, return all possible entries.
timeLimit - The maximum amount of time (in milliseconds) to spend trying to find matches. If zero or negative, no time limit is imposed.
Returns:
An array containing the spelling variants, or null if there are no such entries, or an array of length zero if the operation timed out before any entries could be matched.

getSubstring

public QueryEntry[] getSubstring(java.lang.String pat,
                                 boolean caseSensitive,
                                 int maxEntries,
                                 long timeLimit)
Gets the entries containing the given substring.

Parameters:
pat - The pattern to match entries against.
caseSensitive - If true, then do the lookup in a case sensitive fashion.
maxEntries - The maximum number of entries to return. If zero or negative, return all possible entries.
timeLimit - The maximum amount of time (in milliseconds) to spend trying to find matches. If zero or negative, no time limit is imposed.
Returns:
An array containing the matching entries, or null if there are no such entries, or an array of length zero if the operation timed out before any entries could be matched.

getStemMatches

public QueryEntry[] getStemMatches(java.lang.String term,
                                   boolean caseSensitive,
                                   int maxEntries,
                                   long timeLimit)
Gets the entries that match the stem of the given term. Uses the default minimum length and match cutoff values.

Parameters:
term - The term we want to get variants of.
caseSensitive - If true, then do the lookup in a case sensitive fashion.
maxEntries - The maximum number of entries to return. If zero or negative, return all possible entries.
timeLimit - The maximum amount of time (in milliseconds) to spend trying to find matches. If zero or negative, no time limit is imposed.
Returns:
An array containing the matching entries, or null if there are no such entries.

getStemMatches

public QueryEntry[] getStemMatches(java.lang.String term,
                                   boolean caseSensitive,
                                   int minLen,
                                   float matchCutOff,
                                   int maxEntries,
                                   long timeLimit)
Gets the entries that match the stem of the given term.

Parameters:
term - The term we want to get variants of.
caseSensitive - If true, then do the lookup in a case sensitive fashion.
minLen - The minimum term length for stemming.
matchCutOff - The cutoff score for matching variants and the original term.
maxEntries - The maximum number of entries to return. If zero or negative, return all possible entries.
timeLimit - The maximum amount of time (in milliseconds) to spend trying to find matches. If zero or negative, no time limit is imposed.
Returns:
An array containing the matching entries, or null if there are no such entries.

getFieldFiles

protected java.io.File[] getFieldFiles()
Gets the files associated with the field store for a partition.

Returns:
an array of files. The first is the dictionary file, and the rest are the postings files.

getAllFiles

protected java.io.File[] getAllFiles()
Gets all the files associated with a partition, including those specific to the inverted file.

Overrides:
getAllFiles in class Partition
Returns:
an array of files

getAllFiles

protected static java.io.File[] getAllFiles(PartitionManager manager,
                                            int partNumber)
Gets all the files associated with a partition, including those specific to the inverted file.

Returns:
an array of files

reap

protected static void reap(PartitionManager m,
                           int n)
Reaps the given partition. If the postings file cannot be removed, then we return control immediately.

Parameters:
m - The manager associated with the partition.
n - The partition number to reap.

getBigramFiles

protected java.io.File[] getBigramFiles()
Gets the files associated with the bigram postings for a partition.

Returns:
an array of files. The first is the dictionary file, and the rest are the postings files.

getTaxonomy

public DiskTaxonomy getTaxonomy()
Returns:
the taxonomy.

export

public void export(java.io.PrintWriter o)
Exports the data in this partition to an XML file format.

Parameters:
o - the writer to which the data will be output.