InvFileDiskPartition (Minion Search Engine)


com.sun.labs.minion.indexer.partition
Class InvFileDiskPartition

java.lang.Object
  com.sun.labs.minion.indexer.partition.Partition
      com.sun.labs.minion.indexer.partition.DiskPartition
          com.sun.labs.minion.indexer.partition.InvFileDiskPartition

All Implemented Interfaces:: Closeable, com.sun.labs.util.props.Component, com.sun.labs.util.props.Configurable, java.lang.Comparable<Partition>

public class InvFileDiskPartition
extends DiskPartition

A disk partition that holds data that is specific to the implementation of an inverted file. It extends the disk partition to add bigrams and a field store to the main and document dictionaries already present in the superclass.

Field Summary
`protected DiskBiGramDictionary`	`bigramDict` Bigrams from the main dictionary.
`protected DictionaryFactory`	`bigramDictFactory` A factory for bigram dictionaries that will be used by the main dictionary and by the field store.
`protected java.io.RandomAccessFile`	`bigramDictFile` The stream for the bigram dictionaries.
`protected long`	`bigramDictOffset` The offset of the bigrams in the main dictionary.
`protected java.io.RandomAccessFile`	`bigramPostFile` The stream for the bigram postings.
`protected java.io.RandomAccessFile`	`fieldDictFile` The stream for the field store dictionaries.
`protected java.io.RandomAccessFile`	`fieldPostFile` The stream for the field store postings.
`protected DiskFieldStore`	`fields` The field store.
`protected DictionaryFactory`	`fieldStoreDictFactory` A factory for the dictionaries that the field store will use for saved field values.
`protected static java.lang.String`	`logTag`
`protected DiskDictionary`	`ngrams` The ngram dictionary.
`protected DiskTaxonomy`	`taxonomy` A disk taxonomy, if one exists.

Fields inherited from class com.sun.labs.minion.indexer.partition.DiskPartition
`BUFF_SIZE, deletions, delFile, delFileLock, docDict, docDictFile, docPostFile, documentDictFactory, dvl, ignored, mainDict, mainFiles, MATCH_CUT_OFF, MIN_LEN, removedFile, termCache`

Fields inherited from class com.sun.labs.minion.indexer.partition.Partition
`DICT_OFFSETS_SIZE, docDictFactory, entryClass, entryName, indexConfig, mainDictFactory, mainDictFile, mainPostFiles, manager, maxID, nEntries, partNumber, PROP_DOC_DICT_FACTORY, PROP_INDEX_CONFIG, PROP_MAIN_DICT_FACTORY, PROP_PARTITION_MANAGER, stats`

Constructor Summary
`InvFileDiskPartition(int partNumber, PartitionManager manager, DictionaryFactory mainDictFactory, DictionaryFactory documentDictFactory, DictionaryFactory fieldStoreDictFactory, DictionaryFactory bigramDictFactory, boolean cacheVectorLengths, int termCacheSize)` Opens a partition with a given number

Method Summary
`boolean`	`close(long currTime)` Close the files associated with this partition.
`double[]`	`euclideanDistance(double[] vec, java.lang.String field)` Computes the Euclidean distance between the given document and all documents.
`void`	`export(java.io.PrintWriter o)` Exports the data in this partition to an XML file format.
`protected java.io.File[]`	`getAllFiles()` Gets all the files associated with a partition, including those specific to the inverted file.
`protected static java.io.File[]`	`getAllFiles(PartitionManager manager, int partNumber)` Gets all the files associated with a partition, including those specific to the inverted file.
`protected java.io.File[]`	`getBigramFiles()` Gets the files associated with the bigram postings for a partition.
`int`	`getFieldCount()` Gets the number of defined fields.
`protected java.io.File[]`	`getFieldFiles()` Gets the files associated with the field store for a partition.
`DictionaryIterator`	`getFieldIterator(java.lang.String name)` Gets an iterator for all of the values in a field.
`DictionaryIterator`	`getFieldIterator(java.lang.String name, boolean caseSensitive, java.lang.Object lowerBound, boolean includeLower, java.lang.Object upperBound, boolean includeUpper)` Gets an iterator for the values in a given range in a field.
`PostingsIterator`	`getFieldPostings(java.lang.String name, java.lang.Object value, boolean caseSensitive)` Gets the postings associated with a particular field value.
`int`	`getFieldSize(java.lang.String name)`
`DiskFieldStore`	`getFieldStore()` Gets the field store associated with this partition.
`QueryEntry[]`	`getMatching(java.lang.String pat, boolean caseSensitive, int maxEntries, long timeLimit)` Gets the entries matching the given pattern
`DictionaryIterator`	`getMatchingIterator(java.lang.String name, java.lang.String val, boolean caseSensitive)` Gets an iterator for the character saved field values that match a given wildcard pattern.
`java.lang.Object`	`getSavedFieldData(FieldInfo fi, int docID, boolean all)`
`java.util.List`	`getSavedFieldData(java.lang.String name, int docID)` Gets all of the data saved in a given field.
`java.lang.Object`	`getSavedFieldData(java.lang.String name, int docID, boolean all)` Gets some or all of the data saved in a given field.
`java.util.List`	`getSavedFieldData(java.lang.String name, java.lang.String key)` Gets all of the data saved in a given field, in a given document.
`java.lang.Object`	`getSavedFieldData(java.lang.String name, java.lang.String key, boolean all)` Gets some or all of the data saved in a given field, in a given document.
`java.util.Map<java.lang.String,java.util.List>`	`getSavedFields(int docID)` Gets a map from field name to saved values for all the saved fields in a document.
`QueryEntry[]`	`getSpellingVariants(java.lang.String pat, boolean caseSensitive, int maxEntries, long timeLimit)` Gets the spelling variants of a term
`QueryEntry[]`	`getStemMatches(java.lang.String term, boolean caseSensitive, int minLen, float matchCutOff, int maxEntries, long timeLimit)` Gets the entries that match the stem of the given term.
`QueryEntry[]`	`getStemMatches(java.lang.String term, boolean caseSensitive, int maxEntries, long timeLimit)` Gets the entries that match the stem of the given term.
`QueryEntry[]`	`getSubstring(java.lang.String pat, boolean caseSensitive, int maxEntries, long timeLimit)` Gets the entries containing the given substring.
`DictionaryIterator`	`getSubstringIterator(java.lang.String name, java.lang.String val, boolean caseSensitive, boolean starts, boolean ends)` Gets an iterator for the character saved field values that contain a given substring.
`java.util.Set`	`getSubsumed(java.lang.String name)` Gets the entries subsumed by a given name.
`DiskTaxonomy`	`getTaxonomy()`
`protected void`	`initAll()` Initializes everything all at once.
`protected void`	`initBigramDict()` Initializes the bigram dictionary, if necessary.
`protected void`	`initFields()` Initializes the field store, if necessary.
`protected void`	`initTaxonomy()` Initializes the taxonomy, if one is necessary.
`protected void`	`mergeCustom(int newPartNumber, DiskPartition[] sortedParts, int[][] idMaps, int newMaxDocID, int[] docIDStart, int[] nUndel, int[][] docIDMaps)` Provides a place to merge data that is specific to a subclass of disk partition.
`protected static void`	`reap(PartitionManager m, int n)` Reaps the given partition.

Methods inherited from class com.sun.labs.minion.indexer.partition.DiskPartition
`close, createRemoveFile, delete, deleteDocument, deleteDocument, docsAreMerged, getAverageDocumentLength, getCloseTime, getDeletedDocumentsMap, getDelMap, getDocIDMap, getDocumentIterator, getDocumentIterator, getDocumentLength, getDocumentTerm, getDocumentTerm, getDocumentVectorLength, getDocumentVectorLength, getDocumentVectorLength, getDVL, getInputBuffers, getMainDictionary, getMainDictionaryIterator, getMainDictionaryIterator, getMainIterator, getMaxDocumentID, getMaxTermID, getNDocs, getNEntries, getNTokens, getTerm, getTerm, getTerm, getTerm, getTermCache, initDocDict, initDVL, initMainDict, initMainFiles, isDeleted, isIndexed, merge, merge, normalize, setCloseTime, syncDeletedMap, toString, updatePartition`

Methods inherited from class com.sun.labs.minion.indexer.partition.Partition
`compareTo, getDocFiles, getDocFiles, getIndexConfig, getMainFiles, getMainFiles, getManager, getName, getNumPostingsChannels, getPartitionNumber, getQueryConfig, getStats, newProperties`

Methods inherited from class java.lang.Object
`clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait`

Field Detail

fieldStoreDictFactory

protected DictionaryFactory fieldStoreDictFactory

A factory for the dictionaries that the field store will use for saved field values.

bigramDictFactory

protected DictionaryFactory bigramDictFactory

A factory for bigram dictionaries that will be used by the main dictionary and by the field store.

bigramDict

protected DiskBiGramDictionary bigramDict

Bigrams from the main dictionary.

taxonomy

protected DiskTaxonomy taxonomy

A disk taxonomy, if one exists.

bigramDictOffset

protected long bigramDictOffset

The offset of the bigrams in the main dictionary.

fields

protected DiskFieldStore fields

The field store.

ngrams

protected DiskDictionary ngrams

The ngram dictionary.

bigramDictFile

protected java.io.RandomAccessFile bigramDictFile

The stream for the bigram dictionaries.

bigramPostFile

protected java.io.RandomAccessFile bigramPostFile

The stream for the bigram postings.

fieldDictFile

protected java.io.RandomAccessFile fieldDictFile

The stream for the field store dictionaries.

fieldPostFile

protected java.io.RandomAccessFile fieldPostFile

The stream for the field store postings.

logTag

protected static java.lang.String logTag

Constructor Detail

InvFileDiskPartition

public InvFileDiskPartition(int partNumber,
                            PartitionManager manager,
                            DictionaryFactory mainDictFactory,
                            DictionaryFactory documentDictFactory,
                            DictionaryFactory fieldStoreDictFactory,
                            DictionaryFactory bigramDictFactory,
                            boolean cacheVectorLengths,
                            int termCacheSize)
                     throws java.io.IOException

Opens a partition with a given number.

Parameters:: partNumber - the number of this partition.; manager - the manager for this partition.; mainDictFactory - a factory that will be used to generate the main dictionary for this partition; documentDictFactory - a factory that will be used to generate the document dictionary for this partition; fieldStoreDictFactory - a factory that will be used to generate the dictionaries in the field store; bigramDictFactory - a factory that will be used to generate the bigram dictionaries needed for this partition
Throws:: java.io.IOException - If there is an error opening or reading any of the files making up a partition.
See Also:: Partition, Dictionary

Method Detail

initAll

protected void initAll()
                throws java.io.IOException

Initializes everything all at once.

Overrides:: initAll in class DiskPartition

Throws:: java.io.IOException - if there was an error reading the files

initBigramDict

protected void initBigramDict()

Initializes the bigram dictionary, if necessary.

initFields

protected void initFields()

Initializes the field store, if necessary.

initTaxonomy

protected void initTaxonomy()

Initializes the taxonomy, if one is necessary. A taxonomy is initialized if the manager's indexConfig indicates that one is necessary.

getSavedFieldData

public java.lang.Object getSavedFieldData(java.lang.String name,
                                          java.lang.String key,
                                          boolean all)

Gets some or all of the data saved in a given field, in a given document.

Parameters:: name - The name of the field.; key - The document key of the document for which we want data.; all - If true, all field values will be returned as a list. If false only the first value will be returned.
Returns:: A list of the values saved in the given field in the given document, or null if the given key is not in this partition.

getSavedFieldData

public java.lang.Object getSavedFieldData(java.lang.String name,
                                          int docID,
                                          boolean all)

Gets some or all of the data saved in a given field.

Parameters:: name - The name of the field.; docID - The document ID for which we want the saved data.; all - If true, return all known values for the field in the given document. If false return only one value.
Returns:: If all is true, a List of field values; if all is false, a single field value of the appropriate type.
If the given name is not the name of a saved field, or the document ID is invalid, then if all is true, an empty list will be returned. If all is false, null will be returned.
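The return contract above can be sketched with a small in-memory stand-in. This is only an illustration under stated assumptions: SavedFieldSketch and its save method are hypothetical and are not part of Minion, and the real field store reads from disk rather than from a map.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory stand-in for a field store; not part of Minion.
class SavedFieldSketch {
    // field name -> (document ID -> saved values, in insertion order)
    private final Map<String, Map<Integer, List<Object>>> saved = new HashMap<>();

    // Hypothetical helper used only to populate the sketch.
    void save(String name, int docID, Object value) {
        saved.computeIfAbsent(name, k -> new HashMap<>())
             .computeIfAbsent(docID, k -> new ArrayList<>())
             .add(value);
    }

    // Mirrors the documented contract: a List when all is true (empty if the
    // field or document is unknown), a single value or null when all is false.
    Object getSavedFieldData(String name, int docID, boolean all) {
        Map<Integer, List<Object>> byDoc = saved.get(name);
        List<Object> values = (byDoc == null) ? null : byDoc.get(docID);
        if (all) {
            return (values == null) ? new ArrayList<Object>()
                                    : new ArrayList<Object>(values);
        }
        return (values == null || values.isEmpty()) ? null : values.get(0);
    }

    public static void main(String[] args) {
        SavedFieldSketch store = new SavedFieldSketch();
        store.save("author", 1, "Ann");
        store.save("author", 1, "Bob");
        System.out.println(store.getSavedFieldData("author", 1, true));
        System.out.println(store.getSavedFieldData("author", 1, false));
        System.out.println(store.getSavedFieldData("title", 1, false));
    }
}
```

Because the declared return type is Object, callers have to cast according to the value of all that they passed.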

getSavedFieldData

public java.util.List getSavedFieldData(java.lang.String name,
                                        int docID)

Gets all of the data saved in a given field.

Parameters:: name - The name of the field.; docID - The document ID for which we want the saved data.
Returns:: a List of field values of the appropriate type. If the given name is not the name of a saved field, or the document ID is invalid, then an empty list is returned.

getSavedFieldData

public java.util.List getSavedFieldData(java.lang.String name,
                                        java.lang.String key)

Gets all of the data saved in a given field, in a given document.

Parameters:: name - The name of the field.; key - The document key of the document for which we want data.
Returns:: The field values for the given document, as a List.
If the given name is not the name of a saved field, or the document ID is invalid, then an empty list will be returned.

getSavedFieldData

public java.lang.Object getSavedFieldData(FieldInfo fi,
                                          int docID,
                                          boolean all)

getSavedFields

public java.util.Map<java.lang.String,java.util.List> getSavedFields(int docID)

Gets a map from field name to saved values for all the saved fields in a document.

getFieldIterator

public DictionaryIterator getFieldIterator(java.lang.String name)

Gets an iterator for all of the values in a field.

Parameters:: name - The name of the field we need an iterator for.
Returns:: An iterator for the values in the field.

getFieldIterator

public DictionaryIterator getFieldIterator(java.lang.String name,
                                           boolean caseSensitive,
                                           java.lang.Object lowerBound,
                                           boolean includeLower,
                                           java.lang.Object upperBound,
                                           boolean includeUpper)

Gets an iterator for the values in a given range in a field.

Parameters:: name - The name of the field we need an iterator for.; caseSensitive - If true, case should be taken into account when iterating through the values. This value will only be observed for character fields!; lowerBound - The lower bound on the iterator. If null, only the upper bound is considered and the iteration will commence with the first term in the dictionary.; includeLower - If true, then the lower bound will be included in the entries returned by the iterator, if it occurs in the dictionary.; upperBound - The upper bound on the iterator. If null, only the lower bound is considered and the iteration will end at the last term in the dictionary.; includeUpper - If true, then the upper bound will be included in the entries returned by the iterator, if it occurs in the dictionary.
Returns:: An iterator for the dictionary entries contained in the range, or null if there is no such range or the named field is not a saved field.
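The bound handling described above can be sketched with a sorted set from the standard library. This is a minimal sketch, not the Minion implementation: RangeIteratorSketch is a hypothetical class, it assumes String values in natural order, it ignores caseSensitive, and it treats a lower bound greater than the upper bound as "no such range".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

// Hypothetical illustration of range iteration with optional bounds.
class RangeIteratorSketch {
    static Iterator<String> rangeIterator(NavigableSet<String> dict,
                                          String lowerBound, boolean includeLower,
                                          String upperBound, boolean includeUpper) {
        NavigableSet<String> view = dict;
        if (lowerBound != null && upperBound != null) {
            if (lowerBound.compareTo(upperBound) > 0) {
                return null; // assumption: an inverted range means "no such range"
            }
            view = dict.subSet(lowerBound, includeLower, upperBound, includeUpper);
        } else if (lowerBound != null) {
            // Null upper bound: run to the last value in the dictionary.
            view = dict.tailSet(lowerBound, includeLower);
        } else if (upperBound != null) {
            // Null lower bound: start from the first value in the dictionary.
            view = dict.headSet(upperBound, includeUpper);
        }
        return view.iterator();
    }

    // Drains an iterator into a list, for printing and checking.
    static List<String> collect(Iterator<String> it) {
        List<String> out = new ArrayList<>();
        while (it.hasNext()) {
            out.add(it.next());
        }
        return out;
    }

    public static void main(String[] args) {
        NavigableSet<String> dict =
            new TreeSet<>(Arrays.asList("apple", "banana", "cherry", "date"));
        System.out.println(collect(rangeIterator(dict, "banana", true, "date", false)));
        System.out.println(collect(rangeIterator(dict, null, false, "banana", true)));
    }
}
```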

getMatchingIterator

public DictionaryIterator getMatchingIterator(java.lang.String name,
                                              java.lang.String val,
                                              boolean caseSensitive)

Gets an iterator for the character saved field values that match a given wildcard pattern.

Parameters:: name - The name of the field whose values we wish to match against.; val - The wildcard value against which we will match.; caseSensitive - If true, then case will be taken into account during the match.

getSubstringIterator

public DictionaryIterator getSubstringIterator(java.lang.String name,
                                               java.lang.String val,
                                               boolean caseSensitive,
                                               boolean starts,
                                               boolean ends)

Gets an iterator for the character saved field values that contain a given substring.

Parameters:: name - The name of the field whose values we wish to match against.; val - The substring against which we will match.; caseSensitive - If true, then case will be taken into account during the match.

getFieldPostings

public PostingsIterator getFieldPostings(java.lang.String name,
                                         java.lang.Object value,
                                         boolean caseSensitive)

Gets the postings associated with a particular field value.

Parameters:: name - The name of the field for which we want postings.; value - The value from the field for which we want postings.; caseSensitive - If true, case should be taken into account when iterating through the values. This value will only be observed for character fields!
Returns:: The postings associated with that value, or null if there is no such value in the field.

getFieldCount

public int getFieldCount()

Gets the number of defined fields.

getFieldStore

public DiskFieldStore getFieldStore()

Gets the field store associated with this partition.

getFieldSize

public int getFieldSize(java.lang.String name)

getSubsumed

public java.util.Set getSubsumed(java.lang.String name)

Gets the entries subsumed by a given name.

Parameters:: name - the name for which we want subsumed entries.
Returns:: the subsumed entries, or null if this name is not in the main dictionary.

euclideanDistance

public double[] euclideanDistance(double[] vec,
                                  java.lang.String field)

Computes the Euclidean distance between the given document and all documents. The distance is based on the features stored in the saved field with the given name.
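For reference, the distance between a query vector v and a document's feature vector x is sqrt(sum over i of (v[i] - x[i])^2). The sketch below computes that for every document. It is an illustration only: EuclideanSketch is a hypothetical class, and it takes the per-document features as an in-memory array, whereas the method above reads them from the named saved field. All vectors are assumed to have the same length.

```java
import java.util.Arrays;

// Hypothetical illustration of the distance computation.
class EuclideanSketch {
    // Returns one distance per row of docFeatures, in row order.
    static double[] euclideanDistance(double[] vec, double[][] docFeatures) {
        double[] dist = new double[docFeatures.length];
        for (int d = 0; d < docFeatures.length; d++) {
            double sum = 0.0;
            for (int i = 0; i < vec.length; i++) {
                double diff = vec[i] - docFeatures[d][i];
                sum += diff * diff;
            }
            dist[d] = Math.sqrt(sum);
        }
        return dist;
    }

    public static void main(String[] args) {
        double[][] docs = { {3.0, 4.0}, {0.0, 0.0}, {1.0, 0.0} };
        System.out.println(Arrays.toString(euclideanDistance(new double[] {0.0, 0.0}, docs)));
    }
}
```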

mergeCustom

protected void mergeCustom(int newPartNumber,
                           DiskPartition[] sortedParts,
                           int[][] idMaps,
                           int newMaxDocID,
                           int[] docIDStart,
                           int[] nUndel,
                           int[][] docIDMaps)
                    throws java.lang.Exception

Description copied from class: DiskPartition

Provides a place to merge data that is specific to a subclass of disk partition. This method will be called after the disk partition data is merged, but inside the try block for the whole merge.

Overrides:: mergeCustom in class DiskPartition

Parameters:: newPartNumber - the number of the new partition; sortedParts - the sorted list of partitions; idMaps - a set of maps from old entry ids in the main dictionary to new entry ids in the merged dictionary; newMaxDocID - the new maximum document id; docIDStart - the starting doc ids; nUndel - the number of undeleted documents in each partition; docIDMaps - doc id maps (see merge)
Throws:: java.lang.Exception
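The role of idMaps can be illustrated with a small remapping routine. This is a sketch under an assumed layout, namely that idMaps[p][oldID] holds the new entry id for old entry oldID of the p-th sorted partition; the documentation above does not spell out the layout, and IdMapSketch is a hypothetical class.

```java
import java.util.Arrays;

// Hypothetical illustration of applying an old-to-new entry id map.
class IdMapSketch {
    // Assumed layout: idMaps[part][oldId] == new id in the merged dictionary.
    static int[] remap(int[][] idMaps, int part, int[] oldIds) {
        int[] newIds = new int[oldIds.length];
        for (int i = 0; i < oldIds.length; i++) {
            newIds[i] = idMaps[part][oldIds[i]];
        }
        return newIds;
    }

    public static void main(String[] args) {
        int[][] idMaps = { {0, 1, 3}, {0, 2, 3} };
        System.out.println(Arrays.toString(remap(idMaps, 1, new int[] {1, 2})));
    }
}
```

A subclass merging its own dictionaries would need some translation of this kind wherever its data refers to main dictionary entries.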

close

public boolean close(long currTime)

Close the files associated with this partition.

Specified by:: close in interface Closeable
Overrides:: close in class DiskPartition

Parameters:: currTime - the current time
Returns:: true if the partition was closed, false otherwise.

getMatching

public QueryEntry[] getMatching(java.lang.String pat,
                                boolean caseSensitive,
                                int maxEntries,
                                long timeLimit)

Gets the entries matching the given pattern.

Parameters:: pat - The pattern to match entries against.; caseSensitive - If true, then do the lookup in a case sensitive fashion.; maxEntries - The maximum number of entries to return. If zero or negative, return all possible entries.; timeLimit - The maximum amount of time (in milliseconds) to spend trying to find matches. If zero or negative, no time limit is imposed.
Returns:: An array of entries containing the matching entries, or null if there are no such entries, or an array of length zero if the operation timed out before any entries could be matched.
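The maxEntries, timeLimit, and return-value rules above can be sketched over a plain list of terms. This is a minimal sketch, not the Minion lookup: BoundedMatchSketch is a hypothetical class, it scans linearly instead of using a dictionary, and it assumes a wildcard syntax in which * matches any run of characters and ? matches exactly one.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration of a bounded wildcard lookup.
class BoundedMatchSketch {
    // Assumed wildcard syntax: '*' = any run of characters, '?' = exactly one.
    static Pattern toRegex(String pat, boolean caseSensitive) {
        StringBuilder sb = new StringBuilder();
        for (char c : pat.toCharArray()) {
            if (c == '*') {
                sb.append(".*");
            } else if (c == '?') {
                sb.append('.');
            } else {
                sb.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(sb.toString(),
                               caseSensitive ? 0 : Pattern.CASE_INSENSITIVE);
    }

    static String[] getMatching(List<String> terms, String pat,
                                boolean caseSensitive, int maxEntries, long timeLimit) {
        Pattern p = toRegex(pat, caseSensitive);
        long start = System.currentTimeMillis();
        List<String> hits = new ArrayList<>();
        boolean timedOut = false;
        for (String t : terms) {
            // A zero or negative timeLimit means no limit.
            if (timeLimit > 0 && System.currentTimeMillis() - start > timeLimit) {
                timedOut = true;
                break;
            }
            if (p.matcher(t).matches()) {
                hits.add(t);
                // A zero or negative maxEntries means return everything.
                if (maxEntries > 0 && hits.size() >= maxEntries) {
                    break;
                }
            }
        }
        if (hits.isEmpty()) {
            // Empty array on timeout, null when nothing matches.
            return timedOut ? new String[0] : null;
        }
        return hits.toArray(new String[0]);
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("cat", "car", "dog", "Cart");
        System.out.println(Arrays.toString(getMatching(terms, "ca*", false, 2, 0L)));
    }
}
```

Callers therefore need to distinguish three outcomes: a non-empty array, a zero-length array (timed out), and null (no matches).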

getSpellingVariants

public QueryEntry[] getSpellingVariants(java.lang.String pat,
                                        boolean caseSensitive,
                                        int maxEntries,
                                        long timeLimit)

Gets the spelling variants of a term.

Parameters:: pat - The pattern to match entries against.; caseSensitive - If true, then do the lookup in a case sensitive fashion.; maxEntries - The maximum number of entries to return. If zero or negative, return all possible entries.; timeLimit - The maximum amount of time (in milliseconds) to spend trying to find matches. If zero or negative, no time limit is imposed.
Returns:: An array of entries containing the spelling variants, or null if there are no such entries, or an array of length zero if the operation timed out before any entries could be matched.

getSubstring

public QueryEntry[] getSubstring(java.lang.String pat,
                                 boolean caseSensitive,
                                 int maxEntries,
                                 long timeLimit)

Gets the entries containing the given substring.

Parameters:: pat - The pattern to match entries against.; caseSensitive - If true, then do the lookup in a case sensitive fashion.; maxEntries - The maximum number of entries to return. If zero or negative, return all possible entries.; timeLimit - The maximum amount of time (in milliseconds) to spend trying to find matches. If zero or negative, no time limit is imposed.
Returns:: An array of QueryEntry objects containing the matching entries, or null if there are no such entries, or an array of length zero if the operation timed out before any entries could be matched.
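One common way for character bigrams to support substring lookups of this kind is to use them as a filter: only terms that contain every bigram of the substring can contain the substring itself. The sketch below shows that general technique. It is an assumption about how a bigram dictionary might be used, not a description of Minion's implementation, and BigramSketch and its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration of bigram-filtered substring search.
class BigramSketch {
    // All distinct two-character windows of a term.
    static Set<String> bigrams(String term) {
        Set<String> out = new LinkedHashSet<>();
        for (int i = 0; i + 2 <= term.length(); i++) {
            out.add(term.substring(i, i + 2));
        }
        return out;
    }

    // bigram -> set of terms that contain it
    static Map<String, Set<String>> buildIndex(Collection<String> terms) {
        Map<String, Set<String>> index = new HashMap<>();
        for (String t : terms) {
            for (String bg : bigrams(t)) {
                index.computeIfAbsent(bg, k -> new TreeSet<>()).add(t);
            }
        }
        return index;
    }

    // Returns the matching terms in sorted order.
    static List<String> substringMatches(Collection<String> terms,
                                         Map<String, Set<String>> index, String sub) {
        Set<String> candidates = null;
        for (String bg : bigrams(sub)) {
            Set<String> withBigram = index.getOrDefault(bg, Collections.emptySet());
            if (candidates == null) {
                candidates = new TreeSet<>(withBigram);
            } else {
                candidates.retainAll(withBigram);
            }
        }
        if (candidates == null) {
            candidates = new TreeSet<>(terms); // substring too short for bigrams
        }
        // Bigrams only narrow the search; verify each surviving candidate.
        List<String> out = new ArrayList<>();
        for (String t : candidates) {
            if (t.contains(sub)) {
                out.add(t);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("partition", "particle", "article", "merge");
        System.out.println(substringMatches(terms, buildIndex(terms), "arti"));
    }
}
```

The final contains check matters because sharing all bigrams does not guarantee that they occur contiguously and in order.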

getStemMatches

public QueryEntry[] getStemMatches(java.lang.String term,
                                   boolean caseSensitive,
                                   int maxEntries,
                                   long timeLimit)

Gets the entries that match the stem of the given term. Uses the default minimum length and match cutoff values.

Parameters:: term - The term we want to get variants of.; caseSensitive - If true, then do the lookup in a case sensitive fashion.; maxEntries - The maximum number of entries to return. If zero or negative, return all possible entries.; timeLimit - The maximum amount of time (in milliseconds) to spend trying to find matches. If zero or negative, no time limit is imposed.
Returns:: An array of QueryEntry objects containing the matching entries, or null if there are no such entries.

getStemMatches

public QueryEntry[] getStemMatches(java.lang.String term,
                                   boolean caseSensitive,
                                   int minLen,
                                   float matchCutOff,
                                   int maxEntries,
                                   long timeLimit)

Gets the entries that match the stem of the given term.

Parameters:: term - The term we want to get variants of.; caseSensitive - If true, then do the lookup in a case sensitive fashion.; minLen - The minimum term length for stemming.; matchCutOff - The cutoff score for matching variants and the original term.; maxEntries - The maximum number of entries to return. If zero or negative, return all possible entries.; timeLimit - The maximum amount of time (in milliseconds) to spend trying to find matches. If zero or negative, no time limit is imposed.
Returns:: An array of QueryEntry objects containing the matching entries, or null if there are no such entries.

getFieldFiles

protected java.io.File[] getFieldFiles()

Gets the files associated with the field store for a partition.

Returns:: an array of files. The first is for the dictionary, and the remaining are for the postings files.

getAllFiles

protected java.io.File[] getAllFiles()

Gets all the files associated with a partition, including those specific to the inverted file.

Overrides:: getAllFiles in class Partition

Returns:: an array of files

getAllFiles

protected static java.io.File[] getAllFiles(PartitionManager manager,
                                            int partNumber)

Gets all the files associated with a partition, including those specific to the inverted file.

Returns:: an array of files

reap

protected static void reap(PartitionManager m,
                           int n)

Reaps the given partition. If the postings file cannot be removed, then we return control immediately.

Parameters:: m - The manager associated with the partition.; n - The partition number to reap.

getBigramFiles

protected java.io.File[] getBigramFiles()

Gets the files associated with the bigram postings for a partition.

Returns:: an array of files. The first is for the dictionary, and the remaining are for the postings files.

getTaxonomy

public DiskTaxonomy getTaxonomy()

Returns:: the taxonomy.

export

public void export(java.io.PrintWriter o)

Exports the data in this partition to an XML file format.

Parameters:: o - the writer to which the data will be output.
