java.lang.Object
com.sun.labs.minion.indexer.dictionary.DiskDictionary
public class DiskDictionary
A base class for all classes that implement dictionaries for use during querying.
| Nested Class Summary | |
|---|---|
class |
DiskDictionary.DiskDictionaryIterator
A class that can be used as an iterator for this block. |
protected class |
DiskDictionary.HE
A class that will act as a heap entry for merging. |
class |
DiskDictionary.LightDiskDictionaryIterator
A lightweight iterator for this dictionary. |
class |
DiskDictionary.LookupState
A class that can be used to encapsulate the dictionary state when doing multiple lookups during querying. |
| Field Summary | |
|---|---|
static int |
CHANNEL_FULL_POST
An integer indicating that we should use channel postings inputs. |
static int |
CHANNEL_PART_POST
Use a file channel and partially load postings. |
protected NameDecoder |
decoder
A decoder for the names in this dictionary. |
protected DictionaryHeader |
dh
The header for the dictionary. |
protected java.io.RandomAccessFile |
dictFile
The dictionary file. |
protected java.lang.Class |
entryClass
The type of entry that we contain. |
protected ReadableBuffer |
entryInfo
The information for the entries. |
protected ReadableBuffer |
entryInfoOffsets
The offsets for the entry information. |
static int |
FILE_FULL_POST
Use a random access file and fully load postings. |
static int |
FILE_PART_POST
Use a random access file and partially load postings. |
protected ReadableBuffer |
idToPosn
The map from entry IDs to positions in the dictionary. |
protected static java.lang.String |
logTag
The tag for this module. |
protected LRACache<java.lang.Object,QueryEntry> |
nameCache
A cache from entry name to a query entry. |
protected ReadableBuffer |
nameOffsets
The offsets of the names of the uncompressed entries. |
protected ReadableBuffer |
names
The entry names. |
int |
nLoads
|
protected Partition |
part
The partition that we are associated with. |
protected LRACache<java.lang.Integer,java.lang.Object> |
posnCache
A cache from position to entry names used during binary searches for terms. |
protected java.io.RandomAccessFile[] |
postFiles
The postings files. |
protected PostingsInput[] |
postIn
Our postings inputs. |
long |
totalSize
|
| Constructor Summary | |
|---|---|
protected |
DiskDictionary()
Creates a dictionary. |
|
DiskDictionary(java.lang.Class entryClass,
NameDecoder decoder,
java.io.RandomAccessFile dictFile,
java.io.RandomAccessFile[] postFiles)
Creates a disk dictionary that we can use for querying. |
|
DiskDictionary(java.lang.Class entryClass,
NameDecoder decoder,
java.io.RandomAccessFile dictFile,
java.io.RandomAccessFile[] postFiles,
int postInType,
int cacheSize,
int nameBufferSize,
int offsetsBufferSize,
int infoBufferSize,
int infoOffsetsBufferSize,
Partition part)
Creates a disk dictionary that we can use for querying. |
|
DiskDictionary(java.lang.Class entryClass,
NameDecoder decoder,
java.io.RandomAccessFile dictFile,
java.io.RandomAccessFile[] postFiles,
int postInType,
Partition part)
Creates a disk dictionary that we can use for querying. |
|
DiskDictionary(java.lang.Class entryClass,
NameDecoder decoder,
java.io.RandomAccessFile dictFile,
java.io.RandomAccessFile[] postFiles,
Partition part)
Creates a disk dictionary that we can use for querying. |
| Method Summary | |
|---|---|
protected void |
customSetup(IndexEntry me,
QueryEntry e,
int start,
int[] postIDMap)
Do any custom setup required for merging one entry onto another. |
protected QueryEntry |
find(int posn,
DiskDictionary.LookupState lus)
Finds the entry at the given position in this dictionary. |
protected int |
findPos(java.lang.Object key,
DiskDictionary.LookupState lus)
Determines the position at which an entry falls. |
protected int |
findPos(java.lang.Object key,
DiskDictionary.LookupState lus,
boolean partial)
Determines the position within this block at which an entry falls. |
QueryEntry |
get(int id)
Gets an entry from the dictionary, given the ID for the entry. |
protected QueryEntry |
get(int id,
DiskDictionary.LookupState lus)
Gets an entry from the dictionary, given the ID for the entry. |
QueryEntry |
get(java.lang.Object name)
Gets an entry from the dictionary, given the name for the entry. |
QueryEntry |
get(java.lang.Object name,
DiskDictionary.LookupState lus)
Gets an entry from the dictionary, given the name for the entry. |
protected PostingsInput[] |
getBufferedInputs()
Gets a buffered version of the postings inputs for this dictionary so that we can stream a bit better through the postings when doing, for example, dictionary merges. |
protected PostingsInput[] |
getBufferedInputs(int buffSize)
Gets a buffered version of the postings inputs for this dictionary so that we can stream a bit better through the postings when doing, for example, dictionary merges. |
DiskDictionary.LookupState |
getLookupState()
|
QueryEntry[] |
getMatching(DiskBiGramDictionary biDict,
java.lang.String pat,
boolean caseSensitive,
int maxEntries,
long timeLimit)
Gets the entries matching the given pattern from the given dictionary. |
int |
getMaxID()
Gets the maximum ID in the dictionary. |
Partition |
getPartition()
Gets the partition to which this dictionary belongs. |
QueryEntry[] |
getSpellingVariants(DiskBiGramDictionary biDict,
java.lang.String word,
boolean caseSensitive,
int maxEntries,
long timeLimit)
Gets the list of possible spelling corrections, based on terms in the index, for the string that is passed in. |
QueryEntry[] |
getStemMatches(DiskBiGramDictionary biDict,
java.lang.String term,
boolean caseSensitive,
int minLen,
float matchCutOff,
int maxEntries,
long timeLimit)
Gets a set of all the entries with the given stem. |
QueryEntry[] |
getSubstring(DiskBiGramDictionary biDict,
java.lang.String substring,
boolean caseSensitive,
boolean starts,
boolean ends,
int maxEntries,
long timeLimit)
Gets the entries matching the given pattern from the given dictionary. |
DictionaryIterator |
iterator()
Gets an iterator for this dictionary. |
DictionaryIterator |
iterator(int begin,
int end)
Gets an iterator for the dictionary that starts and stops at the given indices in the dictionary. |
DictionaryIterator |
iterator(java.lang.Object startEntry,
boolean includeStart)
Creates an iterator that starts iterating at the specified entry, or, if the entry does not exist in the block, starts iterating at the first entry greater than the provided entry. |
DictionaryIterator |
iterator(java.lang.Object startEntry,
boolean includeStart,
java.lang.Object stopEntry,
boolean includeStop)
Creates an iterator that starts iterating at the specified startEntry and stops iterating at the specified stopEntry. |
LightIterator |
literator()
|
int[][] |
merge(IndexEntry entryFactory,
NameEncoder encoder,
DiskDictionary[] dicts,
EntryMapper[] mappers,
int[] starts,
int[][] postIDMaps,
java.io.RandomAccessFile mDictFile,
PostingsOutput[] postOut,
boolean appendPostings)
Merges a number of dictionaries into a single dictionary. |
int[][] |
merge(IndexEntry entryFactory,
NameEncoder encoder,
PartitionStats partStats,
DiskDictionary[] dicts,
EntryMapper[] mappers,
int[] starts,
int[][] postIDMaps,
java.io.RandomAccessFile mDictFile,
PostingsOutput[] postOut,
boolean appendPostings)
Merges a number of dictionaries into a single dictionary. |
QueryEntry |
newEntry(java.lang.Object name)
Gets an instance of the kind of entries stored in this dictionary. |
protected QueryEntry |
newEntry(java.lang.Object name,
int posn,
DiskDictionary.LookupState lus,
PostingsInput[] postIn)
Creates a new entry and fills in its information. |
IndexEntry |
put(java.lang.Object name,
IndexEntry t)
Puts an entry into the dictionary. |
void |
remapPostings(IndexEntry entryFactory,
NameEncoder encoder,
PartitionStats partStats,
int[] postMap,
java.io.RandomAccessFile dictFile,
PostingsOutput[] postOut)
Rewrites this dictionary to the files passed in while remapping IDs in the postings to the new IDs passed in. |
void |
setCacheSize(int s)
Sets the sizes of the name and position cache. |
void |
setPartition(Partition p)
Sets the partition with which this dictionary is associated. |
protected void |
setUpBuffers(int nameBufferSize,
int offsetsBufferSize,
int infoBufferSize,
int infoOffsetsBufferSize)
|
int |
size()
Gets the size of the dictionary. |
| Methods inherited from class java.lang.Object |
|---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
| Field Detail |
|---|
public long totalSize
public int nLoads
protected DictionaryHeader dh
protected java.lang.Class entryClass
protected ReadableBuffer idToPosn
protected ReadableBuffer names
protected ReadableBuffer nameOffsets
protected ReadableBuffer entryInfo
protected ReadableBuffer entryInfoOffsets
protected LRACache<java.lang.Integer,java.lang.Object> posnCache
protected LRACache<java.lang.Object,QueryEntry> nameCache
protected NameDecoder decoder
protected java.io.RandomAccessFile dictFile
protected java.io.RandomAccessFile[] postFiles
protected PostingsInput[] postIn
protected Partition part
protected static java.lang.String logTag
public static final int CHANNEL_FULL_POST
public static final int CHANNEL_PART_POST
public static final int FILE_FULL_POST
public static final int FILE_PART_POST
| Constructor Detail |
|---|
protected DiskDictionary()
public DiskDictionary(java.lang.Class entryClass,
NameDecoder decoder,
java.io.RandomAccessFile dictFile,
java.io.RandomAccessFile[] postFiles)
throws java.io.IOException
entryClass - The class of the entries that the dictionary contains.
decoder - A decoder for the names in this dictionary.
dictFile - The file containing the dictionary.
postFiles - The files containing the postings associated with the entries in this dictionary.
java.io.IOException - if there is any error opening the dictionary
public DiskDictionary(java.lang.Class entryClass,
NameDecoder decoder,
java.io.RandomAccessFile dictFile,
java.io.RandomAccessFile[] postFiles,
Partition part)
throws java.io.IOException
entryClass - The class of the entries that the dictionary contains.
decoder - A decoder for the names in this dictionary.
dictFile - The file containing the dictionary.
postFiles - The files containing the postings associated with the entries in this dictionary.
part - The partition with which this dictionary is associated.
java.io.IOException - if there is any error opening the dictionary
public DiskDictionary(java.lang.Class entryClass,
NameDecoder decoder,
java.io.RandomAccessFile dictFile,
java.io.RandomAccessFile[] postFiles,
int postInType,
Partition part)
throws java.io.IOException
entryClass - The class of the entries that the dictionary contains.
decoder - A decoder for the names in this dictionary.
dictFile - The file containing the dictionary.
postFiles - The files containing the postings associated with the entries in this dictionary.
postInType - The type of postings input to use.
part - The partition with which this dictionary is associated.
java.io.IOException - if there is any error opening the dictionary
public DiskDictionary(java.lang.Class entryClass,
NameDecoder decoder,
java.io.RandomAccessFile dictFile,
java.io.RandomAccessFile[] postFiles,
int postInType,
int cacheSize,
int nameBufferSize,
int offsetsBufferSize,
int infoBufferSize,
int infoOffsetsBufferSize,
Partition part)
throws java.io.IOException
nameBufferSize - the size of the buffer (in bytes) to use for the entry names
offsetsBufferSize - the size of the buffer (in bytes) to use for the entry name offsets
infoBufferSize - the size of the buffer (in bytes) to use for the entry information
infoOffsetsBufferSize - the size of the buffer (in bytes) to use for the entry information offsets
entryClass - The class of the entries that the dictionary contains.
decoder - A decoder for the names in this dictionary.
dictFile - The file containing the dictionary.
postFiles - The files containing the postings associated with the entries in this dictionary.
postInType - The type of postings input to use.
cacheSize - The number of entries to use in the name and position caches.
part - The partition with which this dictionary is associated.
java.io.IOException - if there is any error opening the dictionary
| Method Detail |
|---|
protected void setUpBuffers(int nameBufferSize,
int offsetsBufferSize,
int infoBufferSize,
int infoOffsetsBufferSize)
throws java.io.IOException
java.io.IOException
public void setCacheSize(int s)
s - The size of the caches to use.
public int getMaxID()
public IndexEntry put(java.lang.Object name,
IndexEntry t)
put in interface Dictionary
name - the name of the entry to put in the dictionary
t - The entry to put in the dictionary.
null
public DiskDictionary.LookupState getLookupState()
public QueryEntry get(java.lang.Object name)
get in interface Dictionary
name - The name of the entry to get.
null if the name doesn't appear in the dictionary.
public QueryEntry get(java.lang.Object name,
DiskDictionary.LookupState lus)
name - The name of the entry to get.
lus - a lookup state for this dictionary. A lookup state can be re-used when doing multiple lookups to save time. If this parameter is null, a lookup state will be generated for each lookup.
null if the name doesn't appear in the dictionary.
public QueryEntry get(int id)
id - the ID to find.
null if the ID doesn't occur in our dictionary.
protected QueryEntry get(int id,
DiskDictionary.LookupState lus)
id - the ID to find.
lus - the current lookup state
null if the ID doesn't occur in our dictionary.
protected int findPos(java.lang.Object key,
DiskDictionary.LookupState lus)
key - the name of the entry to find
lus - a lookup state that carries around copies of the buffers holding the dictionary data
(-(location) -1). location is defined as the location of the first entry in the block "greater than" the given entry. If all entries are "less than" the given entry, the size of this block will be returned. Note that this guarantees that the return value will be >= 0 if and only if the given entry is found in the block.
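The encoding described above, a non-negative position on a hit and (-(location) - 1) on a miss, is the same convention that `java.util.Arrays.binarySearch` uses, so it can be illustrated without the Minion classes. The following is only a sketch of how a caller might decode such a result, not the `DiskDictionary` implementation; `FindPosDemo` and `insertionPoint` are hypothetical names, and a sorted `String[]` stands in for the dictionary block.

```java
import java.util.Arrays;

final class FindPosDemo {
    /**
     * Decodes a findPos-style result: a non-negative value is the position
     * of the entry itself, a negative value encodes -(location) - 1.
     */
    static int insertionPoint(int pos) {
        return pos >= 0 ? pos : -(pos + 1);
    }

    public static void main(String[] args) {
        // Stand-in for a sorted block of entry names (an assumption for illustration).
        String[] names = {"apple", "banana", "cherry"};

        int found = Arrays.binarySearch(names, "banana");      // present: result >= 0
        int missing = Arrays.binarySearch(names, "blueberry"); // absent: result < 0
        int pastEnd = Arrays.binarySearch(names, "zebra");     // absent, greater than all

        System.out.println("found=" + found);
        System.out.println("missing decodes to " + insertionPoint(missing));
        System.out.println("pastEnd decodes to " + insertionPoint(pastEnd));
    }
}
```

Because the miss encoding is always negative, a caller can test `pos >= 0` to decide whether the entry exists and still recover where it would have been.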
protected int findPos(java.lang.Object key,
DiskDictionary.LookupState lus,
boolean partial)
key - the name of the entry to find
lus - a lookup state variable that contains local copies of the dictionary's buffers
partial - if true, treat key as a stem and return as soon as a partial match (one that begins with the stem) is found
(-(location) -1). location is defined as the location of the first entry in the block "greater than" the given entry. If all entries are "less than" the given entry, the size of this block will be returned. Note that this guarantees that the return value will be >= 0 if and only if the given entry is found in the block.
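The effect of the partial flag can be sketched with a plain binary search over sorted strings. This is an illustration under assumptions, not the library's code: `PartialFindDemo` is a hypothetical class, names are assumed to be `String`s in natural order, and the search returns at the first probed entry that begins with the key, which is not necessarily the first such entry in sort order.

```java
final class PartialFindDemo {
    /**
     * Binary search over sorted names. Returns the position of a match, or
     * -(location) - 1 when nothing matches. With partial == true, any probed
     * entry that starts with key counts as a match.
     */
    static int findPos(String[] names, String key, boolean partial) {
        int lo = 0;
        int hi = names.length - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            String probe = names[mid];
            if (partial && probe.startsWith(key)) {
                return mid; // stop early on a stem match
            }
            int cmp = probe.compareTo(key);
            if (cmp == 0) {
                return mid;
            } else if (cmp < 0) {
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        return -lo - 1; // lo is the location of the first entry greater than key
    }

    public static void main(String[] args) {
        String[] names = {"apple", "banana", "band", "cherry"};
        int exact = findPos(names, "ban", false);
        int stem = findPos(names, "ban", true);
        System.out.println("exact lookup: " + exact);
        System.out.println("stem lookup hit: " + names[stem]);
    }
}
```

Entries sharing a prefix are contiguous in sorted order and begin at the key's insertion point, which the search always probes before giving up, so a stem match is found whenever one exists.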
protected QueryEntry find(int posn,
DiskDictionary.LookupState lus)
lus - the current state of the lookup
null if there is no entry at that position in this block.
protected QueryEntry newEntry(java.lang.Object name,
int posn,
DiskDictionary.LookupState lus,
PostingsInput[] postIn)
name - The name of the entry to be filled.
posn - The position of this entry in the dictionary.
lus - A lookup state containing copies of the dictionary's data
postIn - The postings channels to use for reading postings.
public QueryEntry[] getStemMatches(DiskBiGramDictionary biDict,
java.lang.String term,
boolean caseSensitive,
int minLen,
float matchCutOff,
int maxEntries,
long timeLimit)
biDict - The bigram dictionary to use for the lookup.
term - the stem to look for
caseSensitive - If true, then we should only return entries that match the case of the pattern.
minLen - The minimum length that we'll consider for a stem.
matchCutOff - the cutoff score for matching the variants to the original entry
maxEntries - the maximum number of entries to provide; returns all entries if maxEntries is non-positive
timeLimit - The maximum amount of time (in milliseconds) to spend trying to find matches. If zero or negative, no time limit is imposed.
public QueryEntry[] getMatching(DiskBiGramDictionary biDict,
java.lang.String pat,
boolean caseSensitive,
int maxEntries,
long timeLimit)
biDict - The bigrams to use to do the candidate entry selection.
pat - The pattern to match entries against.
caseSensitive - If true, then we should only return entries that match the case of the pattern.
maxEntries - The maximum number of entries to return. If zero or negative, return all possible entries.
timeLimit - The maximum amount of time (in milliseconds) to spend trying to find matches. If zero or negative, no time limit is imposed.
Entry objects containing the matching entries, or null if there are no such entries, or an array of length zero if the operation timed out before any entries could be matched
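The maxEntries and timeLimit rules, together with the null versus zero-length distinction in the result, can be sketched as follows. This is a minimal illustration under assumptions: `LimitedMatchDemo` and `collect` are hypothetical, a simple prefix test stands in for real pattern matching, and no bigram dictionary is involved.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class LimitedMatchDemo {
    /**
     * Collects candidates that start with prefix (a stand-in for pattern
     * matching). Non-positive maxEntries means "no cap"; non-positive
     * timeLimit means "no time limit".
     */
    static String[] collect(List<String> candidates, String prefix,
                            int maxEntries, long timeLimit) {
        long start = System.currentTimeMillis();
        List<String> hits = new ArrayList<>();
        for (String candidate : candidates) {
            if (timeLimit > 0 && System.currentTimeMillis() - start > timeLimit) {
                // Timed out: return what we have, possibly a zero-length array.
                return hits.toArray(new String[0]);
            }
            if (candidate.startsWith(prefix)) {
                hits.add(candidate);
                if (maxEntries > 0 && hits.size() >= maxEntries) {
                    break; // cap reached
                }
            }
        }
        // Finished without timing out: null signals "no such entries".
        return hits.isEmpty() ? null : hits.toArray(new String[0]);
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("car", "cat", "dog");
        System.out.println(Arrays.toString(collect(names, "ca", 0, 0)));
        System.out.println(Arrays.toString(collect(names, "ca", 1, 0)));
        System.out.println(Arrays.toString(collect(names, "x", 0, 0)));
    }
}
```

Callers of an API with this contract need to check for null before checking the array length, since the two outcomes mean different things.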
public QueryEntry[] getSpellingVariants(DiskBiGramDictionary biDict,
java.lang.String word,
boolean caseSensitive,
int maxEntries,
long timeLimit)
biDict - The bigrams to use to do the candidate entry selection.
word - the word to find alternates for
caseSensitive - If true, then we should only return entries that match the case of the pattern.
maxEntries - The maximum number of entries to return. If zero or negative, return all possible entries.
timeLimit - The maximum amount of time (in milliseconds) to spend trying to find matches. If zero or negative, no time limit is imposed.
Entry objects containing the matching entries, or null if there are no such entries, or an array of length zero if the operation timed out before any entries could be matched
public QueryEntry[] getSubstring(DiskBiGramDictionary biDict,
java.lang.String substring,
boolean caseSensitive,
boolean starts,
boolean ends,
int maxEntries,
long timeLimit)
biDict - A dictionary of bigrams built from the entries that are in this dictionary.
substring - The substring to find in the entries.
caseSensitive - If true, then we should look for matches that match the case of the letters in the substring.
starts - If true, the value must start with the given substring.
ends - If true, the value must end with the given substring.
maxEntries - The maximum number of entries to return. If zero or negative, return all possible entries.
timeLimit - The maximum amount of time (in milliseconds) to spend trying to find matches. If zero or negative, no time limit is imposed.
Entry objects containing the matching entries, or null if there are no such entries, or an array of length zero if the operation timed out before any entries could be matched
public int size()
size in interface Dictionary
public QueryEntry newEntry(java.lang.Object name)
name - The name of the entry.
null if there is some exception thrown while instantiating the entry. Any such exceptions will be logged as errors.
public Partition getPartition()
getPartition in interface Dictionary
public void setPartition(Partition p)
p - the partition with which the dictionary is associated
protected PostingsInput[] getBufferedInputs()
protected PostingsInput[] getBufferedInputs(int buffSize)
buffSize - the size of the buffer to use
public DictionaryIterator iterator()
iterator in interface Dictionary
iterator in interface java.lang.Iterable<QueryEntry>
Map.Entry interface
public LightIterator literator()
public DictionaryIterator iterator(java.lang.Object startEntry,
boolean includeStart)
startEntry - the name of the entry to start iterating at
includeStart - If true, then the iterator will return startEntry, if it is in the dictionary.
public DictionaryIterator iterator(java.lang.Object startEntry,
boolean includeStart,
java.lang.Object stopEntry,
boolean includeStop)
Creates an iterator that starts iterating at the specified startEntry and stops iterating at the specified stopEntry. If either entry does not exist in the block, the first entry greater than the entry provided will be used.
startEntry - the name of the entry to start iterating at, or null to start at the first entry
includeStart - If true, then the iterator will return startEntry, if it is in the dictionary.
stopEntry - the name of the entry to stop iterating at, or null to stop after the last entry
includeStop - if true and stopEntry is non-null, then the iterator will return stopEntry, if it is in the dictionary.
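The start and stop semantics resemble range views over a sorted set, so `java.util.TreeSet` gives a rough analogy. This sketch is not the dictionary's iterator: `RangeIterDemo` and `range` are hypothetical, entry names are assumed to be `String`s, and how a missing stopEntry interacts with includeStop is not spelled out above, so the sketch simply stops before any entry greater than stopEntry.

```java
import java.util.Arrays;
import java.util.NavigableSet;
import java.util.TreeSet;

final class RangeIterDemo {
    /**
     * Returns the names between start and stop. A null bound means
     * "unbounded on that side"; the include flags control whether a bound
     * that is present in the set is itself returned.
     */
    static NavigableSet<String> range(TreeSet<String> names,
                                      String start, boolean includeStart,
                                      String stop, boolean includeStop) {
        if (start != null && stop != null) {
            if (start.compareTo(stop) > 0) {
                return new TreeSet<>(); // inverted range: nothing to iterate
            }
            return names.subSet(start, includeStart, stop, includeStop);
        }
        if (start != null) {
            return names.tailSet(start, includeStart);
        }
        if (stop != null) {
            return names.headSet(stop, includeStop);
        }
        return names;
    }

    public static void main(String[] args) {
        TreeSet<String> names =
                new TreeSet<>(Arrays.asList("apple", "banana", "cherry", "date"));
        // "b" is not an entry, so iteration begins at the first greater entry.
        for (String name : range(names, "b", true, "cherry", true)) {
            System.out.println(name);
        }
    }
}
```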
public DictionaryIterator iterator(int begin,
int end)
begin - the beginning index in the dictionary, counting from 0. If this value is less than zero it will be clamped to zero.
end - the ending index in the dictionary, counting from 0. If this value is greater than the number of entries in the dictionary, it will be limited to that number.
Map.Entry interface
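The clamping of begin and end can be written out in a few lines. This is a sketch only: `ClampDemo` is a hypothetical helper, and whether end is treated as inclusive or exclusive is not stated above, so the code only normalizes the two indices and does not iterate.

```java
final class ClampDemo {
    /** Returns {begin, end} after clamping begin to >= 0 and end to <= size. */
    static int[] clamp(int begin, int end, int size) {
        int clampedBegin = Math.max(begin, 0);
        int clampedEnd = Math.min(end, size);
        return new int[] {clampedBegin, clampedEnd};
    }

    public static void main(String[] args) {
        int[] bounds = clamp(-3, 50, 10);
        System.out.println(bounds[0] + ".." + bounds[1]);
    }
}
```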
public int[][] merge(IndexEntry entryFactory,
NameEncoder encoder,
DiskDictionary[] dicts,
EntryMapper[] mappers,
int[] starts,
int[][] postIDMaps,
java.io.RandomAccessFile mDictFile,
PostingsOutput[] postOut,
boolean appendPostings)
throws java.io.IOException
entryFactory - An index entry that we can use to generate entries for the merged dictionary.
encoder - An encoder for the names in this dictionary.
dicts - The dictionaries to merge.
mappers - A set of entry mappers that will be applied to the dictionaries as entries are considered for the merge. If this parameter is null, then the entries in the merged dictionary will be renumbered in order of increasing name.
starts - The starting IDs for the new partition.
postIDMaps - Maps from old to new IDs for the IDs in our postings.
mDictFile - The file where the merged dictionary will be written.
postOut - The output where the postings for the merged dictionary will be written
appendPostings - true if postings should be appended rather than merged
java.io.IOException - when there is an error during the merge.
public int[][] merge(IndexEntry entryFactory,
NameEncoder encoder,
PartitionStats partStats,
DiskDictionary[] dicts,
EntryMapper[] mappers,
int[] starts,
int[][] postIDMaps,
java.io.RandomAccessFile mDictFile,
PostingsOutput[] postOut,
boolean appendPostings)
throws java.io.IOException
entryFactory - An index entry that we can use to generate entries for the merged dictionary.
encoder - An encoder for the names in this dictionary.
partStats - a set of partition statistics to which we'll add during the merge.
dicts - The dictionaries to merge.
mappers - A set of entry mappers that will be applied to the dictionaries as entries are considered for the merge. If this parameter is null, then the entries in the merged dictionary will be renumbered in order of increasing name.
starts - The starting IDs for the new partition.
postIDMaps - Maps from old to new IDs for the IDs in our postings.
mDictFile - The file where the merged dictionary will be written.
postOut - The output where the postings for the merged dictionary will be written
appendPostings - true if postings should be appended rather than merged
java.io.IOException - when there is an error during the merge.
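Merging several sorted dictionaries into one, with entries renumbered in order of increasing name, is commonly done as a heap-based k-way merge; the nested `DiskDictionary.HE` class is described as a heap entry for merging. The sketch below shows that general technique with `java.util.PriorityQueue`. It is an assumption about the approach, not the actual `merge` code: `HeapMergeDemo` and `Cursor` are hypothetical, inputs are sorted `String[]` arrays, and postings, mappers and files are left out entirely.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

final class HeapMergeDemo {
    /** A cursor into one sorted input; plays the role of a heap entry. */
    static final class Cursor implements Comparable<Cursor> {
        final String[] names;
        int pos;

        Cursor(String[] names) {
            this.names = names;
        }

        String current() {
            return names[pos];
        }

        boolean advance() {
            return ++pos < names.length;
        }

        @Override
        public int compareTo(Cursor other) {
            return current().compareTo(other.current());
        }
    }

    /** Merges sorted inputs; a name's index in the result stands in for its new ID. */
    static List<String> merge(String[][] dicts) {
        PriorityQueue<Cursor> heap = new PriorityQueue<>();
        for (String[] dict : dicts) {
            if (dict.length > 0) {
                heap.add(new Cursor(dict));
            }
        }
        List<String> merged = new ArrayList<>();
        while (!heap.isEmpty()) {
            Cursor top = heap.poll(); // cursor with the smallest current name
            String name = top.current();
            // The same name from several inputs becomes a single merged entry.
            if (merged.isEmpty() || !merged.get(merged.size() - 1).equals(name)) {
                merged.add(name);
            }
            if (top.advance()) {
                heap.add(top); // re-insert with its next name
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        String[][] dicts = {{"apple", "cherry"}, {"banana", "cherry", "date"}};
        System.out.println(merge(dicts));
    }
}
```

Each cursor is removed from the heap before it is advanced and re-added, so the heap ordering is never invalidated by mutating an element in place.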
public void remapPostings(IndexEntry entryFactory,
NameEncoder encoder,
PartitionStats partStats,
int[] postMap,
java.io.RandomAccessFile dictFile,
PostingsOutput[] postOut)
throws java.io.IOException
entryFactory - factory to create new entries
encoder - name encoder to write new entries
partStats - the partition stats for this partition
postMap - a mapping from old postings ids to new ids
dictFile - a file to write the new dictionary in
postOut - a set of files to write the new postings in
java.io.IOException - if there is any error reading or writing dictionaries
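The core of remapping postings is an array lookup from old ID to new ID. The sketch below shows only that step and makes two assumptions that the text above does not confirm: postings are modeled as a plain `int[]` of IDs, and the remapped IDs are re-sorted so the list stays in increasing ID order. `RemapDemo` is a hypothetical name, and no dictionary or postings files are written.

```java
import java.util.Arrays;

final class RemapDemo {
    /**
     * Maps each old ID through postMap (index = old ID, value = new ID)
     * and returns the new IDs in increasing order.
     */
    static int[] remap(int[] postings, int[] postMap) {
        int[] remapped = new int[postings.length];
        for (int i = 0; i < postings.length; i++) {
            remapped[i] = postMap[postings[i]];
        }
        Arrays.sort(remapped); // assumption: postings are kept sorted by ID
        return remapped;
    }

    public static void main(String[] args) {
        int[] postMap = {0, 5, 2, 9}; // old ID 1 -> 5, 2 -> 2, 3 -> 9
        System.out.println(Arrays.toString(remap(new int[] {1, 2, 3}, postMap)));
    }
}
```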
protected void customSetup(IndexEntry me,
QueryEntry e,
int start,
int[] postIDMap)
me - the entry onto which we're merging postings.
e - the entry which we are about to merge into the merged entry.
start - the new starting ID for documents for the partition from which the entry we're going to merge is drawn.
postIDMap - a map from old to new IDs for the postings that we're about to merge.