com.sun.labs.minion.indexer.postings
Class DFOPostings

java.lang.Object
  extended by com.sun.labs.minion.indexer.postings.DFOPostings
All Implemented Interfaces:
Postings

public class DFOPostings
extends java.lang.Object
implements Postings

A postings class for storing IDs, frequencies, and field and word position information. The data is encoded into two buffers.

The first buffer contains the document ID and frequency for each document, along with an offset into the second buffer where that document's field and position information is stored. It is structured in the following way:

  1. The number of IDs in the postings is byte encoded.
  2. The last ID in the postings list is byte encoded.
  3. The last offset in the postings list is byte encoded.
  4. The number of skips in the skip table is byte encoded.
  5. The skip table. The number of entries per skip is dependent on the application. The skip table has the following structure.
    1. The number of skips is byte encoded.
    2. For each skip we encode:
      1. The ID of the skipped entry. This is byte encoded as a delta from the previous ID in the skip table.
      2. The position in the encoded data to skip to. This is byte encoded as a delta from the previous position in the skip table. Note that this position is relative to the end of the skip table!
      3. The offset into the second buffer for this document.
  6. For each document we encode:
    1. The ID of the document, byte encoded as a delta from the previous ID in the postings list.
    2. The term frequency for this ID, byte encoded as is.
    3. The offset in the second buffer where the position and field information can be found for this document.
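
The per-document part of this layout can be sketched as follows. This is a minimal illustration, not the Minion implementation: the class name DfoLayoutSketch and its helpers are hypothetical, "byte encoded" is assumed to mean a variable-length encoding of 7 bits per byte, the offset is written as a delta (as the description of encode() suggests), and the skip table is omitted by writing a skip count of zero.

```java
import java.io.ByteArrayOutputStream;

/** Hypothetical sketch of the first-buffer layout; not the Minion implementation. */
final class DfoLayoutSketch {

    /** Writes a non-negative int in 7-bit groups, low bits first; a set high bit means more bytes follow. */
    static void writeVInt(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    /** Reads one value written by writeVInt and advances pos[0] past it. */
    static int readVInt(byte[] buf, int[] pos) {
        int value = 0;
        int shift = 0;
        int b;
        do {
            b = buf[pos[0]++] & 0xFF;
            value |= (b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        return value;
    }

    /** Header (nIDs, lastID, lastOff, nSkips = 0), then one (ID delta, freq, offset delta) triple per document. */
    static byte[] encode(int[] ids, int[] freqs, int[] offsets) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int n = ids.length;
        writeVInt(out, n);
        writeVInt(out, n == 0 ? 0 : ids[n - 1]);
        writeVInt(out, n == 0 ? 0 : offsets[n - 1]);
        writeVInt(out, 0); // assumption: no skip table in this sketch
        int prevID = 0;
        int prevOff = 0;
        for (int i = 0; i < n; i++) {
            writeVInt(out, ids[i] - prevID);      // ID as a delta from the previous ID
            writeVInt(out, freqs[i]);             // frequency as is
            writeVInt(out, offsets[i] - prevOff); // offset as a delta from the previous offset
            prevID = ids[i];
            prevOff = offsets[i];
        }
        return out.toByteArray();
    }

    /** Decodes the buffer back into rows of {id, freq, offset}. */
    static int[][] decode(byte[] buf) {
        int[] pos = {0};
        int n = readVInt(buf, pos);
        readVInt(buf, pos); // lastID, not needed for a sequential decode
        readVInt(buf, pos); // lastOff, not needed for a sequential decode
        if (readVInt(buf, pos) != 0) {
            throw new IllegalArgumentException("skip tables are not handled in this sketch");
        }
        int[][] rows = new int[n][];
        int id = 0;
        int off = 0;
        for (int i = 0; i < n; i++) {
            id += readVInt(buf, pos);
            int freq = readVInt(buf, pos);
            off += readVInt(buf, pos);
            rows[i] = new int[] {id, freq, off};
        }
        return rows;
    }

    public static void main(String[] args) {
        byte[] buf = encode(new int[] {3, 7, 300}, new int[] {1, 2, 5}, new int[] {0, 4, 10});
        for (int[] row : decode(buf)) {
            System.out.println("id=" + row[0] + " freq=" + row[1] + " offset=" + row[2]);
        }
    }
}
```

Delta encoding keeps most values small, so under the assumed variable-length scheme most entries fit in a single byte each.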

The second buffer contains encoded field and position information. For each document, the data is structured in the following way:

  1. The number of fields for the document is byte encoded.
  2. For each field we encode:
    1. The field ID is byte encoded.
    2. The number of occurrences of the field is byte encoded.
    3. The word position of each occurrence, byte encoded as a series of deltas.
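
The per-document record in the second buffer can be sketched in the same spirit. The class FieldPositionSketch and its methods are hypothetical, and "byte encoded" is again assumed to mean a variable-length encoding of 7 bits per byte; the real class works on Minion's own buffer types.

```java
import java.io.ByteArrayOutputStream;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch of one document's field and position record; not the Minion implementation. */
final class FieldPositionSketch {

    /** Writes a non-negative int in 7-bit groups, low bits first; a set high bit means more bytes follow. */
    static void writeVInt(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    /** Reads one value written by writeVInt and advances pos[0] past it. */
    static int readVInt(byte[] buf, int[] pos) {
        int value = 0;
        int shift = 0;
        int b;
        do {
            b = buf[pos[0]++] & 0xFF;
            value |= (b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        return value;
    }

    /** Encodes a map from field ID to ascending word positions for a single document. */
    static byte[] encodeDocument(Map<Integer, int[]> fieldPositions) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeVInt(out, fieldPositions.size());        // number of fields
        for (Map.Entry<Integer, int[]> e : fieldPositions.entrySet()) {
            int[] posns = e.getValue();
            writeVInt(out, e.getKey());               // field ID
            writeVInt(out, posns.length);             // number of occurrences in the field
            int prev = 0;
            for (int p : posns) {
                writeVInt(out, p - prev);             // position as a delta from the previous one
                prev = p;
            }
        }
        return out.toByteArray();
    }

    /** Decodes the record that starts at the given offset, as stored in the first buffer. */
    static Map<Integer, int[]> decodeDocument(byte[] buf, int offset) {
        int[] pos = {offset};
        int nFields = readVInt(buf, pos);
        Map<Integer, int[]> result = new TreeMap<>();
        for (int i = 0; i < nFields; i++) {
            int fieldID = readVInt(buf, pos);
            int[] posns = new int[readVInt(buf, pos)];
            int prev = 0;
            for (int j = 0; j < posns.length; j++) {
                prev += readVInt(buf, pos);
                posns[j] = prev;
            }
            result.put(fieldID, posns);
        }
        return result;
    }
}
```

Because each record states its own field and occurrence counts, a reader only needs the starting offset from the first buffer to decode one document's positions.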


Nested Class Summary
 class DFOPostings.DFOIterator
           
 
Field Summary
protected  boolean appending
          Whether we're building these postings by appending.
protected  int dataStart
          The position in the compressed representation where the data starts.
protected  Buffer dfo
          The compressed document and frequency postings.
protected  int[] ffreq
          The field frequency information.
protected  Buffer fnp
          The compressed field and position information.
protected  WriteableBuffer[] fposn
          The field position information.
protected  int freq
          The frequency of the current ID.
protected  int lastID
          The last ID in this postings list.
protected  int lastOff
          The last positions offset in this postings list.
protected static java.lang.String logTag
           
protected  int maxfdt
          The maximum frequency encountered in the postings.
protected  int nFields
          The number of fields in the current document.
protected  int nIDs
          The number of IDs in the postings.
protected  int nSkips
          The number of skips in the skip table.
protected  int[] prevFPosn
          The previous field positions.
protected  int prevID
          The previous ID encountered during indexing.
protected  int[] skipID
          The IDs in the skip table.
protected  int[] skipOff
          The offsets in the skip table.
protected  int[] skipPos
          The positions in the skip table.
protected  int skipSize
          The number of documents in a skip.
protected  int splitPoint
          After the buffers have been retrieved, this member contains the split point between the buffers holding the document, frequency, and offset data and the buffers holding the field and position information.
protected  long to
          The total number of occurrences in the postings.
 
Constructor Summary
DFOPostings()
          Makes a postings entry that is useful during indexing.
DFOPostings(ReadableBuffer input)
          Makes a postings entry that is useful during querying.
DFOPostings(ReadableBuffer input, int offset, int size, int fnpSize)
          Makes a postings entry that is useful during querying.
DFOPostings(ReadableBuffer b1, ReadableBuffer b2)
           
 
Method Summary
 void add(Occurrence o)
          Adds an occurrence to the postings list.
protected  void addFields(FieldOccurrence fo)
          Adds an occurrence to all relevant fields.
protected  void addSkip(int id, int pos, int off)
          Adds a skip to the skip table.
 void append(Postings p, int start)
          Appends another set of postings to this one.
 void append(Postings p, int start, int[] idMap)
          Appends another set of postings to this one, removing any data associated with deleted documents.
protected  int encode()
          Encodes the data for a single ID.
protected  int encodeBasic()
           
 void finish()
          Finishes off the encoding by adding any data that we collected for the last document.
 WriteableBuffer[] getBuffers()
          Gets the buffers whose contents represent the postings.
 int getLastID()
          Gets the last ID in the postings list.
 int getMaxFDT()
          Gets the maximum frequency in the postings list.
 int getN()
          Gets the number of IDs in the postings list.
 long getTotalOccurrences()
          Gets the total number of occurrences in the postings list.
protected  void init(ReadableBuffer b1, ReadableBuffer b2)
           
 PostingsIterator iterator(PostingsIteratorFeatures features)
          Gets an iterator for the postings.
 void remap(int[] idMap)
          Remaps the IDs in this postings list according to the given old-to-new ID map.
 void setSkipSize(int size)
          Sets the skip size.
 int size()
          Gets the size of the postings, in bytes.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

dfo

protected Buffer dfo
The compressed document and frequency postings.


fnp

protected Buffer fnp
The compressed field and position information.


appending

protected boolean appending
Whether we're building these postings by appending.


nIDs

protected int nIDs
The number of IDs in the postings.


to

protected long to
The total number of occurrences in the postings.


maxfdt

protected int maxfdt
The maximum frequency encountered in the postings.


splitPoint

protected int splitPoint
After the buffers have been retrieved, this member contains the split point between the buffers holding the document, frequency, and offset data and the buffers holding the field and position information.


prevID

protected int prevID
The previous ID encountered during indexing.


lastID

protected int lastID
The last ID in this postings list.


lastOff

protected int lastOff
The last positions offset in this postings list.


freq

protected int freq
The frequency of the current ID.


nFields

protected int nFields
The number of fields in the current document.


ffreq

protected int[] ffreq
The field frequency information.


prevFPosn

protected int[] prevFPosn
The previous field positions.


fposn

protected WriteableBuffer[] fposn
The field position information.


skipID

protected int[] skipID
The IDs in the skip table.


skipOff

protected int[] skipOff
The offsets in the skip table.


skipPos

protected int[] skipPos
The positions in the skip table.


nSkips

protected int nSkips
The number of skips in the skip table.


dataStart

protected int dataStart
The position in the compressed representation where the data starts.


skipSize

protected int skipSize
The number of documents in a skip.


logTag

protected static java.lang.String logTag

Constructor Detail

DFOPostings

public DFOPostings()
Makes a postings entry that is useful during indexing.


DFOPostings

public DFOPostings(ReadableBuffer input)
Makes a postings entry that is useful during querying.

Parameters:
input - the data read from a postings file.

DFOPostings

public DFOPostings(ReadableBuffer input,
                   int offset,
                   int size,
                   int fnpSize)
Makes a postings entry that is useful during querying.

Parameters:
input - the data read from a postings file.
offset - The offset in the buffer from which we should start reading. If this value is greater than 0, then we need to share the bit buffer, since we may be part of a larger postings entry that will need multiple readers.
size - The size of the data in the sub-buffer.

DFOPostings

public DFOPostings(ReadableBuffer b1,
                   ReadableBuffer b2)

Method Detail

init

protected void init(ReadableBuffer b1,
                    ReadableBuffer b2)

addSkip

protected void addSkip(int id,
                       int pos,
                       int off)
Adds a skip to the skip table.

Parameters:
id - The ID that the skip is pointing to.
pos - The position in the postings to skip to.
off - The offset into the second buffer (field and position information) for this document.
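
A skip table of this shape can be sketched as three parallel arrays, mirroring the skipID, skipPos, and skipOff fields. The class SkipTableSketch, its growth policy, and the findSkip and toDeltas helpers are hypothetical; in particular, treating a skip as usable when its ID is less than or equal to the target ID is an assumption about the lookup rule, not something this page states.

```java
import java.util.Arrays;

/** Hypothetical sketch of a skip table held in parallel arrays; not the Minion implementation. */
final class SkipTableSketch {
    int[] skipID = new int[4];
    int[] skipPos = new int[4];
    int[] skipOff = new int[4];
    int nSkips;

    /** Adds a skip, doubling the arrays when they are full. */
    void addSkip(int id, int pos, int off) {
        if (nSkips == skipID.length) {
            int capacity = nSkips * 2;
            skipID = Arrays.copyOf(skipID, capacity);
            skipPos = Arrays.copyOf(skipPos, capacity);
            skipOff = Arrays.copyOf(skipOff, capacity);
        }
        skipID[nSkips] = id;
        skipPos[nSkips] = pos;
        skipOff[nSkips] = off;
        nSkips++;
    }

    /** Binary search for the last skip whose ID is at most targetID; returns -1 if there is none. */
    int findSkip(int targetID) {
        int lo = 0;
        int hi = nSkips - 1;
        int found = -1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (skipID[mid] <= targetID) {
                found = mid;
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        return found;
    }

    /** Lists the values that would be byte encoded per skip: ID delta, position delta, offset. */
    int[] toDeltas() {
        int[] out = new int[nSkips * 3];
        int prevID = 0;
        int prevPos = 0;
        for (int i = 0; i < nSkips; i++) {
            out[3 * i] = skipID[i] - prevID;
            out[3 * i + 1] = skipPos[i] - prevPos;
            out[3 * i + 2] = skipOff[i];
            prevID = skipID[i];
            prevPos = skipPos[i];
        }
        return out;
    }
}
```

A reader looking for a given document ID would jump to skipPos of the entry returned by findSkip and decode sequentially from there, rather than decoding the list from its start.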

encodeBasic

protected int encodeBasic()

encode

protected int encode()
Encodes the data for a single ID. The encoded data consists of a delta from the previous ID, the frequency, and a delta from the previous position offset.


setSkipSize

public void setSkipSize(int size)
Sets the skip size.

Specified by:
setSkipSize in interface Postings

add

public void add(Occurrence o)
Adds an occurrence to the postings list.

Specified by:
add in interface Postings
Parameters:
o - The occurrence.

addFields

protected void addFields(FieldOccurrence fo)
Adds an occurrence to all relevant fields.

Parameters:
fo - an occurrence that includes information about what fields are currently active.

getN

public int getN()
Gets the number of IDs in the postings list.

Specified by:
getN in interface Postings
Returns:
the number of IDs in the postings list.

getLastID

public int getLastID()
Description copied from interface: Postings
Gets the last ID in the postings list.

Specified by:
getLastID in interface Postings

getMaxFDT

public int getMaxFDT()
Gets the maximum frequency in the postings list.

Specified by:
getMaxFDT in interface Postings
Returns:
the maximum frequency across all of the postings stored in this postings list.

getTotalOccurrences

public long getTotalOccurrences()
Gets the total number of occurrences in the postings list.

Specified by:
getTotalOccurrences in interface Postings
Returns:
The total number of occurrences associated with these postings.

finish

public void finish()
Finishes off the encoding by adding any data that we collected for the last document.

Specified by:
finish in interface Postings

size

public int size()
Gets the size of the postings, in bytes.

Specified by:
size in interface Postings

getBuffers

public WriteableBuffer[] getBuffers()
Gets the buffers whose contents represent the postings. These buffers can safely be written to streams. The format is as follows: NumIDs:LastID:NumSkipEntries[:skipID:skipPos]*:

Specified by:
getBuffers in interface Postings
Returns:
The buffers containing the encoded postings data.

remap

public void remap(int[] idMap)
Remaps the IDs in this postings list according to the given old-to-new ID map.

This is tricky, because we can't assume that the remapped IDs will maintain the original order of the IDs. Thus, we need to uncompress all of the IDs and then put them back together.

Specified by:
remap in interface Postings
Parameters:
idMap - A map from the IDs currently in use in the postings to new IDs.
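
The reason a full decode is needed can be sketched on plain arrays. The class RemapSketch is hypothetical and works on already-decoded IDs and frequencies; it assumes idMap is indexed by old ID and that every ID is kept.

```java
import java.util.Arrays;
import java.util.Comparator;

/** Hypothetical sketch of remapping decoded postings; not the Minion implementation. */
final class RemapSketch {

    /** Returns rows of {newID, freq}, sorted by new ID so they can be delta encoded again. */
    static int[][] remap(int[] ids, int[] freqs, int[] idMap) {
        int[][] entries = new int[ids.length][];
        for (int i = 0; i < ids.length; i++) {
            // Assumption: idMap[oldID] is the new ID for oldID.
            entries[i] = new int[] {idMap[ids[i]], freqs[i]};
        }
        // The new IDs may be out of order, so sort before re-encoding as deltas.
        Arrays.sort(entries, Comparator.comparingInt(e -> e[0]));
        return entries;
    }

    public static void main(String[] args) {
        // Old IDs 1, 2, 3 map to new IDs 3, 1, 2 (index 0 of the map is unused).
        int[][] rows = remap(new int[] {1, 2, 3}, new int[] {5, 6, 7}, new int[] {0, 3, 1, 2});
        for (int[] row : rows) {
            System.out.println("id=" + row[0] + " freq=" + row[1]);
        }
    }
}
```

Since each ID is stored as a delta from its predecessor, a change in order invalidates every delta, which is why the list has to be rebuilt rather than patched in place.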

append

public void append(Postings p,
                   int start)
Appends another set of postings to this one.

Specified by:
append in interface Postings
Parameters:
p - The postings to append. Implementers can safely assume that the postings being passed in are of the same class as the implementing class.
start - The new starting document ID for the partition that the entry was drawn from.

append

public void append(Postings p,
                   int start,
                   int[] idMap)
Appends another set of postings to this one, removing any data associated with deleted documents.

Specified by:
append in interface Postings
Parameters:
p - The postings to append. Implementers can safely assume that the postings being passed in are of the same class as the implementing class.
start - The new starting document ID for the partition that the entry was drawn from.
idMap - A map from old IDs in the given postings to new IDs with gaps removed for deleted data. If this is null, then there are no deleted documents.
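
The ID arithmetic of an append can be sketched on decoded IDs. Everything here is an assumption made for illustration: the class AppendSketch is hypothetical, document IDs within a partition are assumed to start at 1 (so a local ID is shifted by start - 1), and a negative idMap entry is assumed to mark a deleted document.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of appending decoded IDs from another partition; not the Minion implementation. */
final class AppendSketch {

    /** Appends the IDs of another postings list to target, skipping deleted documents. */
    static List<Integer> append(List<Integer> target, int[] otherIDs, int start, int[] idMap) {
        for (int oldID : otherIDs) {
            int localID = oldID;
            if (idMap != null) {
                localID = idMap[oldID];
                if (localID < 0) {
                    continue; // assumption: a negative entry marks a deleted document
                }
            }
            // Assumption: partition-local IDs start at 1, so the shift is start - 1.
            target.add(localID + start - 1);
        }
        return target;
    }

    public static void main(String[] args) {
        List<Integer> ids = new ArrayList<>(List.of(1, 2, 3));
        // Document 2 of the appended partition has been deleted.
        append(ids, new int[] {1, 2, 3, 4}, 4, new int[] {-1, 1, -1, 2, 3});
        System.out.println(ids);
    }
}
```

Passing null for idMap takes the simpler path in which every ID is only shifted by the new starting ID.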

iterator

public PostingsIterator iterator(PostingsIteratorFeatures features)
Gets an iterator for the postings.

Specified by:
iterator in interface Postings
Parameters:
features - A set of features that the iterator must support.
Returns:
A postings iterator. The iterators for these postings do not support any of the extra features available. If any extra features are requested, a warning will be logged and null will be returned.