com.sun.labs.minion.engine
Class SearchEngineImpl

java.lang.Object
  extended by com.sun.labs.minion.engine.SearchEngineImpl
All Implemented Interfaces:
Classifier, SearchEngine, Searcher, com.sun.labs.util.props.Component, com.sun.labs.util.props.Configurable

public class SearchEngineImpl
extends java.lang.Object
implements SearchEngine, com.sun.labs.util.props.Configurable

This is the main class for handling a search engine, both for indexing and retrieval operations. The engine is configured by two sets of properties: indexing properties and query properties. The valid set of properties for indexing can be found in the documentation for the IndexConfig class. The valid set of query properties can be found in the documentation for the QueryConfig class.

Indexing Documents

Indexing is done by pipelines. In the index configuration you can specify the number of pipelines that the engine should create. By creating more than one pipeline, you can index documents in parallel. There are two kinds of pipeline: synchronous and asynchronous.

A synchronous pipeline is one which blocks the caller until indexing of the document has been completed. An asynchronous pipeline contains a queue of documents to index, and the caller will not be blocked during the indexing of a document unless the queue is full. Once a document has been added to the indexing queue, control returns to the caller. Note that in the case of asynchronous indexing, the map containing your document may sit on the indexing queue for some time, so you should not attempt to change or re-use that map!
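
The warning about re-using the map can be illustrated with a small model built only on the JDK. This is a sketch under stated assumptions, not Minion code: the class AsyncQueueSketch and its method enqueueCopy are hypothetical, and a bounded ArrayBlockingQueue stands in for the engine's indexing queue. It shows one way a caller might protect itself, by handing over a private copy of each map.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/** Stand-in model of an asynchronous indexing queue; not Minion code. */
class AsyncQueueSketch {

    /** Enqueues a private copy so the caller may keep using its own map. */
    static boolean enqueueCopy(BlockingQueue<Map<String, Object>> queue,
                               Map<String, Object> doc) {
        // offer() returns false instead of blocking when the queue is full;
        // put() would block the caller at that point.
        return queue.offer(new HashMap<>(doc));
    }

    public static void main(String[] args) {
        BlockingQueue<Map<String, Object>> queue = new ArrayBlockingQueue<>(2);
        Map<String, Object> doc = new HashMap<>();
        doc.put("title", "first");
        enqueueCopy(queue, doc);

        // Re-using the caller's map is now safe: the queued copy is unaffected.
        doc.put("title", "second");
        enqueueCopy(queue, doc);

        System.out.println(queue.peek().get("title")); // prints "first"
        System.out.println(enqueueCopy(queue, doc));   // prints "false": queue is full
    }
}
```

With the real engine the same idea would amount to passing a fresh or copied map to each index call. Whether the engine copies the map itself is not stated here, so the copy is made on the caller's side.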

To index documents, you set up the engine using a set of index configuration properties and then simply call the SearchEngine.index(java.lang.String, java.util.Map) method. This will route the document to a pipeline that is ready to index. Once you've indexed all of your documents, you can call the SearchEngine.flush() method to make sure that all of your indexed data is written to the disk.
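
The call order described above (index each document, then flush) might look like the following sketch. It runs without Minion on the classpath, so it uses a hypothetical stand-in: TinyEngine copies only the shapes of index(String, Map) and flush(), and InMemoryEngine is an invented in-memory model of "visible after flush". The real methods also declare SearchEngineException and take a raw java.util.Map, which the stand-in omits.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

/** Call-order sketch. TinyEngine is a hypothetical stand-in, not Minion. */
class IndexThenFlushSketch {

    /** Mirrors only the two method shapes named above. */
    interface TinyEngine {
        void index(String key, Map<String, Object> document);
        void flush();
    }

    /** In-memory stand-in: documents become visible only after flush(). */
    static class InMemoryEngine implements TinyEngine {
        private final Map<String, Map<String, Object>> pending = new LinkedHashMap<>();
        private final Map<String, Map<String, Object>> searchable = new LinkedHashMap<>();

        public void index(String key, Map<String, Object> document) {
            pending.put(key, document); // the same key replaces earlier data
        }

        public void flush() {
            searchable.putAll(pending);
            pending.clear();
        }

        boolean isSearchable(String key) {
            return searchable.containsKey(key);
        }
    }

    public static void main(String[] args) {
        InMemoryEngine engine = new InMemoryEngine();
        for (int i = 0; i < 3; i++) {
            // A fresh map per document, so nothing handed over is re-used.
            Map<String, Object> doc = new HashMap<>();
            doc.put("title", "Document " + i);
            engine.index("doc-" + i, doc);
        }
        System.out.println(engine.isSearchable("doc-0")); // prints "false"
        engine.flush();
        System.out.println(engine.isSearchable("doc-0")); // prints "true"
    }
}
```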


Field Summary
protected  ClassifierManager classManager
          The manager for the classifier partitions in this index.
protected  ClassifierMemoryPartition classMemoryPartition
          The memory partition for building classifiers
protected  ClusterManager clusterManager
          The manager for the cluster partitions in this index.
protected  ClusterMemoryPartition clusterMemoryPartition
          The memory partition for building feature clusters
protected  com.sun.labs.util.props.ConfigurationManager cm
          The configuration manager for this engine.
protected static java.text.DecimalFormat form
          A format object for formatting the output.
protected  IndexConfig indexConfig
          The configuration for the index and the indexing engine.
protected  java.util.concurrent.BlockingQueue indexingQueue
          A blocking queue upon which we can put indexable things.
protected  PartitionManager invFilePartitionManager
          The manager for the partitions in this index.
protected static java.lang.String logTag
          Our log tag.
protected  MetaDataStoreImpl metaDataStore
          The meta data storage for this engine/index
protected  Pipeline[] pipes
          The pipelines to use for indexing.
protected  java.lang.Thread[] pipeThreads
          Threads to run our pipelines.
static java.lang.String PROP_BUILD_CLASSIFIERS
          A property indicating whether we should build classifiers while indexing or not.
static java.lang.String PROP_CLASS_MANAGER
           
static java.lang.String PROP_CLASS_MEMORY_PARTITION
           
static java.lang.String PROP_CLASSIFIER_CLASS_NAME
           
static java.lang.String PROP_CLUSTER_MANAGER
           
static java.lang.String PROP_CLUSTER_MEMORY_PARTITION
           
static java.lang.String PROP_DUMPER
           
static java.lang.String PROP_INDEX_CONFIG
           
static java.lang.String PROP_INDEXING_QUEUE_LENGTH
           
static java.lang.String PROP_INV_FILE_PARTITION_MANAGER
           
static java.lang.String PROP_LONG_INDEXING_RUN
          A property that indicates that the search engine will be used for a long indexing run with no querying going on during that time.
static java.lang.String PROP_MIN_MEMORY_PERCENT
           
static java.lang.String PROP_NUM_PIPELINES
           
static java.lang.String PROP_PIPELINE_FACTORY
           
static java.lang.String PROP_PROFILERS
           
static java.lang.String PROP_QUERY_CONFIG
           
protected  QueryConfig queryConfig
          The configuration for the query engine.
 
Fields inherited from interface com.sun.labs.minion.Searcher
GRAMMAR_LUCENE, GRAMMAR_STRICT, GRAMMAR_WEB, GRAMMARS, OP_AND, OP_OR, OP_PAND
 
Constructor Summary
SearchEngineImpl()
          Gets a search engine implementation.
 
Method Summary
 void addIndexListener(IndexListener il)
          Adds a listener for events in the index backing this search engine.
 void addQueryStats(QueryStats qs)
           
 ResultSet allTerms(java.util.Collection<java.lang.String> terms, java.util.Collection<java.lang.String> fields)
          Builds a result set containing all of the given terms in any of the given fields.
 ResultSet anyTerms(java.util.Collection<java.lang.String> terms, java.util.Collection<java.lang.String> fields)
          Builds a result set of the documents containing any of the given terms in any of the given fields.
 void checkDump()
           
 boolean checkLowMemory()
          Determines if available memory is low.
 void classify(java.lang.String[] docKeys, java.lang.String[] classNames)
          Creates a manual assignment of a set of documents to a set of classes.
 void close()
          Closes the engine.
 Document createDocument(java.lang.String key)
          Creates a new document with a given key.
 FieldInfo defineField(FieldInfo field)
          Defines a given field.
 void delete(java.util.List<java.lang.String> docs)
          Deletes a number of documents from the index.
 void delete(java.lang.String key)
          Deletes a document from the index.
protected  void dump()
          Dumps any data currently held in memory to the disk via our configured dumper.
 void export(java.io.PrintWriter o)
          Outputs an XML representation of the search index including all saved and vectored fields.
 void flush()
          Flushes the indexed material currently held in memory to the disk, making it available for searching.
 void flushClassifiers()
          Dumps all the classifiers that have been trained since the last dump, or since the search engine started.
 java.util.List getAllFieldValues(java.lang.String field, java.lang.String key)
          Gets all of the field values associated with a given field in a given document.
 java.lang.String[] getClasses()
          Returns the names of the classes for which classifiers are defined.
 ClassifierModel getClassifier(java.lang.String name)
           
 ClassifierManager getClassifierManager()
          Gets the classifier manager for this search engine.
 ClusterManager getClusterManager()
          Gets the cluster manager for this search engine.
 com.sun.labs.util.props.ConfigurationManager getConfigurationManager()
           
 double getDistance(java.lang.String k1, java.lang.String k2, java.lang.String name)
          Gets the distance between two documents, based on the values stored in a given feature vector saved field.
 Document getDocument(java.lang.String key)
          Gets a document with a given key.
 java.util.Iterator<Document> getDocumentIterator()
          Gets an iterator for all of the non-deleted documents in the collection.
 java.util.List<Document> getDocuments(java.util.List<java.lang.String> keys)
          Gets a list of documents with the given keys.
 DocKeyEntry getDocumentTerm(java.lang.String key)
           
 DocumentVector getDocumentVector(Document doc, java.lang.String field)
          Creates a document vector for the given document as though it occurred in the index.
 DocumentVector getDocumentVector(Document doc, WeightedField[] fields)
          Creates a composite document vector for the given document as though it occurred in the index.
 DocumentVector getDocumentVector(java.lang.String key)
          Gets a document vector for the given key.
 DocumentVector getDocumentVector(java.lang.String key, java.lang.String field)
          Gets a document vector for the given key.
 DocumentVector getDocumentVector(java.lang.String key, WeightedField[] fields)
          Gets a composite document vector for the given linear combination of vectored fields for the given key.
 FieldInfo getFieldInfo(java.lang.String name)
          Gets the information for a field.
 java.util.Iterator getFieldIterator(java.lang.String field)
          Gets an iterator for all the values in a field.
 java.util.Collection getFieldNames()
          Gets the names of all the fields known in the index
 java.lang.Object getFieldValue(java.lang.String field, java.lang.String key)
          Gets a single field value associated with a given field in a given document.
 HLPipeline getHLPipeline()
          Gets a pipeline that can be used for highlighting.
 IndexConfig getIndexConfig()
          Gets the index configuration in use by this search engine.
 boolean getLongIndexingRun()
          Indicates whether this search engine is being used for a long indexing run.
 PartitionManager getManager()
          Gets the partition manager associated with this search engine.
 java.util.SortedSet<FieldValue> getMatching(java.lang.String field, java.lang.String pattern)
          Gets the values for the given field that match the given pattern.
 MetaDataStore getMetaDataStore()
          Gets the MetaDataStore for this index.
 java.lang.String getName()
          Gets the name of this engine, if one has been assigned by the application.
 int getNDocs()
          Gets the number of documents that the index contains.
 PartitionManager getPM()
          Gets the partition manager for this search engine.
 java.util.List getProfilers()
           
 QueryConfig getQC()
           
 QueryConfig getQueryConfig()
          Gets the query configuration being used by this search engine.
 QueryStats getQueryStats()
          Gets the combined query stats for any queries run by the engine.
 ResultSet getResults(java.util.Collection<java.lang.String> keys)
          Gets a set of results corresponding to the document keys passed in.
 ResultSet getResults(java.util.Map<java.lang.String,java.lang.Float> keys)
          Gets a set of results corresponding to the document keys and scores passed in.
 ResultSet getSimilar(java.lang.String key, java.lang.String name)
          Gets a set of results ordered by similarity to the given document, calculated by computing the Euclidean distance based on the feature vector stored in the given field.
 java.util.List<FieldValue> getSimilarClassifiers(java.lang.String cname, int n)
           
 java.util.List<WeightedFeature> getSimilarClassifierTerms(java.lang.String cname1, java.lang.String cname2, int n)
           
 SimpleIndexer getSimpleIndexer()
          Gets a simple indexer that can be used for simple indexing.
 TermStats getTermStats(java.lang.String term)
          Gets the collection level term statistics for the given term.
 java.util.Set<java.lang.String> getTermVariations(java.lang.String term)
          Gets the set of variations on a term that will be generated by default when searching for the term.
 java.util.List<FieldFrequency> getTopFieldValues(java.lang.String field, int n, boolean ignoreCase)
          Gets a list of the top n most frequent field values for a given named field.
 ResultSet getTrainingDocuments(java.lang.String className)
          Returns the set of documents that was used to train the classifier for the class with the provided class name.
 void index(Document document)
          Indexes a document into the database.
 void index(Indexable doc)
          Indexes a document into the database.
 void index(java.lang.String key, java.util.Map document)
          Indexes a document into the database.
 boolean isIndexed(java.lang.String key)
          Checks to see if a document is in the index.
 boolean merge()
          Performs a merge in the index, if one is necessary.
 void newProperties(com.sun.labs.util.props.PropertySheet ps)
           
 void optimize()
          Merges all of the partitions in the index into a single partition.
 void purge()
          Deletes all of the data in the index.
 void reclassifyIndex(java.lang.String className)
          Causes the engine to reclassify all documents against the classifier for the given class name.
 void recover()
          Attempts to recover the index after an unruly shutdown.
 void removeIndexListener(IndexListener il)
          Removes an index listener from the listeners.
 void resetQueryStats()
          Resets the query stats for the engine.
 ResultSet search(Element el)
          Runs a query against the index, returning a set of results.
 ResultSet search(Element el, java.lang.String sortOrder)
          Runs a query against the index, returning a set of results.
 ResultSet search(java.lang.String query)
          Runs a query against the index, returning a set of results.
 ResultSet search(java.lang.String query, java.lang.String sortOrder)
          Runs a query against the index, returning a set of results.
 ResultSet search(java.lang.String query, java.lang.String sortOrder, int defaultOperator, int grammar)
          Runs a query against the index, returning a set of results.
 void setDefaultFieldInfo(FieldInfo field)
          Sets the default field information to use when unknown fields are encountered during indexing.
 void setLongIndexingRun(boolean longIndexingRun)
          Sets the indicator that this is a long indexing run, in which case term statistics dictionaries and document vector lengths will not be calculated until the engine is shut down.
 void setProfilers(java.util.List profilers)
           
 void setQueryConfig(QueryConfig queryConfig)
          Sets the query configuration to use for subsequent queries.
protected  double toMB(long x)
           
 java.lang.String toString()
          Gets a string description of the search engine.
 void trainClass(ResultSet results, java.lang.String className, java.lang.String fieldName)
          Generates a classifier based on the documents in the provided result set.
 void trainClass(ResultSet results, java.lang.String className, java.lang.String fieldName, Progress p)
          Generates a classifier based on the documents in the provided result set.
 void trainClass(ResultSet results, java.lang.String className, java.lang.String fieldName, java.lang.String fromField)
          Generates a classifier based on the documents in the provided result set.
 void trainClass(ResultSet results, java.lang.String className, java.lang.String fieldName, java.lang.String fromField, Progress progress)
          Generates a classifier based on the documents in the provided result set.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Field Detail

indexConfig

protected IndexConfig indexConfig
The configuration for the index and the indexing engine.


queryConfig

protected QueryConfig queryConfig
The configuration for the query engine.


metaDataStore

protected MetaDataStoreImpl metaDataStore
The meta data storage for this engine/index


cm

protected com.sun.labs.util.props.ConfigurationManager cm
The configuration manager for this engine.


invFilePartitionManager

protected PartitionManager invFilePartitionManager
The manager for the partitions in this index.


classManager

protected ClassifierManager classManager
The manager for the classifier partitions in this index.


clusterManager

protected ClusterManager clusterManager
The manager for the cluster partitions in this index.


classMemoryPartition

protected ClassifierMemoryPartition classMemoryPartition
The memory partition for building classifiers


clusterMemoryPartition

protected ClusterMemoryPartition clusterMemoryPartition
The memory partition for building feature clusters


indexingQueue

protected java.util.concurrent.BlockingQueue indexingQueue
A blocking queue upon which we can put indexable things.


pipes

protected Pipeline[] pipes
The pipelines to use for indexing.


pipeThreads

protected java.lang.Thread[] pipeThreads
Threads to run our pipelines.


form

protected static java.text.DecimalFormat form
A format object for formatting the output.


logTag

protected static java.lang.String logTag
Our log tag.


PROP_INDEX_CONFIG

@ConfigComponent(type=IndexConfig.class)
public static final java.lang.String PROP_INDEX_CONFIG
See Also:
Constant Field Values

PROP_QUERY_CONFIG

@ConfigComponent(type=QueryConfig.class)
public static final java.lang.String PROP_QUERY_CONFIG
See Also:
Constant Field Values

PROP_PIPELINE_FACTORY

@ConfigComponent(type=PipelineFactory.class)
public static final java.lang.String PROP_PIPELINE_FACTORY
See Also:
Constant Field Values

PROP_INV_FILE_PARTITION_MANAGER

@ConfigComponent(type=PartitionManager.class)
public static final java.lang.String PROP_INV_FILE_PARTITION_MANAGER
See Also:
Constant Field Values

PROP_BUILD_CLASSIFIERS

@ConfigBoolean(defaultValue=false)
public static final java.lang.String PROP_BUILD_CLASSIFIERS
A property indicating whether we should build classifiers while indexing or not.

See Also:
Constant Field Values

PROP_CLASS_MANAGER

@ConfigComponent(type=ClassifierManager.class)
public static final java.lang.String PROP_CLASS_MANAGER
See Also:
Constant Field Values

PROP_CLUSTER_MANAGER

@ConfigComponent(type=ClusterManager.class)
public static final java.lang.String PROP_CLUSTER_MANAGER
See Also:
Constant Field Values

PROP_CLASS_MEMORY_PARTITION

@ConfigComponent(type=MemoryPartition.class)
public static final java.lang.String PROP_CLASS_MEMORY_PARTITION
See Also:
Constant Field Values

PROP_CLUSTER_MEMORY_PARTITION

@ConfigComponent(type=ClusterMemoryPartition.class)
public static final java.lang.String PROP_CLUSTER_MEMORY_PARTITION
See Also:
Constant Field Values

PROP_MIN_MEMORY_PERCENT

@ConfigDouble(defaultValue=0.3)
public static final java.lang.String PROP_MIN_MEMORY_PERCENT
See Also:
Constant Field Values

PROP_DUMPER

@ConfigComponent(type=Dumper.class)
public static final java.lang.String PROP_DUMPER
See Also:
Constant Field Values

PROP_NUM_PIPELINES

@ConfigInteger(defaultValue=1)
public static final java.lang.String PROP_NUM_PIPELINES
See Also:
Constant Field Values

PROP_INDEXING_QUEUE_LENGTH

@ConfigInteger(defaultValue=256)
public static final java.lang.String PROP_INDEXING_QUEUE_LENGTH
See Also:
Constant Field Values

PROP_CLASSIFIER_CLASS_NAME

@ConfigString(defaultValue="com.sun.labs.minion.classification.Rocchio")
public static final java.lang.String PROP_CLASSIFIER_CLASS_NAME
See Also:
Constant Field Values

PROP_PROFILERS

@ConfigComponentList(type=Profiler.class)
public static final java.lang.String PROP_PROFILERS
See Also:
Constant Field Values

PROP_LONG_INDEXING_RUN

@ConfigBoolean(defaultValue=false)
public static final java.lang.String PROP_LONG_INDEXING_RUN
A property that indicates that the search engine will be used for a long indexing run with no querying going on during that time. If this property is set to true (the default is false), then no term statistics dictionaries or document vector lengths will be calculated during indexing or merging of partitions. Additionally, at shutdown, the extant partitions will be merged into a single partition and then term statistics and document vector lengths will be calculated for that single new partition.

See Also:
Constant Field Values
Constructor Detail

SearchEngineImpl

public SearchEngineImpl()
Gets a search engine implementation.

Method Detail

defineField

public FieldInfo defineField(FieldInfo field)
                      throws SearchEngineException
Description copied from interface: SearchEngine
Defines a given field. Once a field has been defined, its attributes and type cannot be changed, although it can be redefined with the same attributes and types.

Specified by:
defineField in interface SearchEngine
Parameters:
field - the field to define
Returns:
the defined field information object, including an ID assigned by the engine.
Throws:
SearchEngineException - if the field is already defined and there is a mismatch in the attributes or type of the given field or if there is an error adding the field to the index

setDefaultFieldInfo

public void setDefaultFieldInfo(FieldInfo field)
Description copied from interface: SearchEngine
Sets the default field information to use when unknown fields are encountered during indexing.

Specified by:
setDefaultFieldInfo in interface SearchEngine
Parameters:
field - an exemplar field information object that has the attributes and type that should be used when an unknown field is encountered during indexing. Note that any name associated with this particular object will be ignored; only the attributes and type associated with this field are used.
See Also:
for how to define a field to use during indexing

getFieldInfo

public FieldInfo getFieldInfo(java.lang.String name)
Gets the information for a field.

Specified by:
getFieldInfo in interface SearchEngine
Parameters:
name - the name of the field for which we want information
Returns:
the information associated with this field, or null if this name is not the name of a defined field.

getTermVariations

public java.util.Set<java.lang.String> getTermVariations(java.lang.String term)
Description copied from interface: SearchEngine
Gets the set of variations on a term that will be generated by default when searching for the term. The composition of the set depends on the configuration of the engine and will be returned in no particular order.

Specified by:
getTermVariations in interface SearchEngine
Parameters:
term - the term for which we want variants
Returns:
the set of variants for the term, which will always include the term itself. The case of the variants will match (as much as possible) the case of the provided term.

getTermStats

public TermStats getTermStats(java.lang.String term)
Description copied from interface: SearchEngine
Gets the collection level term statistics for the given term.

Specified by:
getTermStats in interface SearchEngine
Parameters:
term - the term for which we want the statistics
Returns:
the statistics associated with the given term, or null if the term does not occur in the collection.

getDocument

public Document getDocument(java.lang.String key)
Description copied from interface: SearchEngine
Gets a document with a given key.

Specified by:
getDocument in interface SearchEngine
Parameters:
key - the key for the document to retrieve.
Returns:
a document with the given key. If the given key does not occur in the index, then null is returned.
See Also:
SearchEngine.index(Document), SearchEngine.createDocument(java.lang.String), SimpleIndexer.indexDocument(Document)

getDocuments

public java.util.List<Document> getDocuments(java.util.List<java.lang.String> keys)
Description copied from interface: SearchEngine
Gets a list of documents with the given keys.

Specified by:
getDocuments in interface SearchEngine
Parameters:
keys - the list of keys for which we want documents
Returns:
a list of the documents corresponding to the keys in the list. Note that this list will not include documents that have been deleted and no documents will be returned for keys that do not exist in the index, so there may not be a one-to-one correspondence between the keys in keys and the documents in the returned list.
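
Because the returned list can be shorter than the list of keys, a caller may want to know which keys produced no document. The helper below is a minimal sketch of that bookkeeping, assuming nothing about Minion: MissingKeysSketch and missingKeys are hypothetical names, and the key-extraction function is passed in because this page does not show how a Document exposes its key.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

/** Hypothetical helper for reconciling requested keys with returned items. */
class MissingKeysSketch {

    /** Returns the requested keys that have no matching item, in request order. */
    static <T> List<String> missingKeys(List<String> requested, List<T> returned,
                                        Function<T, String> keyOf) {
        Set<String> found = new LinkedHashSet<>();
        for (T item : returned) {
            found.add(keyOf.apply(item));
        }
        List<String> missing = new ArrayList<>();
        for (String key : requested) {
            if (!found.contains(key)) {
                missing.add(key);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        List<String> requested = Arrays.asList("a", "b", "c");
        // Strings stand in for returned documents; each one is its own key here.
        List<String> returned = Arrays.asList("a", "c");
        System.out.println(missingKeys(requested, returned, Function.identity())); // prints "[b]"
    }
}
```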

createDocument

public Document createDocument(java.lang.String key)
Description copied from interface: SearchEngine
Creates a new document with a given key.

Specified by:
createDocument in interface SearchEngine
Parameters:
key - the key for the new document
Returns:
a new document with the given key. If the given key is already in the index, then null is returned.
See Also:
SearchEngine.index(Document), SearchEngine.getDocument(java.lang.String), SimpleIndexer.indexDocument(Document)

index

public void index(java.lang.String key,
                  java.util.Map document)
           throws SearchEngineException
Indexes a document into the database. If the document already exists in the database, the new information will replace the old.

Note that simply calling index will not make a document available for searching. Documents are not available until they are flushed to disk. This can be accomplished using the flush method.

Specified by:
index in interface SearchEngine
Parameters:
key - The document key for this document. The key should be unique in the index. If the key passed in matches a document that is already in the index, the information for this document will replace the existing one.
document - A map from field names to the value for that field. If a particular field has a type or attributes associated with it, they will be respected during indexing. If a field has no attributes associated with it, the field will be tokenized and indexed.
Throws:
SearchEngineException - if there are any errors during the indexing.
See Also:
IndexConfig.IndexConfig(java.lang.String)

index

public void index(Indexable doc)
           throws SearchEngineException
Description copied from interface: SearchEngine
Indexes a document into the database. If the document already exists in the database, the new information will replace the old.

Note that simply calling index will not make a document available for searching. Documents are not available until they are flushed to disk. This can be accomplished using the flush method.

Specified by:
index in interface SearchEngine
Parameters:
doc - the document to index.
Throws:
SearchEngineException - if there are any errors during the indexing.

index

public void index(Document document)
           throws SearchEngineException
Description copied from interface: SearchEngine
Indexes a document into the database. If the document already exists in the database, the new information will replace the old.

In this case, the data for the document will be flushed to disk as soon as the document is indexed. For indexing a large number of documents, you may wish to consider the SimpleIndexer.indexDocument(Document) method, which will allow you more control over when the data will be flushed to disk.

Specified by:
index in interface SearchEngine
Parameters:
document - a document to be indexed
Throws:
SearchEngineException - if there are any errors during the indexing.
See Also:
SearchEngine.getDocument(java.lang.String), SimpleIndexer.indexDocument(Document)

addIndexListener

public void addIndexListener(IndexListener il)
Description copied from interface: SearchEngine
Adds a listener for events in the index backing this search engine.

Specified by:
addIndexListener in interface SearchEngine
Parameters:
il - the listener to add.

removeIndexListener

public void removeIndexListener(IndexListener il)
Description copied from interface: SearchEngine
Removes an index listener from the listeners.

Specified by:
removeIndexListener in interface SearchEngine
Parameters:
il - the index listener to remove.

checkDump

public void checkDump()
               throws SearchEngineException
Throws:
SearchEngineException

dump

protected void dump()
             throws SearchEngineException
Dumps any data currently held in memory to the disk via our configured dumper.

Throws:
SearchEngineException

flush

public void flush()
           throws SearchEngineException
Flushes the indexed material currently held in memory to the disk, making it available for searching.

Specified by:
flush in interface SearchEngine
Throws:
SearchEngineException - If there is any error flushing the in-memory data.

isIndexed

public boolean isIndexed(java.lang.String key)
Checks to see if a document is in the index.

Specified by:
isIndexed in interface SearchEngine
Parameters:
key - the key for the document that we wish to check.
Returns:
true if the document is in the index. A document is considered to be in the index if a document with the given key appears in the index and has not been deleted.

delete

public void delete(java.lang.String key)
Deletes a document from the index.

Specified by:
delete in interface SearchEngine
Parameters:
key - The key for the document to delete.

delete

public void delete(java.util.List<java.lang.String> docs)
            throws SearchEngineException
Deletes a number of documents from the index.

Specified by:
delete in interface SearchEngine
Parameters:
docs - The keys of the documents to delete
Throws:
SearchEngineException - If there is any error deleting the documents.

search

public ResultSet search(java.lang.String query)
                 throws SearchEngineException
Runs a query against the index, returning a set of results.

Specified by:
search in interface SearchEngine
Specified by:
search in interface Searcher
Parameters:
query - The query to run, in our query syntax.
Returns:
An instance of ResultSet containing the results of the query.
Throws:
SearchEngineException - If there is any error during the search.
See Also:
ResultSet

search

public ResultSet search(java.lang.String query,
                        java.lang.String sortOrder)
                 throws SearchEngineException
Runs a query against the index, returning a set of results.

Specified by:
search in interface SearchEngine
Specified by:
search in interface Searcher
Parameters:
query - The query to run, in our query syntax.
sortOrder - How the results should be sorted. This is a set of comma-separated field names, each preceded by a + (for increasing order) or by a - (for decreasing order).
Returns:
An instance of ResultSet containing the results of the query.
Throws:
SearchEngineException - If there is any error during the search.
See Also:
ResultSet
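
The sortOrder format described above (comma-separated field names, each prefixed with + or -) can be assembled with a small helper. This is a sketch only: SortSpecSketch and build are hypothetical names, and the field names price and date are invented for illustration. No whitespace is emitted around the commas, since the description does not say whether it is tolerated.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

/** Hypothetical helper that builds a sort specification string. */
class SortSpecSketch {

    /**
     * Builds a sort specification from field names mapped to a direction flag:
     * true for increasing order (+), false for decreasing order (-).
     */
    static String build(Map<String, Boolean> fields) {
        StringJoiner spec = new StringJoiner(",");
        for (Map.Entry<String, Boolean> e : fields.entrySet()) {
            spec.add((e.getValue() ? "+" : "-") + e.getKey());
        }
        return spec.toString();
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps insertion order, so the primary sort field comes first.
        Map<String, Boolean> fields = new LinkedHashMap<>();
        fields.put("price", true);
        fields.put("date", false);
        System.out.println(build(fields)); // prints "+price,-date"
    }
}
```

The resulting string would be passed as the second argument of search(query, sortOrder).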

search

public ResultSet search(java.lang.String query,
                        java.lang.String sortOrder,
                        int defaultOperator,
                        int grammar)
                 throws SearchEngineException
Runs a query against the index, returning a set of results.

Specified by:
search in interface SearchEngine
Specified by:
search in interface Searcher
Parameters:
query - The query to run, in our query syntax.
sortOrder - How the results should be sorted. This is a set of comma-separated field names, each preceded by a + (for increasing order) or by a - (for decreasing order).
defaultOperator - specifies the default operator to use when no other operator is provided between terms in the query. Valid values are defined in the Searcher interface.
grammar - specifies the grammar to use to parse the query. Valid values are defined in the Searcher interface.
Returns:
An instance of ResultSet containing the results of the query.
Throws:
SearchEngineException - If there is any error during the search.

search

public ResultSet search(Element el)
                 throws SearchEngineException
Description copied from interface: SearchEngine
Runs a query against the index, returning a set of results.

Specified by:
search in interface SearchEngine
Parameters:
el - the query, expressed using the programmatic query API
Returns:
the set of documents that match the query
Throws:
SearchEngineException - if there are any errors evaluating the query

search

public ResultSet search(Element el,
                        java.lang.String sortOrder)
                 throws SearchEngineException
Description copied from interface: SearchEngine
Runs a query against the index, returning a set of results.

Specified by:
search in interface SearchEngine
Parameters:
el - the query, expressed using the programmatic query API
sortOrder - How the results should be sorted. This is a set of comma-separated field names, each preceded by a + (for increasing order) or by a - (for decreasing order).
Returns:
the set of documents that match the query
Throws:
SearchEngineException - if there are any errors evaluating the query

getQueryStats

public QueryStats getQueryStats()
Description copied from interface: SearchEngine
Gets the combined query stats for any queries run by the engine.

Specified by:
getQueryStats in interface SearchEngine
Returns:
the combined query statistics
See Also:
SearchEngine.resetQueryStats()

resetQueryStats

public void resetQueryStats()
Description copied from interface: SearchEngine
Resets the query stats for the engine.

Specified by:
resetQueryStats in interface SearchEngine
See Also:
SearchEngine.getQueryStats()

addQueryStats

public void addQueryStats(QueryStats qs)

getResults

public ResultSet getResults(java.util.Collection<java.lang.String> keys)
Gets a set of results corresponding to the document keys passed in. This is a convenience method to go from document keys to something upon which more complicated computations can be done.

Specified by:
getResults in interface SearchEngine
Parameters:
keys - a collection of document keys for which we want results.
Returns:
a result set that includes the documents whose keys occur in the list. All documents in the set will be assigned a score of 1. Note that documents that have been deleted will not appear in the result set.

getResults

public ResultSet getResults(java.util.Map<java.lang.String,java.lang.Float> keys)
Gets a set of results corresponding to the document keys and scores passed in. This is a convenience method to go from document keys and scores to something upon which more complicated computations can be done.

Parameters:
keys - a map from the document keys for which we want results to the scores associated with those keys.
Returns:
a result set that includes the documents whose keys occur in the map, with the scores that were passed in. Note that documents that have been deleted will not appear in the result set.

anyTerms

public ResultSet anyTerms(java.util.Collection<java.lang.String> terms,
                          java.util.Collection<java.lang.String> fields)
                   throws SearchEngineException
Builds a result set of the documents containing any of the given terms in any of the given fields.

Parameters:
terms - the terms to look for
fields - the fields to look for the terms in
Returns:
the set of documents that contain any of the given terms in any of the given fields.
Throws:
SearchEngineException

allTerms

public ResultSet allTerms(java.util.Collection<java.lang.String> terms,
                          java.util.Collection<java.lang.String> fields)
                   throws SearchEngineException
Builds a result set of the documents containing all of the given terms in any of the given fields.

Parameters:
terms - the terms that we want to find
fields - the fields that we must find the terms in
Throws:
SearchEngineException - if there is an error during the search.
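
The difference between anyTerms and allTerms can be sketched over a toy in-memory index. This is not how the engine evaluates these queries. The class TermMatchSketch, its nested-map index and the sample documents are all hypothetical, and the sketch assumes that a term counts as present when it occurs in at least one of the named fields, so the terms required by allTerms may be spread across different fields.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical in-memory model, not Minion code: document key -> field name -> terms. */
class TermMatchSketch {

    /** Gathers the terms of one document that occur in any of the named fields. */
    static Set<String> termsIn(Map<String, Set<String>> doc, Collection<String> fields) {
        Set<String> found = new HashSet<>();
        for (String field : fields) {
            Set<String> fieldTerms = doc.get(field);
            if (fieldTerms != null) {
                found.addAll(fieldTerms);
            }
        }
        return found;
    }

    /** Keys of documents containing at least one of the terms in the named fields. */
    static Set<String> anyTerms(Map<String, Map<String, Set<String>>> index,
                                Collection<String> terms, Collection<String> fields) {
        Set<String> keys = new TreeSet<>();
        for (Map.Entry<String, Map<String, Set<String>>> e : index.entrySet()) {
            Set<String> present = termsIn(e.getValue(), fields);
            for (String term : terms) {
                if (present.contains(term)) {
                    keys.add(e.getKey());
                    break;
                }
            }
        }
        return keys;
    }

    /** Keys of documents containing every term somewhere in the named fields (assumed semantics). */
    static Set<String> allTerms(Map<String, Map<String, Set<String>>> index,
                                Collection<String> terms, Collection<String> fields) {
        Set<String> keys = new TreeSet<>();
        for (Map.Entry<String, Map<String, Set<String>>> e : index.entrySet()) {
            if (termsIn(e.getValue(), fields).containsAll(terms)) {
                keys.add(e.getKey());
            }
        }
        return keys;
    }

    /** Two invented documents used by main and by tests. */
    static Map<String, Map<String, Set<String>>> sampleIndex() {
        Map<String, Map<String, Set<String>>> index = new HashMap<>();
        Map<String, Set<String>> d1 = new HashMap<>();
        d1.put("title", new HashSet<>(Arrays.asList("search", "engine")));
        d1.put("body", new HashSet<>(Arrays.asList("index")));
        index.put("d1", d1);
        Map<String, Set<String>> d2 = new HashMap<>();
        d2.put("title", new HashSet<>(Arrays.asList("search")));
        d2.put("body", new HashSet<>(Arrays.asList("query")));
        index.put("d2", d2);
        return index;
    }

    public static void main(String[] args) {
        Map<String, Map<String, Set<String>>> index = sampleIndex();
        System.out.println(anyTerms(index, Arrays.asList("engine", "query"), Arrays.asList("title", "body")));
        System.out.println(allTerms(index, Arrays.asList("search", "index"), Arrays.asList("title", "body")));
    }
}
```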

getMatching

public java.util.SortedSet<FieldValue> getMatching(java.lang.String field,
                                                   java.lang.String pattern)
Gets the values for the given field that match the given pattern.

Specified by:
getMatching in interface SearchEngine
Parameters:
field - the saved, string field against whose values we will match. If the named field is not saved or is not a string field, then the empty set will be returned.
pattern - the pattern for which we'll find matching field values.
Returns:
a sorted set of field values. This set will be ordered by the proportion of the field value that is covered by the given pattern.

getFieldIterator

public java.util.Iterator getFieldIterator(java.lang.String field)
Gets an iterator for all the values in a field. The values are returned by the iterator in the order defined by the field type.

Specified by:
getFieldIterator in interface SearchEngine
Parameters:
field - The name of the field whose values we need an iterator for.
Returns:
An iterator for the given field. If the field is not a saved field, then an iterator that will return no values will be returned.
See Also:
FieldInfo.getType()

getAllFieldValues

public java.util.List getAllFieldValues(java.lang.String field,
                                        java.lang.String key)
Gets all of the field values associated with a given field in a given document.

Specified by:
getAllFieldValues in interface SearchEngine
Parameters:
field - The name of the field for which we want the values.
key - The key of the document whose values we want.
Returns:
A List containing values of the appropriate type. If the named field is not a saved field, or if the given document key is not in the index, then an empty list is returned.

getTopFieldValues

public java.util.List<FieldFrequency> getTopFieldValues(java.lang.String field,
                                                        int n,
                                                        boolean ignoreCase)
Gets a list of the top n most frequent field values for a given named field. If n is < 1, all field values are returned, in order of their frequency from most to least frequent.

Specified by:
getTopFieldValues in interface SearchEngine
Parameters:
field - the name of the field to rank
n - the number of field values to return
Returns:
a List containing field values of the appropriate type for the field, ordered by frequency

getSimilarClassifiers

public java.util.List<FieldValue> getSimilarClassifiers(java.lang.String cname,
                                                        int n)

getSimilarClassifierTerms

public java.util.List<WeightedFeature> getSimilarClassifierTerms(java.lang.String cname1,
                                                                 java.lang.String cname2,
                                                                 int n)

getFieldValue

public java.lang.Object getFieldValue(java.lang.String field,
                                      java.lang.String key)
Gets a single field value associated with a given field in a given document.

Specified by:
getFieldValue in interface SearchEngine
Parameters:
field - The name of the field for which we want the values.
key - The key of the document whose values we want.
Returns:
An Object of the appropriate type for the named field. If the named field is not a saved field, or if the given document key is not in the index, then null is returned.

Note that if there are multiple values for the given field, there is no guarantee which of the values will be returned by this method.

See Also:
getAllFieldValues(java.lang.String, java.lang.String)

getFieldNames

public java.util.Collection getFieldNames()
Gets the names of all the fields known in the index.

Returns:
a collection of String

getDocumentVector

public DocumentVector getDocumentVector(java.lang.String key)
Gets a document vector for the given key.

Specified by:
getDocumentVector in interface SearchEngine
Parameters:
key - The key for the document whose vector we are to retrieve.
Returns:
An instance of DocumentVector containing the vector for this document.
See Also:
DocumentVector

getDocumentVector

public DocumentVector getDocumentVector(java.lang.String key,
                                        java.lang.String field)
Gets a document vector for the given key.

Specified by:
getDocumentVector in interface SearchEngine
Parameters:
key - The key for the document whose vector we are to retrieve.
field - the field for which we want a document vector. If this parameter is null, then a vector containing the terms from all vectored fields in the document is returned. If this value is the empty string, then a vector for the contents of the document that are not in any field is returned. If this value is the name of a field that was not vectored during indexing, an empty vector will be returned.
Returns:
An instance of DocumentVector containing the vector for this document.
See Also:
DocumentVector

getDocumentVector

public DocumentVector getDocumentVector(java.lang.String key,
                                        WeightedField[] fields)
Description copied from interface: SearchEngine
Gets a composite document vector for the given linear combination of vectored fields for the given key.

Specified by:
getDocumentVector in interface SearchEngine
Parameters:
key - the key of the document whose vector we will return
fields - the fields from which the document vector will be composed.
Returns:
the vector for that document, or null if that key does not appear in this index.
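
A linear combination of vectored fields can be pictured as a weighted sum of per-field term vectors. The following is a minimal sketch under stated assumptions, not the library's DocumentVector or WeightedField code: the class CompositeVectorSketch is hypothetical, vectors are modelled as maps from term to weight, and fields without a vector are simply skipped.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch, not Minion code: combines per-field term vectors using field weights. */
class CompositeVectorSketch {

    /**
     * Computes the sum over fields of (field weight * term weight) for every term.
     * Fields that have no vector for this document contribute nothing (an assumption).
     */
    static Map<String, Double> combine(Map<String, Map<String, Double>> fieldVectors,
                                       Map<String, Double> fieldWeights) {
        Map<String, Double> composite = new TreeMap<>();
        for (Map.Entry<String, Double> fw : fieldWeights.entrySet()) {
            Map<String, Double> vector = fieldVectors.get(fw.getKey());
            if (vector == null) {
                continue;
            }
            for (Map.Entry<String, Double> term : vector.entrySet()) {
                composite.merge(term.getKey(), fw.getValue() * term.getValue(), Double::sum);
            }
        }
        return composite;
    }

    public static void main(String[] args) {
        Map<String, Double> title = new HashMap<>();
        title.put("search", 1.0);
        Map<String, Double> body = new HashMap<>();
        body.put("search", 0.5);
        body.put("index", 2.0);

        Map<String, Map<String, Double>> fieldVectors = new HashMap<>();
        fieldVectors.put("title", title);
        fieldVectors.put("body", body);

        Map<String, Double> weights = new HashMap<>();
        weights.put("title", 0.75);
        weights.put("body", 0.25);

        System.out.println(combine(fieldVectors, weights));
    }
}
```

With the invented weights above, a term that appears in the title contributes three times as much per unit of term weight as the same term in the body.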

getDocumentVector

public DocumentVector getDocumentVector(Document doc,
                                        java.lang.String field)
                                 throws SearchEngineException
Description copied from interface: SearchEngine
Creates a document vector for the given document as though it occurred in the index.

Specified by:
getDocumentVector in interface SearchEngine
Parameters:
doc - a document for which we want a document vector. This document may be in the index or may be generated via the SearchEngine.createDocument(java.lang.String) method. The document will be processed as though it were being indexed in order to extract the appropriate document vector, but the data resulting from this processing will not be added to the index.
field - the field for which we want a document vector. If this parameter is null, then a vector containing the terms from all vectored fields in the document is returned. If this value is the empty string, then a vector for the contents of the document that are not in any field is returned. If this value is the name of a field that was not vectored during indexing, an empty vector will be returned.
Returns:
the vector for the given document, taking into account the restrictions in the field parameter
Throws:
SearchEngineException

getDocumentVector

public DocumentVector getDocumentVector(Document doc,
                                        WeightedField[] fields)
                                 throws SearchEngineException
Description copied from interface: SearchEngine
Creates a composite document vector for the given document as though it occurred in the index.

Specified by:
getDocumentVector in interface SearchEngine
Parameters:
doc - a document for which we want a document vector. This document may be in the index or may be generated via the SearchEngine.createDocument(java.lang.String) method. The document will be processed as though it were being indexed in order to extract the appropriate document vector, but the data resulting from this processing will not be added to the index.
fields - the fields for which we want a document vector.
Throws:
SearchEngineException

getDocumentTerm

public DocKeyEntry getDocumentTerm(java.lang.String key)

getSimilar

public ResultSet getSimilar(java.lang.String key,
                            java.lang.String name)
Gets a set of results ordered by similarity to the given document, calculated by computing the Euclidean distance based on the feature vector stored in the given field.

Specified by:
getSimilar in interface SearchEngine
Parameters:
key - the key of the document to which we'll compute similarity.
name - the name of the field containing the feature vectors that we'll use in the similarity computation.
Returns:
a result set containing the distance between the given document and all of the documents. The scores assigned to the documents are the distance scores, so the returned set is sorted in increasing order of document score. It is up to the application to handle the scores in whatever way it deems appropriate.

getDistance

public double getDistance(java.lang.String k1,
                          java.lang.String k2,
                          java.lang.String name)
Gets the distance between two documents, based on the values stored in a given feature vector saved field.

Specified by:
getDistance in interface SearchEngine
Parameters:
k1 - the first key
k2 - the second key
name - the name of the feature vector field for which we want the distance
Returns:
the Euclidean distance between the two documents' feature vectors. If the field value is not defined for either of the two documents, Double.POSITIVE_INFINITY is returned.
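
The distance computation described above can be sketched with plain arrays. This is an illustration of Euclidean distance with the documented Double.POSITIVE_INFINITY fallback, not the engine's code: the class DistanceSketch is hypothetical, a missing field value is modelled as a null array, and throwing on mismatched lengths is an assumption.

```java
/** Hypothetical sketch, not Minion code: Euclidean distance between two feature vectors. */
class DistanceSketch {

    /**
     * Returns the Euclidean distance between a and b, or Double.POSITIVE_INFINITY
     * when either vector is undefined (modelled here as null).
     */
    static double distance(double[] a, double[] b) {
        if (a == null || b == null) {
            return Double.POSITIVE_INFINITY;
        }
        // Assumption: vectors of different lengths are a caller error.
        if (a.length != b.length) {
            throw new IllegalArgumentException("Vectors must have the same length");
        }
        double sumOfSquares = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sumOfSquares += diff * diff;
        }
        return Math.sqrt(sumOfSquares);
    }

    public static void main(String[] args) {
        System.out.println(distance(new double[] {0.0, 0.0}, new double[] {3.0, 4.0}));
        System.out.println(distance(null, new double[] {1.0}));
    }
}
```

Because smaller distances mean more similar documents, a result set scored this way is naturally read in increasing order of score.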

purge

public void purge()
Deletes all of the data in the index.

Specified by:
purge in interface SearchEngine

merge

public boolean merge()
Performs a merge in the index, if one is necessary. Returns control to the caller when the merge is completed. If you do not set the asyncMerges property to true, you will need to call this method periodically to cause merges to happen. If you do not, you may run out of file handles, leading to exceptions.

Specified by:
merge in interface SearchEngine
Returns:
true if a merge was performed, false otherwise.
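
One way to call merge periodically is to count indexed documents and trigger a merge every so many. The following is a sketch only: the class PeriodicMergeSketch and the interval are hypothetical, and the merge action is passed as a BooleanSupplier so the example runs without an engine. Since merge() takes no arguments and returns a boolean, a method reference to it on an engine instance should fit that parameter.

```java
import java.util.function.BooleanSupplier;

/** Hypothetical helper, not part of Minion: requests a merge after every N indexed documents. */
class PeriodicMergeSketch {

    private final BooleanSupplier mergeAction;
    private final int interval;
    private int sinceLastAttempt;
    private int mergesPerformed;

    PeriodicMergeSketch(BooleanSupplier mergeAction, int interval) {
        if (interval < 1) {
            throw new IllegalArgumentException("interval must be at least 1");
        }
        this.mergeAction = mergeAction;
        this.interval = interval;
    }

    /** Call once per indexed document; every interval-th call attempts a merge. */
    void documentIndexed() {
        sinceLastAttempt++;
        if (sinceLastAttempt >= interval) {
            sinceLastAttempt = 0;
            // The action reports whether a merge actually happened.
            if (mergeAction.getAsBoolean()) {
                mergesPerformed++;
            }
        }
    }

    int mergesPerformed() {
        return mergesPerformed;
    }

    public static void main(String[] args) {
        // Stand-in action that always reports a merge; a real caller would pass the engine's merge.
        PeriodicMergeSketch merger = new PeriodicMergeSketch(() -> true, 1000);
        for (int i = 0; i < 2500; i++) {
            merger.documentIndexed();
        }
        System.out.println("merges performed: " + merger.mergesPerformed());
    }
}
```

A suitable interval depends on partition sizes and file handle limits, which this sketch does not try to estimate.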

optimize

public void optimize()
              throws SearchEngineException
Merges all of the partitions in the index into a single partition.

Specified by:
optimize in interface SearchEngine
Throws:
SearchEngineException - If there is any error during the merge.

recover

public void recover()
             throws SearchEngineException
Attempts to recover the index after an unruly shutdown. Makes sure that lock files are removed.

Specified by:
recover in interface SearchEngine
Throws:
SearchEngineException - If there is any error during the recovery.

getDocumentIterator

public java.util.Iterator<Document> getDocumentIterator()
Gets an iterator for all of the non-deleted documents in the collection.

Specified by:
getDocumentIterator in interface SearchEngine
Returns:
An iterator over all of the non-deleted documents in the index. The documents will be returned in document ID order.

close

public void close()
           throws SearchEngineException
Closes the engine. If you wish to reuse a closed engine, you must use the constructor to get a new engine!

Specified by:
close in interface SearchEngine
Throws:
SearchEngineException - If there is any error closing the engine.

getName

public java.lang.String getName()
Gets the name of this engine, if one has been assigned by the application.

Specified by:
getName in interface SearchEngine
Returns:
The name of the engine assigned by the application, or null if none has been assigned.

getNDocs

public int getNDocs()
Gets the number of documents that the index contains.

Specified by:
getNDocs in interface SearchEngine
Returns:
The number of documents in the index. This number does not include documents that have been deleted but whose data has not been garbage collected.

getQueryConfig

public QueryConfig getQueryConfig()
Gets the query configuration being used by this search engine.

Specified by:
getQueryConfig in interface SearchEngine
Returns:
The current query configuration in use by this engine.

getSimpleIndexer

public SimpleIndexer getSimpleIndexer()
Gets a simple indexer that can be used for simple indexing.

Specified by:
getSimpleIndexer in interface SearchEngine
Returns:
a simple indexer that will index documents into this engine.

getHLPipeline

public HLPipeline getHLPipeline()
Gets a pipeline that can be used for highlighting.

Specified by:
getHLPipeline in interface SearchEngine
Returns:
An instance of HLPipeline that can be used to highlight passages in documents returned by a search.

toString

public java.lang.String toString()
Gets a string description of the search engine.

Overrides:
toString in class java.lang.Object
Returns:
a string description of the search engine.

getPM

public PartitionManager getPM()
Gets the partition manager for this search engine. This is for testing purposes only and not for general consumption.

Specified by:
getPM in interface SearchEngine
Returns:
The partition manager for this search engine.

getManager

public PartitionManager getManager()
Gets the partition manager associated with this search engine.

Specified by:
getManager in interface SearchEngine
Returns:
The partition manager associated with this search engine.

getClassifierManager

public ClassifierManager getClassifierManager()
Gets the classifier manager for this search engine.

Returns:
the classifier manager

getClusterManager

public ClusterManager getClusterManager()
Gets the cluster manager for this search engine.

Returns:
the cluster manager

flushClassifiers

public void flushClassifiers()
                      throws SearchEngineException
Dumps all the classifiers that have been trained since the last dump, or since the search engine started. That is, all the new classifiers that are currently only in memory will be written out to disk.

Specified by:
flushClassifiers in interface SearchEngine
Throws:
SearchEngineException - if there is any error dumping the classifiers.

trainClass

public void trainClass(ResultSet results,
                       java.lang.String className,
                       java.lang.String fieldName)
                throws SearchEngineException
Description copied from interface: Classifier
Generates a classifier based on the documents in the provided result set. If the name provided is an existing class, then the existing classifier will be replaced. This method does not affect any documents that have already been indexed.

Specified by:
trainClass in interface Classifier
Parameters:
results - the set of documents to use for training the classifier
className - the name of the class to create or replace
fieldName - the name of the field where the results of the classifier should be stored.
Throws:
SearchEngineException - If there is any error training the classifier

trainClass

public void trainClass(ResultSet results,
                       java.lang.String className,
                       java.lang.String fieldName,
                       java.lang.String fromField)
                throws SearchEngineException
Description copied from interface: Classifier
Generates a classifier based on the documents in the provided result set. If the name provided is an existing class, then the existing classifier will be replaced. This method does not affect any documents that have already been indexed.

Specified by:
trainClass in interface Classifier
Parameters:
results - the set of documents to use for training the classifier
className - the name of the class to create or replace
fieldName - the name of the field where the results of the classifier should be stored.
fromField - the vectored field from which we should build the classifiers. If this parameter is null then data from all indexed fields will be used. If this parameter is the empty string, then data from the "body" field will be used.
Throws:
SearchEngineException - If there is any error training the classifier

trainClass

public void trainClass(ResultSet results,
                       java.lang.String className,
                       java.lang.String fieldName,
                       Progress p)
                throws SearchEngineException
Description copied from interface: Classifier
Generates a classifier based on the documents in the provided result set. If the name provided is an existing class, then the existing classifier will be replaced. This method does not affect any documents that have already been indexed.

Specified by:
trainClass in interface Classifier
Parameters:
results - the set of documents to use for training the classifier
className - the name of the class to create or replace
fieldName - the name of the field where the results of the classifier should be stored.
p - a progress monitor that will be notified as training proceeds
Throws:
SearchEngineException - If there is any error training the classifier

trainClass

public void trainClass(ResultSet results,
                       java.lang.String className,
                       java.lang.String fieldName,
                       java.lang.String fromField,
                       Progress progress)
                throws SearchEngineException
Generates a classifier based on the documents in the provided result set. If the name provided is an existing class, then the existing classifier will be replaced. This method does not affect any documents that have already been indexed.

Specified by:
trainClass in interface Classifier
Parameters:
results - the set of documents to use for training the classifier
className - the name of the class to create or replace
fieldName - the name of the field where the results of the classifier should be stored.
fromField - the vectored field from which we should build the classifiers. If this parameter is null then data from all indexed fields will be used. If this parameter is the empty string, then data from the "body" field will be used.
progress - where to send progress events
Throws:
SearchEngineException - If there is any error training the classifier

classify

public void classify(java.lang.String[] docKeys,
                     java.lang.String[] classNames)
              throws SearchEngineException
Creates a manual assignment of a set of documents to a set of classes. All of the documents will be assigned to all of the classes. Manual assignments are stored independently of the automatic assignment the engine performs while indexing. The documents will also automatically be indexed and classified.

Specified by:
classify in interface Classifier
Parameters:
docKeys - the keys of the documents to classify
classNames - the classes to assign the documents to
Throws:
SearchEngineException - if there is any error running the classifiers

reclassifyIndex

public void reclassifyIndex(java.lang.String className)
                     throws SearchEngineException
Causes the engine to reclassify all documents against the classifier for the given class name. Upon completion of the classification, a short pause will occur while switching from the old set of classes to the new set (the implementation of this will determine exactly what the characteristics of the switch are). This method is only needed when there are existing indexed documents and there has been a change to the set of classifiers. Since reclassifying will likely be a lengthy process, it is never implicit in any of the other methods. (Side note: Should this be a blocking call? If not, should there be a simple event/callback mechanism to notify a user of progress?)

Specified by:
reclassifyIndex in interface Classifier
Parameters:
className - the class to reclassify all documents against
Throws:
SearchEngineException - If there is any error training the classifiers

getTrainingDocuments

public ResultSet getTrainingDocuments(java.lang.String className)
                               throws SearchEngineException
Returns the set of documents that was used to train the classifier for the class with the provided class name.

Specified by:
getTrainingDocuments in interface Classifier
Parameters:
className - the name of a class
Returns:
the set of documents that defines the named class
Throws:
SearchEngineException - If there is any error retrieving the training documents

getClasses

public java.lang.String[] getClasses()
Returns the names of the classes for which classifiers are defined. If no classes are defined, an empty array is returned.

Specified by:
getClasses in interface Classifier
Returns:
an array of class names

getClassifier

public ClassifierModel getClassifier(java.lang.String name)

getIndexConfig

public IndexConfig getIndexConfig()
Gets the index configuration in use by this search engine.

Specified by:
getIndexConfig in interface SearchEngine
Returns:
The index configuration in use by this search engine.

getQC

public QueryConfig getQC()

getMetaDataStore

public MetaDataStore getMetaDataStore()
                               throws SearchEngineException
Gets the MetaDataStore for this index. This is a singleton that stores index-related global variables.

Specified by:
getMetaDataStore in interface SearchEngine
Returns:
the MetaDataStore instance
Throws:
SearchEngineException - if there is any error getting the metadata store

checkLowMemory

public boolean checkLowMemory()
Determines if available memory is low. Currently, this is defined by minMemoryPercent.

Returns:
true if memory is low

toMB

protected double toMB(long x)

getConfigurationManager

public com.sun.labs.util.props.ConfigurationManager getConfigurationManager()

newProperties

public void newProperties(com.sun.labs.util.props.PropertySheet ps)
                   throws com.sun.labs.util.props.PropertyException
Specified by:
newProperties in interface com.sun.labs.util.props.Configurable
Throws:
com.sun.labs.util.props.PropertyException

getLongIndexingRun

public boolean getLongIndexingRun()
Indicates whether this search engine is being used for a long indexing run.

Returns:
true if this engine is being used for a long indexing run, in which case term statistics dictionaries and document vector lengths will not be calculated until the engine is shut down.

setLongIndexingRun

public void setLongIndexingRun(boolean longIndexingRun)
Sets the indicator that this is a long indexing run, in which case term statistics dictionaries and document vector lengths will not be calculated until the engine is shut down. For best results, this should be done before any indexing begins.

Specified by:
setLongIndexingRun in interface SearchEngine
Parameters:
longIndexingRun - true if this is a long indexing run

setQueryConfig

public void setQueryConfig(QueryConfig queryConfig)
Description copied from interface: SearchEngine
Sets the query configuration to use for subsequent queries.

Specified by:
setQueryConfig in interface SearchEngine
Parameters:
queryConfig - a set of properties describing the query configuration.

export

public void export(java.io.PrintWriter o)
            throws java.io.IOException
Description copied from interface: SearchEngine
Outputs an XML representation of the search index including all saved and vectored fields.

Specified by:
export in interface SearchEngine
Parameters:
o - a print writer to which the index will be exported.
Throws:
java.io.IOException - if there is any error writing the data

getProfilers

public java.util.List getProfilers()

setProfilers

public void setProfilers(java.util.List profilers)