Compass Core provides an abstraction layer on top of the wonderful Lucene Search Engine. Compass also provides several additional features on top of Lucene, like two phase transaction management, fast updates, and optimizers. To explain how Compass works with the Search Engine, we first need to understand the Search Engine domain model.
Resource represents a collection of properties. You can think about it as a virtual document - a chunk of data, such as a web page, an e-mail message, or a serialization of the Author object. A Resource is always associated with a single Alias, and several Resources can have the same Alias. The alias acts as the connection between a Resource and its mapping definitions (OSEM/XSEM/RSEM). A Property is just a placeholder for a name and a value (both strings). A Property within a Resource represents some kind of meta-data that is associated with the Resource, like the author name.
Every Resource is associated with one or more id properties. They are required for Compass to manage Resource loading based on ids and Resource updates (a well known difficulty when using Lucene directly). Id properties are defined either explicitly in RSEM definitions or implicitly in OSEM/XSEM definitions.
For Lucene users, Compass Resource maps to Lucene Document and Compass Property maps to Lucene Field.
When working with RSEM, resources act as your prime data model. They are used to construct searchable content, as well as to manipulate it. When performing a search, resources can be used to display the search results.
Another important place where resources can be used, which is often ignored, is with OSEM/XSEM. When manipulating search content through the application domain model (in the case of OSEM), or through xml data structures (in the case of XSEM), resources are rarely used. They can, however, be used when performing search operations. Based on your mapping definition, the semantic model can be accessed in a uniform way through resources and properties.
Let's simplify this statement by using an example. If our application has two object types, Recipe and Ingredient, we can map both recipe title and ingredient title into the same semantic meta-data name, title (Resource Property name). When searching, this allows us to display the search results (hits) on the Resource level only, presenting the value of the property title from the list of resources returned.
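To make the Recipe and Ingredient example more concrete, here is a minimal sketch in plain Java. It does not use the Compass API: SimpleResource is a hypothetical stand-in for a Resource (an alias plus name/value properties), and the sample titles are made up. It only illustrates how hits of different types can be displayed through one shared property name.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a Resource: an alias plus string properties.
class SimpleResource {
    final String alias;
    final Map<String, String> properties = new LinkedHashMap<String, String>();

    SimpleResource(String alias) {
        this.alias = alias;
    }

    // Adds a property and returns this resource so calls can be chained.
    SimpleResource with(String name, String value) {
        properties.put(name, value);
        return this;
    }
}

class UniformTitleSketch {
    // Collects the shared "title" property without caring about the hit's type.
    static List<String> titles(List<SimpleResource> hits) {
        List<String> result = new ArrayList<String>();
        for (SimpleResource hit : hits) {
            result.add(hit.properties.get("title"));
        }
        return result;
    }

    public static void main(String[] args) {
        List<SimpleResource> hits = Arrays.asList(
                new SimpleResource("recipe").with("title", "Tomato Soup"),
                new SimpleResource("ingredient").with("title", "Tomato"));
        System.out.println(titles(hits));
    }
}
```

The display code never inspects the alias, which is the point of mapping both types to the same meta-data name.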
Analyzers are components that pre-process input text. They are also used when searching (the search string has to be processed the same way that the indexed text was processed). Therefore, it is usually important to use the same Analyzer for both indexing and searching.
Analyzer is a Lucene class (fully qualified as org.apache.lucene.analysis.Analyzer). Lucene core itself comes with several Analyzers, and you can configure Compass to work with any of them. If we take the following sentence: "The quick brown fox jumped over the lazy dogs", we can see how the different Analyzers handle it:
whitespace (org.apache.lucene.analysis.WhitespaceAnalyzer): [The] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dogs]
simple (org.apache.lucene.analysis.SimpleAnalyzer): [the] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dogs]
stop (org.apache.lucene.analysis.StopAnalyzer): [quick] [brown] [fox] [jumped] [over] [lazy] [dogs]
standard (org.apache.lucene.analysis.standard.StandardAnalyzer): [quick] [brown] [fox] [jumped] [over] [lazy] [dogs]
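The differences between the first three outputs can be approximated with a few lines of plain Java. This is only a sketch of the tokenizing ideas (split on whitespace, split on non-letters and lower-case, then drop stop words). It is not Lucene's implementation, and the stop word list below is a small assumed subset chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class AnalyzerSketch {
    // Assumed, tiny stop word list for illustration only.
    static final Set<String> STOP_WORDS =
            new HashSet<String>(Arrays.asList("a", "an", "and", "of", "the", "to"));

    // Whitespace style: split on whitespace only and keep the original case.
    static List<String> whitespace(String text) {
        List<String> tokens = new ArrayList<String>();
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // Simple style: split on anything that is not a letter and lower-case each token.
    static List<String> simple(String text) {
        List<String> tokens = new ArrayList<String>();
        for (String token : text.split("[^\\p{L}]+")) {
            if (!token.isEmpty()) {
                tokens.add(token.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    // Stop style: the simple tokens with stop words removed.
    static List<String> stop(String text) {
        List<String> tokens = new ArrayList<String>();
        for (String token : simple(text)) {
            if (!STOP_WORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "The quick brown fox jumped over the lazy dogs";
        System.out.println("whitespace: " + whitespace(text));
        System.out.println("simple:     " + simple(text));
        System.out.println("stop:       " + stop(text));
    }
}
```

Because the search string goes through the same steps, a query for "The" would never match anything indexed with the stop variant, which is why the same analyzer is normally used on both sides.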
Lucene also comes with an extension library, holding many more analyzer implementations (including language specific analyzers). Compass can be configured to work with all of them as well.
A Compass instance acts as a registry of analyzers, with each analyzer bound to a lookup name. Two internal analyzer names within Compass are: default and search. default is the analyzer that is used when no other analyzer is configured (a different analyzer is usually configured in the mapping definition, by referencing a different analyzer lookup name). search is the analyzer used on a search query string when no other analyzer is configured (a different analyzer for a search based on a query string is configured through the query builder API). By default, when nothing is configured, Compass will use the Lucene standard analyzer as the default analyzer.
The following is an example of configuring two analyzers, one that will replace the default analyzer, and another one registered against myAnalyzer (it will probably later be referenced from within the different mapping definitions).
<compass name="default">
    <connection>
        <file path="target/test-index" />
    </connection>
    <searchEngine>
        <analyzer name="default" type="Snowball" snowballType="Lovins">
            <stopWords>
                <stopWord value="no" />
            </stopWords>
        </analyzer>
        <analyzer name="myAnalyzer" type="Standard" />
    </searchEngine>
</compass>
Compass also supports custom implementations of the Lucene Analyzer class (note, the same goal might be achieved by implementing an analyzer filter, described later). If the implementation also implements CompassConfigurable, additional settings (parameters) can be injected into it using the configuration file. Here is an example configuration that registers a custom analyzer implementation that accepts a parameter named threshold:
<compass name="default">
    <connection>
        <file path="target/test-index" />
    </connection>
    <searchEngine>
        <analyzer name="default" type="CustomAnalyzer" analyzerClass="eg.MyAnalyzer">
            <setting name="threshold">5</setting>
        </analyzer>
    </searchEngine>
</compass>
Filters provide simpler support for additional filtering (or enrichment) of analyzed streams, without the hassle of creating your own analyzer. Filters can also be shared across different analyzers, potentially of different analyzer types.
A custom filter implementation needs to implement Compass LuceneAnalyzerTokenFilterProvider, whose single method creates a Lucene TokenFilter. Filters are registered against a name as well, which can then be used in the analyzer configuration to reference them. The next example configures two analyzer filters, which are applied to the default analyzer:
<compass name="default">
    <connection>
        <file path="target/test-index" />
    </connection>
    <searchEngine>
        <analyzer name="default" type="Standard" filters="test1, test2" />
        <analyzerFilter name="test1" type="eg.AnalyzerTokenFilterProvider1">
            <setting name="param1" value="value1" />
        </analyzerFilter>
        <analyzerFilter name="test2" type="eg.AnalyzerTokenFilterProvider2">
            <setting name="paramX" value="valueY" />
        </analyzerFilter>
    </searchEngine>
</compass>
Since synonyms are a common requirement with a search application, Compass comes with a simple synonym analyzer filter: SynonymAnalyzerTokenFilterProvider. The implementation requires as a parameter (setting) an implementation of a SynonymLookupProvider, which can return all the synonyms for a given value. No implementation is provided, though one that goes to a public synonym database, or a file input structure is simple to implement. Here is an example of how to configure it:
<compass name="default">
    <connection>
        <file path="target/test-index" />
    </connection>
    <searchEngine>
        <analyzer name="default" type="Standard" filters="synonymFilter" />
        <analyzerFilter name="synonymFilter" type="synonym">
            <setting name="lookup" value="eg.MySynonymLookupProvider" />
        </analyzerFilter>
    </searchEngine>
</compass>
Note the fact that we did not set the fully qualified class name for the type, and used synonym. This is a simplification that comes with Compass (naturally, you can still use the fully qualified class name of the synonym token filter provider).
It is very important to understand how the Search Engine index is organized so we can then talk about transactions, optimizers, and sub index hashing. The following structure shows the Search Engine Index Structure:

Compass Index Structure
Every sub-index has its own fully functional index structure (which maps to a single Lucene index). The Lucene index part holds a "meta data" file about the index (called segments) and 0 to N segment files. A segment can be a single file (if the compound setting is enabled) or multiple files (if the compound setting is disabled). A segment is close to a fully functional index, and holds the actual inverted index data (see the Lucene documentation for a detailed description of these concepts).
Index partitioning is one of Compass's main features, allowing for a flexible and configurable way to manage complex indexes and performance considerations. The next sections will explain in more detail why this feature is important, especially in terms of transaction management.
Compass Search Engine abstraction provides support for transaction management on top of Lucene. The abstraction supports the common transaction levels, read_committed and serializable, as well as the special batch_insert one. Compass provides two phase commit support for the common transaction levels only.
Compass utilizes Lucene's inter and outer process locking mechanism to establish its transaction locking. Note that transaction locking is on the sub-index level, which means that dirty operations only lock their respective sub-index. So, the more aliases / searchable content map to the same sub-index (the next section, on sub index hashing, explains how to do this), the more aliases / searchable content will be locked when performing dirty operations, yet the faster the searches will be. Lucene uses a special lock file to manage the inter and outer process locking, which can be set in the Compass configuration. You can manage the transaction timeout and polling interval using Compass configuration.
A Compass transaction acquires a lock only when a dirty (i.e. create, save or delete) operation occurs, which makes "read only" transactions as fast as they should and can be. The following configuration file shows how to control the two main settings for locking: the locking timeout (which defaults to 10 seconds) and the locking polling interval (how often Compass checks whether a lock has been released, which defaults to 100 milliseconds):
<compass name="default">
    <connection>
        <file path="target/test-index" />
    </connection>
    <transaction lockTimeout="15" lockPollInterval="200" />
</compass>
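The interplay between the two settings can be pictured as a poll-until-timeout loop. The following is a minimal sketch under stated assumptions: it uses an in-memory AtomicBoolean as a stand-in for the Lucene lock file, and the class and method names are hypothetical. It is not how Compass or Lucene implement locking, only an illustration of what a timeout and a polling interval mean.

```java
import java.util.concurrent.atomic.AtomicBoolean;

class LockPollingSketch {
    // Tries to take the lock, re-checking every pollIntervalMillis until
    // timeoutMillis has elapsed. Returns true if the lock was obtained.
    static boolean acquire(AtomicBoolean lock, long timeoutMillis, long pollIntervalMillis) {
        long start = System.nanoTime();
        long timeoutNanos = timeoutMillis * 1000000L;
        while (true) {
            if (lock.compareAndSet(false, true)) {
                return true; // the lock was free and is now held by the caller
            }
            if (System.nanoTime() - start >= timeoutNanos) {
                return false; // gave up: the timeout elapsed while the lock was held
            }
            try {
                Thread.sleep(pollIntervalMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt for the caller
                return false;
            }
        }
    }

    public static void main(String[] args) {
        AtomicBoolean subIndexLock = new AtomicBoolean(false);
        System.out.println("first attempt:  " + acquire(subIndexLock, 200, 20));
        System.out.println("second attempt: " + acquire(subIndexLock, 200, 20));
    }
}
```

A shorter polling interval notices a released lock sooner at the cost of more checks, while a longer timeout makes a dirty operation wait longer before failing.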
Compass provides support for the read_committed transaction level. When starting a read_committed transaction, no locks are obtained. Read operations will not obtain a lock either. A lock will be obtained only when a dirty operation is performed. The lock is obtained only on the index of the alias / searchable content that is associated with the dirty operation, i.e. the sub-index, and will lock all other aliases / searchable content that map to that sub-index. In Compass, every transaction that performed one or more save or create operations, and committed successfully, creates another segment in the respective index (different from how Lucene manages its index), which helps in implementing quick transaction commits and fast updates, as well as paving the way for two phase commit support (and is the reason behind having optimizers).
The serializable transaction level operates the same as the read_committed transaction level, except that when the transaction is opened/started, a lock is acquired on all the sub-indexes. This causes the transactional operations to be sequential in nature (as well as being a performance killer).
A special transaction level, batch_insert, utilizes the extremely fast batch indexing provided by Lucene. The transaction supports only the create operation, but note that if another Resource with the same alias and ids already exists in the system, you will have two instances of it in the index (in other words, create doesn't delete the old Resource). You can control the batch_insert transaction using several settings which are explained in the Configuration section. An important note is that the transaction cannot be rolled back, since Lucene commits the changes during the batch indexing process, which means that a rollback operation won't roll back the changes. The index is optimized when the transaction is committed, which means that all the segments are merged into one segment, in order to provide fast searching. The transaction is mainly used for background batch indexing.
Searchable content is mapped to the search engine using Compass's different mapping definitions (OSEM/XSEM/RSEM). Compass provides the ability to partition the searchable content into different sub indexes, as shown in the next diagram:

Sub Index Hashing
In the above diagram A, B, C, and D represent aliases, which in turn stand for the mapping definitions of the searchable content. A1, B2, and so on, are actual instances of the mentioned searchable content. The diagram shows the different options for mapping searchable content into different sub indexes.
The simplest way to map an alias (which stands for the mapping definitions of some searchable content) is to map all its searchable content instances into the same sub index. How searchable content maps to the search engine (OSEM/XSEM/RSEM) is defined within the respective mapping definitions. There are two ways to define a constant mapping to a sub index; the first one (which is simpler) is:
<compass-core-mapping>
    <[mapping] alias="test-alias" sub-index="test-subindex">
        <!-- ... -->
    </[mapping]>
</compass-core-mapping>
The mentioned [mapping] that is represented by the alias test-alias will map all its instances to test-subindex. Note, if sub-index is not defined, it will default to the alias value.
Another option, which probably will not be used to define constant sub index hashing but is shown here for completeness, is to specify the constant implementation of SubIndexHash within the mapping definition (explained in detail later in this section):
<compass-core-mapping>
    <[mapping] alias="test-alias">
        <sub-index-hash type="org.compass.core.engine.subindex.ConstantSubIndexHash">
            <setting name="subIndex" value="test-subindex" />
        </sub-index-hash>
        <!-- ... -->
    </[mapping]>
</compass-core-mapping>
Here is an example of how three different aliases: A, B and C can be mapped using constant sub index hashing:

Modulo Sub Index Hashing
Constant sub index hashing maps an alias (and all the searchable instances it represents) into the same sub index. Modulo sub index hashing allows partitioning an alias into several sub indexes. The partitioning is done by hashing the alias value with all the string values of the searchable content ids, and then using the modulo operation against a specified size. It also allows setting a constant prefix for the generated sub index value. This is shown in the following diagram:

Modulo Sub Index Hashing
Here, A1, A2 and A3 represent different instances of alias A (be it a mapped Java class in OSEM, a Resource in RSEM, or an XmlObject in XSEM), each with a single id mapping, with the values 1, 2, and 3. A modulo hashing is configured with a prefix of test and a size of 2. This results in the creation of 2 sub indexes, called test_0 and test_1. Based on the hashing function (the alias String hash code and the different ids string hash code), instances of A will be directed to their respective sub index. Here is how the A alias would be configured:
<compass-core-mapping>
    <[mapping] alias="A">
        <sub-index-hash type="org.compass.core.engine.subindex.ModuloSubIndexHash">
            <setting name="prefix" value="test" />
            <setting name="size" value="2" />
        </sub-index-hash>
        <!-- ... -->
    </[mapping]>
</compass-core-mapping>
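The modulo computation described above can be sketched in a few lines of plain Java. This is an illustration only: the text states that the alias hash code and the ids' string hash codes are combined and taken modulo the size, but the exact way they are combined here (multiplying by 31 and adding) is an assumption, so the actual distribution produced by Compass may differ. The class and method names are hypothetical.

```java
class ModuloHashSketch {
    // Combines the alias hash code with the hash code of each id string,
    // then maps the result onto one of `size` sub indexes named prefix_N.
    static String mapSubIndex(String prefix, int size, String alias, String... ids) {
        int hash = alias.hashCode();
        for (String id : ids) {
            hash = 31 * hash + id.hashCode(); // assumed combination step
        }
        // floorMod keeps the result in [0, size) even when hash is negative
        return prefix + "_" + Math.floorMod(hash, size);
    }

    public static void main(String[] args) {
        for (String id : new String[] {"1", "2", "3"}) {
            System.out.println("A" + id + " -> " + mapSubIndex("test", 2, "A", id));
        }
    }
}
```

Whatever the combination step, the important properties are that the same alias and ids always map to the same sub index, and that only the values prefix_0 to prefix_(size - 1) can be produced.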
Naturally, more than one mapping definition can map to the same sub indexes using the same modulo configuration:

Complex Modulo Sub Index Hashing
ConstantSubIndexHash and ModuloSubIndexHash are implementations of Compass's SubIndexHash interface that come built in with Compass. Naturally, a custom implementation of the SubIndexHash interface can be configured in the mapping definition.
An implementation of SubIndexHash must provide two operations. The first, getSubIndexes, must return all the possible sub indexes the sub index hash implementation can produce. The second, mapSubIndex(String alias, Property[] ids), uses the provided alias and ids in order to compute the given sub index. If the sub index hash implementation also implements the CompassConfigurable interface, different settings can be injected into it. Here is an example of a mapping definition with a custom sub index hash implementation:
<compass-core-mapping>
    <[mapping] alias="A">
        <sub-index-hash type="eg.MySubIndexHash">
            <setting name="param1" value="value1" />
            <setting name="param2" value="value2" />
        </sub-index-hash>
        <!-- ... -->
    </[mapping]>
</compass-core-mapping>
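To show the shape of such an implementation without depending on Compass itself, the sketch below declares a local stand-in interface with the two operations described above. It is not the Compass SubIndexHash interface: ids are plain strings here rather than Property instances, and the partitioning rule (by the first letter of the first id) and the sub index names are purely hypothetical.

```java
// Local stand-in for the two operations described above. This is NOT the
// Compass interface: the real mapSubIndex receives Property[] ids.
interface SubIndexHashSketch {
    String[] getSubIndexes();

    String mapSubIndex(String alias, String[] ids);
}

// Hypothetical rule: partition by the first letter of the first id.
class FirstLetterSubIndexHash implements SubIndexHashSketch {
    private static final String LOW = "names_a_m";
    private static final String HIGH = "names_n_z";

    // Every value mapSubIndex can return must be listed here.
    public String[] getSubIndexes() {
        return new String[] {LOW, HIGH};
    }

    public String mapSubIndex(String alias, String[] ids) {
        if (ids.length == 0 || ids[0].isEmpty()) {
            return HIGH; // fallback for missing or empty ids
        }
        char first = Character.toLowerCase(ids[0].charAt(0));
        return (first >= 'a' && first <= 'm') ? LOW : HIGH;
    }

    public static void main(String[] args) {
        FirstLetterSubIndexHash hash = new FirstLetterSubIndexHash();
        System.out.println(hash.mapSubIndex("A", new String[] {"apple"}));
        System.out.println(hash.mapSubIndex("A", new String[] {"zebra"}));
    }
}
```

The one invariant worth checking in any custom implementation is that mapSubIndex never returns a value missing from getSubIndexes.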
As mentioned in the read_committed section, every dirty transaction that is committed successfully creates another segment in the respective sub index. The more segments the index has, the slower fetch operations become. That's why it is important to keep the index optimized and with a controlled number of segments. We do this by merging small segments into larger segments.
In order to solve the problem, Compass has a SearchEngineOptimizer, which is responsible for keeping the number of segments in check. When Compass is built using CompassConfiguration, the SearchEngineOptimizer is started, and when Compass is closed, the SearchEngineOptimizer is stopped.
The optimization process works on a sub index level, performing the optimization for each one. During the optimization process, optimizers will lock the sub index for dirty operations. This causes a tradeoff between having an optimized index, and spending less time on the optimization process in order to allow for other dirty operations.
Each optimizer in Compass can be wrapped to be executed in a scheduled manner. The default behavior within Compass is to schedule the configured optimizer (unless it is the null optimizer). Here is a sample configuration file that controls the scheduling of an optimizer:
<compass name="default">
    <connection>
        <file path="target/test-index" />
    </connection>
    <searchEngine>
        <optimizer scheduleInterval="90" schedule="true" />
    </searchEngine>
</compass>
The AggressiveOptimizer uses Lucene optimization feature to optimize the index. Lucene optimization merges all the segments into one segment. You can set the limit of the number of segments, after which the index is considered to need optimization (the aggressive optimizer merge factor).
Since this optimizer causes all the segments in the index to be optimized into a single segment, the optimization process might take a long time to happen. This means that for large indexes, the optimizer will block other dirty operations for a long time in order to perform the index optimization. It also means that the index will be fully optimized after it, which means that search operations will execute faster. For most cases, the AdaptiveOptimizer should be the one used.
The AdaptiveOptimizer optimizes the segments while trying to keep the optimization time in check. As an example, when we have a large segment in our index (for example, after we batch indexed the data), and we perform several interactive transactions, the aggressive optimizer will then merge all the segments together, while the adaptive optimizer will only merge the new small segments. You can set the limit of the number of segments, after which the index is considered to need optimization (the adaptive optimizer merge factor).
Compass also comes with a NullOptimizer, which performs no optimizations. It is mainly there for cases where the hosting application has developed its own optimization, which is maintained by means other than the SearchEngineOptimizer. It also makes sense to use it when configuring a Compass instance with a batch_insert transaction. It can also be used when the index was built offline and has been fully optimized, and is later used only for search/read operations.
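The difference between the two strategies can be modelled on a plain list of segment sizes. The sketch below is a simplified model under stated assumptions, not the Compass implementation: segments are listed oldest first, both strategies act only once the segment count exceeds the merge factor, and the adaptive policy shown (leave the oldest segments alone and merge only the newest ones so that merge factor segments remain) is an assumed reading of "merge the new small segments".

```java
import java.util.ArrayList;
import java.util.List;

class OptimizerSketch {
    // Aggressive style: once the count exceeds mergeFactor, merge everything
    // into a single segment.
    static List<Integer> aggressive(List<Integer> segments, int mergeFactor) {
        checkMergeFactor(mergeFactor);
        if (segments.size() <= mergeFactor) {
            return new ArrayList<Integer>(segments); // nothing to do
        }
        List<Integer> result = new ArrayList<Integer>();
        result.add(sum(segments));
        return result;
    }

    // Adaptive style (assumed policy): keep the oldest mergeFactor - 1 segments
    // untouched and merge only the newer ones into one segment.
    static List<Integer> adaptive(List<Integer> segments, int mergeFactor) {
        checkMergeFactor(mergeFactor);
        if (segments.size() <= mergeFactor) {
            return new ArrayList<Integer>(segments); // nothing to do
        }
        int keep = mergeFactor - 1;
        List<Integer> result = new ArrayList<Integer>(segments.subList(0, keep));
        result.add(sum(segments.subList(keep, segments.size())));
        return result;
    }

    private static int sum(List<Integer> sizes) {
        int total = 0;
        for (int size : sizes) {
            total += size;
        }
        return total;
    }

    private static void checkMergeFactor(int mergeFactor) {
        if (mergeFactor < 1) {
            throw new IllegalArgumentException("mergeFactor must be at least 1");
        }
    }

    public static void main(String[] args) {
        // One large batch-indexed segment followed by four small transactional ones.
        List<Integer> segments = new ArrayList<Integer>();
        segments.add(1000);
        for (int i = 0; i < 4; i++) {
            segments.add(1);
        }
        System.out.println("aggressive: " + aggressive(segments, 2));
        System.out.println("adaptive:   " + adaptive(segments, 2));
    }
}
```

In this model the aggressive strategy rewrites the large segment on every optimization, while the adaptive one only touches the small ones, which is where the difference in locking time comes from.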
Compass provides a helpful abstraction layer on top of Lucene, but it also acknowledges that there are cases where direct Lucene access, both in terms of API and constructs, is required. Most of the direct Lucene access is done using the LuceneHelper class. The next sections describe its main features. For a complete list, please consult its javadoc.
Compass wraps some of Lucene's classes, like Query and Filter. There are cases where a Compass wrapper will need to be created out of an actual Lucene class, or an actual Lucene class needs to be accessed out of a wrapper.
Here is an example of wrapping a custom implementation of a Lucene Query with a CompassQuery:
CompassSession session = // obtain a compass session
Query myQ = new MyQuery(param1, param2);
CompassQuery myCQ = LuceneHelper.createCompassQuery(session, myQ);
CompassHits hits = myCQ.hits();
The next sample shows how to get a Lucene Explanation, which is useful for understanding how a query works and executes:
CompassSession session = // obtain a compass session
CompassHits hits = session.find("london");
for (int i = 0; i < hits.length(); i++) {
    Explanation exp = LuceneHelper.getLuceneSearchEngineHits(hits).explain(i);
    System.out.println(exp.toString());
}
When performing read operations against the index, most of the time the Compass abstraction layer is enough. Sometimes, direct access to Lucene's own IndexReader and Searcher is required. Here is an example of using the reader to get all the available terms for the category property name (note, this is a prime candidate for future inclusion as part of the Compass API):
CompassSession session = // obtain a compass session
LuceneSearchEngineInternalSearch internalSearch = LuceneHelper.getLuceneInternalSearch(session);
// position the enumeration at the first term of the "category" field
TermEnum termEnum = internalSearch.getReader().terms(new Term("category", ""));
try {
    ArrayList tempList = new ArrayList();
    // term() returns null once the enumeration is exhausted
    while (termEnum.term() != null && "category".equals(termEnum.term().field())) {
        tempList.add(termEnum.term().text());
        if (!termEnum.next()) {
            break;
        }
    }
} finally {
    termEnum.close();
}