Manitou-Mail v0.9.7 & up

Implementation of the inverted index

Indexer basics

The full-text search functionality uses an inverted index of words. The tables involved are:

words: the dictionary, containing a unique entry per word with a unique word_id.
inverted_word_index: a unique entry per word, with a bit vector field representing the information about which messages contain that word.
non_indexable_words: an optional list of words that should be ignored by the indexer ('stopwords').

Model

Each word is associated to a large vector of bits, each bit corresponding to a different mail message. The Nth bit element of the vector is set to 1 if the Nth mail of the corpus contains that word, or 0 if it doesn't. Also if a word vector's size if less that N, it means that the Nth document doesn't contain the corresponding word.

Example: if within the corpus, the word 'sampleword' appears 3 times, in mail_id=4, mail_id=15 and mail_id=20, the vector for 'sampleword' is such that:

v[1..3] = 0
v[4] = 1
v[5..14] = 0
v[15] = 1
v[16..19] = 0
v[20] = 1

which can be represented as the binary number (with bits numbered from left to right) :

00010000000000100001

The position(s) of the word inside the message text, and the number of its occurrences are NOT part of the model.

Storage

The vectors are partitioned in order to be split into several rows in the inverted_word_index table. The partition key is called 'part_no', and its size is configurable, with a default value of 16384.

For instance, a word vector of 20000 bits will be split into two database rows, the first with part_no=0 and a vector of 16384 bits, the second with part_no=1 and a vector of 20000-16384=3716 bits.

The reason why the vectors are split is to limit the quantity of data that has to be updated each time a document is added or removed. Adding a document consist of adding one bit to the vector corresponding to each different word of the document. These vectors are updated in the database, they're rewritten entirely and generate transactional activity.

For each word contained in the document, the update will hit at most N bits. In the database, a column that gets updated is replaced entirely, even if only one bit is flipped. Replacing vectors of several megabytes would incur massive transactional activity. Vectors partioning reduces the update to one partition only.

A word will be ignored by the indexer if it matches one of these criteria:

it's shorter than 3 characters or longer than 50.
it is composed exclusively of digits.
it appears in the non_indexable_words table.

Also, indexing can be globally disabled using the index_words configuration parameter of manitou-mdx.

Stopwords

Some words are very common and have no meaning when isolated, and therefore are not helpful in a search by keywords. They're commonly refered to as "stopwords".

In addition to words common in a language, mail messages also tend to contain specific words that can be avoided in the index. For example: 'wrote', 'http', 'www', 'com', 'net' are likely candidates whatever the language because of citations and references to URLs, but there's no general case, each corpus having its own vocabulary and top-occuring words.

Words that occur frequently and that are useless in search should be excluded from the index in order to spare disk space.

Vacuuming

Each time a word vector is replaced, a new version of the entire row is created in the table and the old one becomes invisible to new transactions. However, the space used is not freed until a VACUUM occurs.

VACUUM on the inverted_word_index table should be invoked frequently to avoid leaking too much disk space. Simple VACUUM will mark the space as reusable, as opposed to VACUUM FULL that really reclaims the space occupied by dead rows, but has a much heavier disk-activity and CPU resource cost.

The best strategy is just to frequently run:

VACUUM inverted_word_index;

in a cron job, or in an admin plugin called by manitou-mdx, or using the built-in PostgreSQL facility known as autovacuum.

Deferred index update

In order to reduce the transactional activity incurred by updating the word vectors, manitou-mdx defers this job until there is enough messages that have been imported or a configurable amount of time has elapsed since the last flush.

However, there are some drawbacks:

Newly imported messages cannot be retrieved by a word search from a client application, until the next round of index flush.
If manitou-mdx is killed or disconnected from the database, the index cache can't be flushed and messages could stay outside of the index forever.

The first issue can be mitigated by lowering the value of the 'flush_word_index_max_queued' parameter, down to a value of 1 for which the word vector updates are not deferred at all. Also, setting a smaller value for 'flush_word_index_interval' lowers the time interval during which a search could miss a freshly imported message, due to the index not being up-to-date. Both these parameters are set in manitou-mdx configuration file.

The second issue is dealt with by using a queue of messages for which the inverted word index has not yet been flushed to disk, in the jobs_queue table. When a message is commited to the database, so is its entry in the jobs_queue table, in the same transaction. This entry consists of the mail_id and a job_type equal to 'widx' for indexing. Later on, when the corresponding round of inverted index writing is done, this entry is deleted. As a result, the IDs of messages whose contents have not been actually written into the word index (probably because of a previous unexpected termination) can be found anytime in the jobs_queue table. So when manitou-mdx starts, it queries these entries and goes on indexing the corresponding messages, re-reading them back from the database.