Archive

Archive for the ‘Database’ Category

Improvements in mail deduplication

October 4th, 2016 Comments off

The no_duplicate plugin tracks exact duplicates, precisely incoming mail files having the same SHA1 fingerprint as a previously imported mail file.

Up to now, such duplicates could be discarded by simply declaring in manitou-mdx configuration file:


[name@example.com]
incoming_preprocess_plugins = no_duplicate

But when a manitou-mail database is used to only sync new messages from an IMAP server, with hierarchical tags reflecting folders, a message move across IMAP folders is interpreted as a duplicate coming in.

It’s fine and actually desirable not to import the message again, but ideally we’d want to see it in its new folder.

The no_duplicate plugin can now do that by acting both as a incoming_preprocess_plugin and as a incoming_postprocess_plugin.
The first step recognizes the duplicate, and optionally, updates the tags of the message instance already in the database.
The second step associates the SHA1 fingerprint of a newly imported message to its unique ID, which is necessary for the optional tags update to work, if a duplicate of this message comes in the future with different tags.

The declaration taking advantage of this new feature looks like:

[name@example.com]
incoming_preprocess_plugins = no_duplicate({update_tags=>1})
incoming_postprocess_plugins = no_duplicate

For more information on manitou-mdx plugins, see the documentation.

Categories: Database, New features Tags:

SQL functions for tags

February 29th, 2012 Comments off

Tags in Manitou-mail are hierarchical, for several reasons such as the ability to mimic folders. There are pros and cons of this choice, but from the point of view of SQL querying, tree-like structures are clearly more complicated than flat structures. Here are two functions in the wiki that could be of help to compare tags across hierarchies:

  • tag_path(tag_id) returns the full hierarchical path of a tag, with -> as the separator between branches.
  • tag_depth(tag_id) returns the depth of the tag inside its hierarchy, starting at 1

As an example of use, in the custom queries of the user interface, we could use this query:

select mail_id from mail_tags mt join tags t on (mt.tag=t.tag_id) where tag_path(t.tag_id) ilike 'ParentTag->%'

to retrieve any message tagged with any tag whose top-level ancestor is ‘ParentTag’, no matter how deep the tag is inside the hierarchy (child, grandchild, grand grandchild…)

Categories: Database, Development Tags:

Acting on all tagged messages except some

October 29th, 2011 Comments off

Recently I wanted to reduce the size of my main manitou-mail database, and thus I’ve decided to delete all the messages I’ve received from some mailing-lists. I know these messages are archived elsewhere, so that I could re-import them if needed anyway.
But I didn’t like the idea of deleting also the messages that I’ve sent to these mailing-lists, because it would have broken the rule I’ve adopted of keeping all sent messages. Also on second thought, I thought it would be best to include also the whole threads in which I’ve participated, so that the context of the messages would still be available (BTW, the entire thread to which a message belongs can be recalled in the user interface by the contextual menu command: “Show thread on new page”).

So the question was, how to select all the messages tagged with certain tags, but excluding all threads for which at least one message has the Sent status? As usual, the database and the SQL come to help. First I looked up the tag_id’s of the tags corresponding to the mailing-lists, let’s say they were 3,6 and 10. And then I just expressed in SQL the sentence above. The result is:

SELECT mt.mail_id FROM mail_tags mt JOIN mail m1 USING (mail_id)
 WHERE tag IN (3,6,10) AND NOT EXISTS
 (SELECT 1 FROM mail m2 WHERE m2.thread_id=m1.thread_id AND m2.status&128!=0)

After issuing this query with the selection dialog, with the Limit To field empty to ensure that all messages are retrieved, all that was needed to accomplish the task was to select all the messages in the resulting list (Ctrl-A) and hit the Del key.

Categories: Database, Usage Tags:

Removing unused filters

May 12th, 2011 Comments off

For manitou-mail installations that use a lot of filters, it may be a good idea to check from time to time which ones are still useful and which ones are unused.
Since all the filters are evaluated for each incoming message (except if a stop action is encountered), keeping around a large number of obsolete filters may have an adverse impact on CPU usage.
Fortunately, manitou-mdx gathers statistics on filter hits, so it’s easy to find out which filters no longer generate any hit, with the help of some SQL.
Let’s start with a query that retrieve filters that never had any hit:

SELECT expr_id,name FROM filter_expr LEFT JOIN filter_log USING (expr_id)
 WHERE filter_log.expr_id IS NULL;

Now it may be that some of the filters returned by this query are new so that no hit on them occurred yet. We need to filter out these by adding a condition on the last_update field, requesting that the filter hasn’t been modified or created as new since at least 3 months.
Also, we only want entries from filter_expr that have actions tied on them, because filters without actions can be used as sub-expressions (that’s advanced filter usage) and don’t generate any hit.
With these additional conditions, the query becomes:

SELECT DISTINCT expr_id,name FROM filter_action JOIN filter_expr USING (expr_id)
 LEFT JOIN filter_log USING (expr_id)
 WHERE filter_log.expr_id IS NULL AND filter_expr.last_update<now()-'3 months'::INTERVAL;

With the query above we can check in advance what we’re about to delete.

Now, the deletion itself needs two steps, one for the filter_action table and another for filter_expr. Since both tables are joined in the query, we need a preliminary step to save the expr_id to delete into a temporay table. The SQL sequence including the transaction is:

BEGIN;
 
CREATE TEMPORARY TABLE del_expr AS
SELECT DISTINCT expr_id FROM filter_action JOIN filter_expr USING (expr_id)
 LEFT JOIN filter_log USING (expr_id)
 WHERE filter_log.expr_id IS NULL
 AND filter_expr.last_update&lt;now()-'3 months'::INTERVAL;
 
DELETE FROM filter_action WHERE expr_id IN (SELECT expr_id FROM del_expr);
 
DELETE FROM filter_expr WHERE expr_id IN (SELECT expr_id FROM del_expr);
 
COMMIT;

To additionally delete the filters that haven’t been used for a significant period of time (for example one year), they could be added to our temporary table before the above deletion:

INSERT INTO del_expr SELECT expr_id
 FROM filter_log GROUP BY expr_id
 HAVING MAX(hit_date)<now()-'1 year'::INTERVAL)

Happy filters cleaning!

Categories: Database, Usage Tags:

Word search in SQL

April 9th, 2011 Comments off

The current version of manitou-mail uses C++ code inside the interface to deal with the inverted word index. There is also a Perl version in the Manitou::Words module (see sub search), but so far it wasn’t possible to issue a search directly from inside an SQL query, making it hard to combine the results with other criteria.
I’m glad to say that it’s now possible, by using a new wordsearch() function implemented in pl/pgsql.
This opens up the possibility of doing some interesting custom queries, such as for example:

SELECT mail_id,subject,msg_date
 FROM mail m
  JOIN (SELECT * FROM wordsearch(array['foo','bar']) AS id) s ON (m.mail_id=s.id)
 WHERE m.status&16=0 AND msg_date>now()-'1 year'::INTERVAL;

This query retrieves the messages that contain both ‘foo’ and ‘bar’, excluding those that are in the trashcan, and those that are more than one year old.
This should be pretty fast, too. This one runs for me in 0.4s on a database containing about 47000 messages.
Not only this kind of query can be launched by any custom program using the mail database, but it can also be saved inside the manitou-mail user interface as a “user query” and be accessible from the quick selection panel.

The code will be shipped with future versions of manitou, but in the meantime the function source code is available in the wiki.

Categories: Database, Development Tags: