
Archive for the ‘Usage’ Category

Parallel import

July 18th, 2016

Importing in parallel from a single source is possible in manitou-mdx since commit 6a860e, under the following conditions:

  • parallelism is driven from the outside: manitou-mdx instances run concurrently, but don’t fork and manage child workers. Workers don’t share anything. Fortunately GNU parallel can easily handle this part.
  • the custom full text indexing is done once the contents are imported, not during the import. The reason is that it absolutely needs a cache for performance, and such a cache wouldn’t work in the share-nothing implementation mentioned above.

The previous post showed how to create a list of all mail files to import from the Enron sample database.

Now, instead of a single list, let's split it into chunks of 25k messages each, which will be fed separately to the parallel workers:


$ find . -type f | split -d -l 25000 - /data/enron/list-

The result is 21 numbered files of 25000 lines each, except for the last one, list-20, which contains 17401 lines.
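
If you want to double-check the split, wc confirms the chunk sizes (25000 lines per file, 17401 for the last one):

$ wc -l /data/enron/list-*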

The main command is essentially the same as before. As a shell variable:

# basedir is where the Enron archive was unpacked (/data/enron in the previous post)
basedir=/data/enron
cmd="mdx/script/manitou-mdx --import-list={} \
--import-basedir=$basedir/maildir \
--conf=$basedir/enron-mdx.conf \
--status=33"

Based on this, a parallel import with 8 workers can be launched through a single command:

ls "$basedir"/list-* | parallel -j 8 $cmd

This invocation automatically launches manitou-mdx processes and feeds each of them a different list of mails to import (through the --import-list={} argument). It also makes sure that, whenever possible, 8 such processes are running, launching a new one as soon as another terminates.

This is very effective compared to a serial import. Here are the times spent importing the entire mailset (517401 messages) for various degrees of parallelism, on a small server with a Xeon D-1540 @ 2.00GHz processor (8 cores, 16 threads).

[Chart: parallel-mdx, import duration versus number of parallel workers]

Categories: Usage

Mass-importing case: the Enron mail database

July 12th, 2016

Importing mail messages en masse works best when fiddling a bit with the configuration, rather than pushing the mail messages into the normal feed.

As an example, we're going to use the mails from Enron, the energy company that famously collapsed in 2001 amidst a fraud scandal.
The mail corpus has been made public by the judicial process:
http://www.cs.cmu.edu/~enron/

It has been stripped of all attachments, and a further cleaning pass, done by Nuix, removed potentially sensitive personal information.

The archive format is a 423MB .tar.gz file with an MH-style layout:

  • one top-level directory per account;
  • inside each account, files and directories with mail folders.

It contains 3500 directories for 151 accounts, and a total of 517401 files, taking 2.6GB on disk once uncompressed.

After unpacking the archive, follow these steps to import the mailset from scratch:

1) Create the list of files


$ cd /data/enron/maildir
$ find . -type f > /data/enron/00-list-all

2) Create a database and a dedicated configuration file for manitou-mdx


# Run this as a user with enough privileges to create
# a database (generally, postgres should do)
$ manitou-mgr --create-database --db-name=enron

Create a specific configuration file with some optimizations for mass import:


$ cat enron-mdx.conf
[common]
db_connect_string = Dbi:Pg:dbname=enron;user=manitou

update_runtime_info = no
update_addresses_last = no
apply_filters = no
index_words = no

preferred_datetime = sender

update_runtime_info is set to no to avoid needlessly updating timestamps in the runtime_info table for every imported message.

update_addresses_last set to no also avoids some unnecessary writes.

apply_filters is again a micro-optimization to avoid querying for filters on every message. On the other hand, it should be left to yes if you happen to have defined filters and want them applied during this import.

index_words is key to performance. Running the full-text indexing after the import instead of during it makes the import about 3x faster. Also, as a separate process, the full-text indexing can be parallelized (more on that below).

preferred_datetime set to sender indicates that the date of a message is given by its header Date field, as opposed to the file creation time.

If we were importing into a pre-existing manitou-mdx instance running in the background, we would stop it at this point, as
several instances of manitou-mdx cannot work on the same database because of caching, except in specific circumstances (also more on that later).

3) Run the actual import command


$ cd /data/enron/maildir
$ time manitou-mdx --import-list=../00-list-all --conf=../enron-mdx.conf

On a low-end server, it takes about 70 minutes to import the 517401 messages with this configuration and PostgreSQL 9.5.

We can check with psql that all messages came in:

$ psql -d enron -U manitou
psql (9.5.3)
Type "help" for help.

enron=> select count(*) from mail;
count
--------
517401
(1 row)

4) Run the full text indexing

As it's a new database with no preexisting index, we don't have to worry about existing partitions. We let manitou-mgr index the messages with 4 jobs in parallel:


$ time manitou-mgr --conf=enron-mdx.conf --reindex-full-text --reindex-jobs=4

Output from time:

real 10m41.855s
user 28m22.744s
sys 1m8.476s

So this part of the process takes about 10 minutes.

Conclusion

With manitou-mgr, we can check the final size of the database and its main tables:

$ manitou-mgr --conf=enron-mdx.conf --print-size
-----------------------------------
addresses : 13.52 MB
attachment_contents : 0.02 MB
attachments : 0.02 MB
body : 684.98 MB
header : 402.45 MB
inverted_word_index : 2664.77 MB
mail : 250.12 MB
mail_addresses : 441.17 MB
mail_tags : 0.01 MB
pg_largeobject : 0.01 MB
raw_mail : 0.01 MB
words : 106.52 MB
-----------------------------------
Total database size : 4633 MB

Future posts will show how it compares to the full mailset (with attachments, 18GB of .pst files), and how to parallelize the main import itself.

Categories: Usage

Attachment uploader reloaded

February 24th, 2012

The attachment uploader is a plugin that solves the problem of attachments that are too big to be transferred by mail. It requires no specific action from the person who composes the mail: on the client side, the user attaches the files as usual. This is important because the users are often not aware that sending huge files by e-mail is troublesome.

The upload happens on the server side: before passing outgoing messages to the delivery service, the plugin checks for attachments bigger than the configured size, and when any are found, transfers them to a web server under a randomly generated directory name. Inside the message, the contents are replaced by the URL pointing to them.

Until recently, the attachments had to be sent to the web server by FTP. What's new is that the transfer can now be done over SSH instead, provided that the Net::SFTP::Foreign Perl module is installed on the server.

The new version of the plugin comes with manitou-mdx 1.2.0 and is named attach_uploader_ssh.
(source code on github)

To activate the plugin and connect it to a sender's identity, it needs to be declared in the manitou-mdx configuration file.
Example:

[mymail@domain.tld]
outgoing_plugins = attach_uploader_ssh({host=>"www.myserver.tld",
  login=>"sshuser", base_url=>"http://attached.myserver.tld",
  path=>"/var/www/attached.myserver.tld", maxsize=>2000000})

There is a “password” field that could be used but in this example it is assumed that an SSH key lets the script connect without a password. This is just one of the multiple security options and choices that open up when switching from FTP to SSH.
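
For reference, here is one possible way to set up such a key, assuming the plugin goes through the default SSH key of the account running manitou-mdx (Net::SFTP::Foreign normally drives the system ssh client), and reusing the host and login from the example above:

# Run as the system account that runs manitou-mdx; an empty passphrase
# lets the plugin connect unattended (skip this if a usable key already exists)
$ ssh-keygen -t ed25519 -f ~/.ssh/id_ed25519 -N ""
# Install the public key for the login declared in the plugin options
$ ssh-copy-id -i ~/.ssh/id_ed25519.pub sshuser@www.myserver.tld
# Check that a non-interactive connection now works
$ ssh -o BatchMode=yes sshuser@www.myserver.tld true && echo "key login OK"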

And here is how a picture larger than 2MB would appear in the message (as a text/plain part):

The attached file is available at:
http://attached.myserver.tld/JXlafBybAKGrwE5yyOIbKA/PICT34.JPG
Categories: New features, Usage

Quick resend functionality

November 25th, 2011

Sometimes a message that has been previously sent needs to be sent again. The normal way to do that is to recompose a new message by copying the contents of the old one. This leads to a new message with identical contents except for the Date and Message-Id header fields.
However, there is a quicker way to re-send an outgoing message without creating a duplicate of the original: if the Sent and Archived bits of the message status are cleared, manitou-mdx will simply pick up the message again for sending, as if it were new. To clear these bits, use the Message->Properties command and check No in the boxes drawn with the red border in the screenshot:

Categories: Usage, User Interface

Acting on all tagged messages except some

October 29th, 2011

Recently I wanted to reduce the size of my main manitou-mail database, so I decided to delete all the messages I had received from some mailing-lists. I knew these messages were archived elsewhere, so I could re-import them later if needed.
But I didn't like the idea of also deleting the messages that I had sent to these mailing-lists, because that would have broken my rule of keeping all sent messages. On second thought, it also seemed best to keep the whole threads in which I had participated, so that the context of my messages would still be available (by the way, the entire thread to which a message belongs can be recalled in the user interface with the contextual menu command: "Show thread on new page").

So the question was: how to select all the messages tagged with certain tags, while excluding every thread in which at least one message has the Sent status? As usual, the database and SQL come to the rescue. First I looked up the tag_id's of the tags corresponding to the mailing-lists; let's say they were 3, 6 and 10. Then I just expressed the sentence above in SQL. The result is:

SELECT mt.mail_id FROM mail_tags mt JOIN mail m1 USING (mail_id)
 WHERE tag IN (3,6,10) AND NOT EXISTS
 (SELECT 1 FROM mail m2 WHERE m2.thread_id=m1.thread_id AND m2.status&128!=0)

After issuing this query with the selection dialog, with the Limit To field empty to ensure that all messages are retrieved, all that was needed to accomplish the task was to select all the messages in the resulting list (Ctrl-A) and hit the Del key.
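
For extra safety, the same query can first be wrapped in a count(*) to see how many messages would be selected, for instance with psql (the database name below is a placeholder):

$ psql -d MYDB -U manitou -c "
SELECT count(*) FROM (
  SELECT mt.mail_id FROM mail_tags mt JOIN mail m1 USING (mail_id)
   WHERE tag IN (3,6,10) AND NOT EXISTS
   (SELECT 1 FROM mail m2 WHERE m2.thread_id=m1.thread_id AND m2.status&128!=0)
) AS candidates;"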

Categories: Database, Usage

Removing unused filters

May 12th, 2011

For manitou-mail installations that use a lot of filters, it may be a good idea to check from time to time which ones are still useful and which ones are unused.
Since all the filters are evaluated for each incoming message (except if a stop action is encountered), keeping around a large number of obsolete filters may have an adverse impact on CPU usage.
Fortunately, manitou-mdx gathers statistics on filter hits, so it’s easy to find out which filters no longer generate any hit, with the help of some SQL.
Let's start with a query that retrieves the filters that never had any hit:

SELECT expr_id,name FROM filter_expr LEFT JOIN filter_log USING (expr_id)
 WHERE filter_log.expr_id IS NULL;

Now it may be that some of the filters returned by this query are new, so that no hit on them has occurred yet. We need to filter these out by adding a condition on the last_update field, requiring that the filter hasn't been created or modified in the last 3 months.
Also, we only want entries from filter_expr that have actions tied to them, because filters without actions can be used as sub-expressions (that's advanced filter usage) and don't generate any hit.
With these additional conditions, the query becomes:

SELECT DISTINCT expr_id,name FROM filter_action JOIN filter_expr USING (expr_id)
 LEFT JOIN filter_log USING (expr_id)
 WHERE filter_log.expr_id IS NULL AND filter_expr.last_update<now()-'3 months'::INTERVAL;

With the query above we can check in advance what we’re about to delete.

Now, the deletion itself needs two steps, one for the filter_action table and another for filter_expr. Since both tables are joined in the query, we need a preliminary step to save the expr_id values to delete into a temporary table. The SQL sequence, including the transaction, is:

BEGIN;
 
CREATE TEMPORARY TABLE del_expr AS
SELECT DISTINCT expr_id FROM filter_action JOIN filter_expr USING (expr_id)
 LEFT JOIN filter_log USING (expr_id)
 WHERE filter_log.expr_id IS NULL
 AND filter_expr.last_update<now()-'3 months'::INTERVAL;
 
DELETE FROM filter_action WHERE expr_id IN (SELECT expr_id FROM del_expr);
 
DELETE FROM filter_expr WHERE expr_id IN (SELECT expr_id FROM del_expr);
 
COMMIT;

To additionally delete the filters that haven't been used for a significant period of time (for example, one year), their expr_id values can be added to the temporary table before the deletions above:

INSERT INTO del_expr SELECT expr_id
 FROM filter_log GROUP BY expr_id
 HAVING MAX(hit_date)<now()-'1 year'::INTERVAL;

Happy filter cleaning!

Categories: Database, Usage

Routing outgoing mail in manitou-mdx

October 18th, 2009

The default command invoked as the delivery agent by manitou-mdx is `sendmail -f $FROM$ -t`, where $FROM$ is replaced by the sender's email address, which matches what is called the sender's identity in Manitou-Mail. On a typical Unix system, this command generally corresponds to the Mail Submission Agent that is installed and responsible for routing outgoing messages. The sendmail name doesn't necessarily imply that the MSA is the sendmail SMTP server itself: it can be postfix, exim, esmtp, or other programs that have adopted the same name and command-line arguments for historical reasons and for the sake of interoperability.
In manitou-mdx, if this default command is not suitable, the administrator can replace it either globally or per sender's identity. A sender's identity is declared in the manitou-mdx configuration file by simply declaring a mailbox with its email address. In the Manitou-Mail user interface, the sender's identities are configured in the preferences and chosen in the composer.

A typical reason to use different delivery agents when using different identities is that messages may have to be routed to different SMTP servers with different authorization methods. For example, some servers will simply reject messages that have an unexpected From address.
Also there are other cases such as messages from a GMail address that should be routed to a Google SMTP server in order to be signed with a proper DomainKeys header field.
Indeed, while it's not mandatory, some receivers may pre-sort as spam, or reject, messages from GMail addresses that are not signed with a DomainKeys signature the way the Google SMTP servers sign them (not trusting such messages is precisely the point of DKIM). Let's see how to route messages written in Manitou-Mail from a GMail address to the Google SMTP servers.
I've used msmtp as a simple, easy-to-configure Mail Submission Agent. esmtp is also a candidate, but its Debian package makes it an alternative to postfix, and I happen to want it to supplement postfix, not replace it. msmtp, on the other hand, is a supplementary package that doesn't conflict with the default MSA.

The procedure to use msmtp is quite simple:
Create a $HOME/.certs directory if none already exists.
Create a $HOME/.msmtprc file (with perm 0600) containing:

# gmail account
auth on
host smtp.gmail.com
port 587
user USERNAME@gmail.com
password XXXXXX
from USERNAME@gmail.com
tls on
tls_trust_file /home/daniel/.certs/ThawtePremiumServerCA.crt

Obviously USERNAME is to be replaced by your GMail login.
The cert file is to be extracted from the set of Thawte certificates available at: https://www.verisign.com/support/thawte-roots.zip, with this command:

unzip -p thawte-roots.zip 'Thawte SSLWeb Server Roots/thawte Premium Server CA/Thawte Premium Server CA.pem' > ~/.certs/ThawtePremiumServerCA.crt

And in manitou-mdx’s configuration file, we have something like:

[common]
# various things
[USERNAME@gmail.com]
local_delivery_agent = msmtp -f $FROM$ -t
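
To check the msmtp setup independently of manitou-mdx, a one-off test message can be sent from the command line (the recipient address below is just a placeholder):

$ printf 'From: USERNAME@gmail.com\nTo: someone@example.org\nSubject: msmtp test\n\nHello from msmtp.\n' \
  | msmtp -f USERNAME@gmail.com -t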

UPDATE:

The certificate mentioned above is no longer accepted; we should now use Equifax_Secure_CA.crt. I located the file in the Debian package named "ca-certificates", so I changed my .msmtprc to:

tls_trust_file /usr/share/ca-certificates/mozilla/Equifax_Secure_CA.crt
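
Should the trusted root change again, the issuer of the certificate currently presented by Google's submission server can be inspected directly with openssl:

$ openssl s_client -connect smtp.gmail.com:587 -starttls smtp </dev/null 2>/dev/null | grep issuer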

Categories: Usage

Face header support

October 7th, 2009

While the X-Face header (48×48 BW picture) has been supported for a long time in the Manitou-Mail user interface, the Face header (48×48 color PNG) was not until yesterday.
Now it is, and while testing the code I found that it was another case where an SQL query quickly solved a practical selection problem. The Face header is not so widely used, so getting a significant sample of different pictures to show is not obvious.
Ideally I wanted to extract from my mail archive a gallery of pictures that would all be different. That is, if someone had posted 1000 messages with the same Face header, I wasn't interested in getting all those messages, only one of them, say the first by its ID, and I wanted the next mail in the list to have a different, non-empty Face, and so on for every message I would look at. It turns out that, in SQL, this can be expressed with:

SELECT min(mail_id)
FROM header
WHERE position(E'\nFace: ' in lines)>0
GROUP BY
split_part(substr(lines, position(E'\nFace: ' in lines)+7, 1300), E'\n', 1)

position(…) tells us where the Face header field begins, substr(…) extracts a sufficient length of it, and split_part(…) cuts the value exactly at the first newline, which marks the end of this header's value (header lines are unfolded in the header table precisely to allow that kind of extraction).
Finally the GROUP BY ensures that each row in the result represents a distinct value of the Face header.
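
Incidentally, the same split_part() expression can be used to pull a single Face value out of the database and decode it, since the value is base64-encoded PNG data. Here is a sketch with psql, where the database name and mail_id are placeholders:

$ psql -d MYDB -U manitou -At -c "
  SELECT split_part(substr(lines, position(E'\nFace: ' in lines)+7, 1300), E'\n', 1)
    FROM header WHERE mail_id = 123456" \
  | tr -d ' \t' | base64 -d > face.png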

This query can be directly input into the SQL statement field of the Query Selection dialog, after which all there is to do is wait for the database engine to run it to completion.

On my sample database of about 800,000 messages from various mailing lists, it turned out that the result was a list of 176 messages. Here is a collage of a selection of the pictures (public messages only).
[Image: face-gallery, a collage of the Face pictures found]
Here is how one particular message looks with its Face header:
[Image: face-sample-msg, a message displayed with its Face picture]
Right now this is just about displaying; sometime in the future I'll try to add Face headers to outgoing mail. I'd also like to associate pictures with sender addresses, so that messages from people who don't use a Face header (the majority) can still be shown with a dedicated picture. I feel that even tags or sender domains (which means companies and organizations) could benefit from that kind of visual representation in certain cases.

Categories: New features, Usage, User Interface