Mass-importing case: the Enron mail database

Importing mail messages en masse works best when fiddling a bit with the configuration, rather than pushing the mail messages into the normal feed.

As an example, we’re going to use the mails from Enron, the energy provider that famously went down in the 90s, amidst a fraud scandal.

The mail corpus has been made public by the judicial process:

http://www.cs.cmu.edu/~enron/

It has been cleaned from all attachments, in addition to another cleaning process to remove potentially sensitive personal information, done by Nuix.

The archive format is a 423MB .tar.gz file with an MH-style layout:

– one top-level directory per account.

– inside each account, files and directories with mail folders.

It contains 3500 directories for 151 accounts, and a total of 517401 files, taking 2.6GB on disk once uncompressed.

After unpacking the archive, follow these steps to import the mailset from scratch:

1) Create the list of files

$ cd /data/enron/maildir $ find . -type f > /data/enron/00-list-all

2) Create a database and a dedicated configuration file for manitou-mdx

# Run this as a user with enough privileges to create
# a database (generally, postgres should do)
$ manitou-mgr --create-database --db-name=enron

Create a specific configuration file with some optimizations for mass import:

$ cat enron-mdx.conf
[common]
db_connect_string = Dbi:Pg:dbname=enron;user=manitou
update_runtime_info = no
update_addresses_last = no
apply_filters = no
index_words = no
preferred_datetime = sender

update_runtime_info is set to no to avoid needlessly update timestamps in the runtime_info table for every imported message.

update_addresses_last set to no also will avoid some unnecessary writes.

apply_filters is again a micro-optimization to avoid querying for filters on every message. On the other hand, it should be left to yes if happen to have defined filters and want them to be used during this import.

index_words is key to performance. Running the full-text indexing after the import instead of during it makes it 3x faster. Also the full-text indexing as a separate process can be parallelized (more on that below).

preferred_datetime set to sender indicates that the date of a message is given by its header Date field, as opposed to the file creation time.

If we were importing into a pre-existing manitou-mdx instance running in the background, we would stop it at this point, as

several instances of manitou-mdx cannot work on the same database because of caching, except in specific circumstances (also more on that later).

3) Run the actual import command

$ cd /data/enron/maildir
$ time manitou-mdx --import-list=../00-list-all --conf=../enron-mdx.conf

On a low-end server, it takes about 70 minutes to import the 517402 messages with this configuration and PostgreSQL 9.5.

We can check with psql that all messages came in:

$ psql -d enron -U manitou
psql (9.5.3)
Type "help" for help.</p>
<p>enron=> select count(*) from mail;
 count
--------
 517401
(1 row)

4) Run the full text indexing

As it’s a new database with no preexisting index, we don’t have to worry about existing partitions. We let manitou-mgr index the messages with 4 jobs in parallel:

$ time manitou-mgr --conf=enron-mdx.conf --reindex-full-text --reindex-jobs=4

Output from time:

real    10m41.855s
user    28m22.744s
sys     1m8.476s

So this part of the process takes about 10 minutes.

Conclusion

With manitou-mgr, we can check the final size of the database and its main tables:

$ manitou-mgr --conf=enron-mdx.conf --print-size
-----------------------------------
addresses           :    13.52 MB
attachment_contents :     0.02 MB
attachments         :     0.02 MB
body                :   684.98 MB
header              :   402.45 MB
inverted_word_index :  2664.77 MB
mail                :   250.12 MB
mail_addresses      :   441.17 MB
mail_tags           :     0.01 MB
pg_largeobject      :     0.01 MB
raw_mail            :     0.01 MB
words               :   106.52 MB
-----------------------------------
Total database size : 4633 MB

Future posts will show how it compares to the full mailset (with attachments, 18GB of .pst files), and how to parallelize the main import itself.