
Parallel import

July 18th, 2016

Importing in parallel from a single source has been properly supported in manitou-mdx since commit 6a860e, under the following conditions:

  • parallelism is driven from the outside: manitou-mdx instances run concurrently, but don’t fork and manage child workers. Workers don’t share anything. Fortunately GNU parallel can easily handle this part.
  • the custom full text indexing is done once the contents are imported, not during the import. The reason is that it absolutely needs a cache for performance, and such a cache wouldn’t work in the share-nothing implementation mentioned above.

The previous post showed how to create a list of all mail files to import from the Enron sample database.

Now instead of that, let’s create a list split into chunks of 25k messages, each of which will be fed separately to a parallel worker:


$ find . -type f | split -d -l 25000 - /data/enron/list-

The result is 21 numbered files (list-00 to list-20) of 25000 lines each, except for the last one, list-20, which contains 17401 lines.
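The chunk arithmetic can be checked without the mail files at hand. The following is only a sketch: it replaces the output of find with a synthetic list of 517401 lines produced by seq, and writes the chunks to a temporary directory instead of /data/enron (both are assumptions made for illustration). It relies on the -d option of GNU split, as the command above does.

```shell
# Sketch: reproduce the chunking on a synthetic 517401-line list.
# seq stands in for the real "find . -type f" output (illustration only).
tmpdir=$(mktemp -d)
seq 517401 | split -d -l 25000 - "$tmpdir/list-"

# Count the chunks and the lines of the last one; $(( )) strips any
# whitespace padding that some wc implementations add.
nfiles=$(( $(ls "$tmpdir" | wc -l) ))
last_lines=$(( $(wc -l < "$tmpdir/list-20") ))
echo "files: $nfiles, lines in list-20: $last_lines"

rm -rf "$tmpdir"
```

With 20 full chunks of 25000 lines, the remainder is 517401 - 500000 = 17401 lines, which is what ends up in list-20.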

The main command is essentially the same as before. As a shell variable:

cmd="mdx/script/manitou-mdx --import-list={} \
--import-basedir=$basedir/maildir \
--conf=$basedir/enron-mdx.conf \
--status=33"

Based on this, a parallel import with 8 workers can be launched through a single command:

ls "$basedir"/list-* | parallel -j 8 $cmd

This invocation will automatically launch manitou-mdx processes and feed each of them a different list of mails to import (through the --import-list={} argument). It will also take care that there are always 8 such running processes if possible, launching a new one when another terminates.
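Before starting a long import, it can be useful to see the command lines that would be generated. The sketch below uses the --dry-run option of GNU parallel, which prints each command instead of executing it. The temporary directory and the three empty list files are placeholders, not part of the actual setup, and the guard simply skips the preview when GNU parallel is not installed.

```shell
# Sketch: preview the commands GNU parallel would run, without importing.
# The temporary basedir and the three empty lists are placeholders.
basedir=$(mktemp -d)
touch "$basedir/list-00" "$basedir/list-01" "$basedir/list-02"

# $basedir is expanded here, at assignment time; {} stays literal and is
# replaced later by parallel with each input line.
cmd="mdx/script/manitou-mdx --import-list={} \
--import-basedir=$basedir/maildir \
--conf=$basedir/enron-mdx.conf \
--status=33"

if parallel --version 2>/dev/null | grep -q 'GNU parallel'; then
    # --dry-run prints one command line per input file and runs nothing
    preview=$(ls "$basedir"/list-* | parallel --dry-run -j 8 $cmd)
else
    preview="GNU parallel not found"
fi
printf '%s\n' "$preview"

rm -rf "$basedir"
```

Each printed line should be the manitou-mdx command with {} replaced by the path of one list file, which is what each worker receives in the real run.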

This is very effective compared to a serial import. Here are the times spent to import the entire mailset (517401 messages) for various degrees of parallelism, on a small server with a Xeon D-1540 @ 2.00GHz processor (8 cores, 16 threads).

[Figure parallel-mdx: import times for various degrees of parallelism]
