Manitou-Mail: parallel antispam

Parallel antispam

Date: Sat, 17 Feb 2007
Author: Daniel Vérité
Applies to: Manitou-Mail, any version

SpamAssassin is effective but not very fast. On installations that receive large volumes of spam, invoking spamc on one message at a time through the spamc plugin may be too slow to keep up with the stream of incoming mail.

This article shows a simple way to parallelize the calls to SpamAssassin in a filtering stage that runs before manitou-mdx imports the mail into the database. This front-end is implemented in Perl.

SpamAssassin and parallel processing

spamd provides child processes that can work in parallel, their number being configured through the --max-children command line parameter. When this limit is reached, SA will stop analyzing new mail and produce this error message instead (until some children become available again):

prefork: server reached --max-children setting, consider raising it

So the first thing we need to do is control our degree of parallelism to avoid feeding SA with more messages than its limit. Fortunately we can do this easily since we are decoupled from the flow of incoming mail. We simply maintain a list of our forked processes and wait if we're already at the maximum, until a child has finished.
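The idea of capping the number of forked children can be sketched in isolation. The helper name run_bounded and the toy jobs below are assumptions made for illustration and are not part of the script in this article; unlike that script, which polls with WNOHANG, this sketch simply blocks in waitpid whenever the limit is reached.

```perl
use strict;
use warnings;

# Hypothetical helper: run each job (a code ref) in a forked child,
# never keeping more than $max children alive at the same time.
# Returns (hash ref of job index => exit code, peak number of children).
sub run_bounded {
  my ($max, @jobs) = @_;
  my %slot;        # pid => index of the job run by that child
  my %status;      # job index => exit code of its child
  my $peak = 0;    # highest number of simultaneous children

  my $reap = sub {
    my $pid = waitpid(-1, 0);            # block until any child finishes
    die "no child to wait for\n" if $pid <= 0;
    my $code = $? >> 8;                  # exit code is in the high byte
    my $index = delete $slot{$pid};
    $status{$index} = $code;
  };

  for my $i (0 .. $#jobs) {
    $reap->() while keys(%slot) >= $max; # at the limit: wait for a free slot
    my $pid = fork;
    die "fork failed: $!\n" unless defined $pid;
    if ($pid == 0) {
      exit $jobs[$i]->();                # child: run the job, exit with its result
    }
    $slot{$pid} = $i;
    $peak = keys(%slot) if keys(%slot) > $peak;
  }
  $reap->() while %slot;                 # drain the remaining children
  return (\%status, $peak);
}

# Six toy jobs that return 1 for odd numbers and 0 for even ones
my @jobs = map { my $n = $_; sub { return $n % 2 } } 1 .. 6;
my ($status, $peak) = run_bounded(2, @jobs);
print "peak parallelism: $peak\n";
```

Keeping the limit at or below spamd's --max-children value is what prevents the error message shown above.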

Moving the spam away

Once a message is recognized as spam, we move it into a dedicated directory whose contents are not imported into the database. We could just delete it, but we leave that decision to a policy that has to be chosen and implemented by the mail administrator.

Non-spam mailfiles are moved into the 'mailfiles_directory' of our manitou-mdx configuration, so that they'll get imported into the database as soon as possible.
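The sorting decision can be illustrated with a small sketch. It relies on spamc -c exiting with status 1 for spam and 0 otherwise; the helper name destination_for and the /tmp/in path are hypothetical and only serve the example.

```perl
use strict;
use warnings;

# Hypothetical helper: map a raw wait status (the value of $? after
# waitpid) to the path the mailfile should be moved to.
sub destination_for {
  my ($wait_status, $dir, $fname) = @_;
  my $exit_code = $wait_status >> 8;    # high byte: exit code of the child
  my $signal    = $wait_status & 127;   # low 7 bits: terminating signal, if any
  # Only a clean exit with code 1 counts as spam; anything else is kept
  my $subdir = ($signal == 0 && $exit_code == 1) ? "spam" : "notspam";
  return "$dir/$subdir/$fname";
}

print destination_for(256, "/tmp/in", "mail-1-2-3.received"), "\n";
print destination_for(0,   "/tmp/in", "mail-1-2-3.received"), "\n";
```

Treating every status other than a clean exit code 1 as non-spam errs on the side of keeping mail when the check itself fails.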

The script

It takes two arguments on the command line: the maximum number of parallel processes and the directory in which the mailfiles are to be found. The subdirectories "spam" and "notspam" of that directory are hardcoded into the source code and must exist, since the script stops if it cannot move a file into them. In general the script is intended to be used as a skeleton, although it can also run "as is".
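Since the script moves files into the "spam" and "notspam" subdirectories with rename, a missing subdirectory makes it stop. A possible preparation step, sketched here with a hypothetical helper named ensure_subdirs and a temporary directory standing in for the real mail directory, could look like this:

```perl
use strict;
use warnings;
use File::Temp 'tempdir';

# Hypothetical helper: create the two hardcoded subdirectories
# under $dir if they do not exist yet.
sub ensure_subdirs {
  my ($dir) = @_;
  for my $sub ("spam", "notspam") {
    my $path = "$dir/$sub";
    next if -d $path;                   # already there: nothing to do
    mkdir($path) or die "unable to create $path: $!\n";
  }
}

# Demonstration on a throwaway directory (removed when the program exits)
my $demo_dir = tempdir(CLEANUP => 1);
ensure_subdirs($demo_dir);
ensure_subdirs($demo_dir);              # second call is a no-op
print "ready: $demo_dir\n";
```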

Source code (download parallel-spamass.pl):

use strict;
use POSIX 'WNOHANG';

my $running=0;
my $global_end=0;
my %files;			# pid=>file to process
my $verbose=0;

my $maxproc=$ARGV[0] || die "Usage: $0 max_nb_of_procs directory\n";
my $dir=$ARGV[1] || die  "Usage: $0 max_nb_of_procs directory\n";

sub sigterm {
  $global_end=1;
}
# install the handler so that SIGTERM ends the main loop cleanly
$SIG{TERM}=\&sigterm;

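sub aspam {
  my $dir=shift;
  my $file=shift;
  my $pid=fork;
  die "fork failed: $!\n" if (!defined $pid);
  if ($pid) {
    # parent: remember which file this child is checking
    $files{$pid}=$file;
    $running++;
  }
  else {
    # child: spamc -c exits with 1 if the message is spam, 0 otherwise
    print "spamc -c <$dir/$file\n" if ($verbose);
    exec "spamc -c <$dir/$file";
    # reached only if exec failed; exit code 2 files the message as non-spam
    warn "unable to exec spamc: $!\n";
    exit 2;
  }
}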

sub process_results {
  my $pid;
  # waitpid returns 0 if no child has finished yet, -1 if there is no child left
  if (($pid=waitpid(-1, WNOHANG)) > 0) {
    print "exit $pid: $?\n" if ($verbose);
    $running--;
    my $fname=$files{$pid};
    if (!defined $fname) {
      die "pid=$pid not found in hash\n";
    }
    my $newname;
    if (($?>>8)==1) {	# spamc -c exited with status 1: spam
      $newname="$dir/spam/$fname";
    }
    else {
      $newname="$dir/notspam/$fname";
    }
    if (!rename("$dir/$fname", $newname)) {
      die "unable to rename $dir/$fname to $newname: $!\n";
    }
    delete $files{$pid};
  }
  else {
    select(undef, undef, undef, 0.25); # sleep 250 ms
  }
}

# main loop
while (!$global_end) {
  opendir(DIR, $dir) || die "Unable to opendir $dir: $!";
  my @files = grep (/^mail-(\d+\-\d+\-\d+)\.received$/, readdir(DIR));
  closedir(DIR);
  foreach my $filename (@files) {
    # at the limit: wait for a free slot instead of skipping the file
    while ($running >= $maxproc) {
      process_results;
    }
    last if ($global_end);
    print "processing $filename\n" if ($verbose);
    aspam($dir, $filename);
  }
  while ($running>0 && !$global_end) {
    # wait for all children to finish to avoid re-selecting
    # files that are being processed currently
    process_results;
  }
  sleep(1) unless ($global_end);
}
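The main loop only picks up directory entries whose names match the mail-N-N-N.received pattern. The short sketch below applies the same regular expression to a few made-up names (all of them hypothetical) to show what gets selected and what is ignored, including the "spam" and "notspam" entries that readdir returns.

```perl
use strict;
use warnings;

# Hypothetical directory entries, as readdir might return them
my @candidates = (
  "mail-20070217-153000-42.received",   # three numeric groups: selected
  "mail-20070217-153000.received",      # only two numeric groups: ignored
  "mail-1-2-3.received.tmp",            # extra suffix: ignored
  "spam",                               # subdirectory: ignored
  "notspam",                            # subdirectory: ignored
);

# Same pattern as in the script's main loop
my @selected = grep { /^mail-(\d+\-\d+\-\d+)\.received$/ } @candidates;
print "$_\n" for @selected;
```

If the mailfiles on a given installation are named differently, this pattern is the part of the skeleton to adapt.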
