Queries: which MUAs are the most popular?

Date: Sat, 2 Sep 2006

This article demonstrates how to extract header fields within an SQL query. As an example, we'll elaborate a query that answers to the question: which mail user agents do most people use?

MUAs generally put their name and version into the optional X-Mailer or User-Agent headers. In the manitou database, the headers are stored as one big chunk of text per message, with header lines separated by newlines. The first thing that has to be worked out is a way to extract the value of a particular header field, using SQL string functions. We're searching for the 'User-Agent: ' substring, followed by a reasonable number of characters, followed by a newline. The messages that contain this header can be retrieved by:

 SELECT mail_id FROM header WHERE position('\nUser-Agent: ' IN lines) > 0

For a given mail_id, the header value can be extracted by way of this expression:

split_part( substr(lines, position('\nUser-Agent: ' in lines)+13, 200), '\n', 1)
(13 is the length of the 'User-Agent: ' string, 200 is an arbitrary maximum size for the field value).
If there are several 'User-Agent' fields in the same header, only the first will be considered.

In addition to 'User-Agent', we need the 'X-Mailer' field. We're selecting both with the following UNION query that returns (mail_id,user_agent) tuples:

  SELECT mail_id,
      split_part(substr(lines, position('\nUser-Agent: ' in lines)+13, 200), '\n', 1)
        AS ua
    FROM header WHERE position('\nUser-Agent: ' in lines) > 0
   UNION ALL
     SELECT mail_id,
       split_part( substr(lines, position('\nX-Mailer: ' in lines)+11, 200), '\n', 1)
          AS ua
    FROM header WHERE position('\nX-Mailer: ' in lines) > 0

Out of a set of messages, we must count only one MUA per sender (as opposed to one per message, otherwise the MUAs used by proficient posters would be over-valued). A reasonable way to do this is selecting only the last message of each different sender from our set of mail. The sender is found in the row of mail_addresses whose addr_type is 1 (=the built-in id of the From address field).

The selection on which we work is constrained by a tag and a lower bound for the message date. Below, we're choosing a psql variable as the tag and the first day of the current year for the minimum date

SELECT ma.addr_id, max(ma.mail_id) AS mail_id FROM mail_tags t, mail_addresses ma, mail m
   WHERE t.tag=:t
   AND ma.mail_id=t.mail_id AND addr_type=1
   AND m.mail_id=t.mail_id AND m.msg_date>=date_trunc('year',now()) GROUP BY ma.addr_id

Finally the two lists are simply joined together by mail_id, then grouped by user agent and counted. Remember that :t is a variable that should be replaced by a tag_id.

SELECT
  ua, count(*) AS cnt
FROM
  (SELECT ma.addr_id, max(ma.mail_id) AS mail_id FROM mail_tags t, mail_addresses ma,
   mail m
   WHERE t.tag=:t AND ma.mail_id=t.mail_id AND addr_type=1
   AND m.mail_id=t.mail_id AND m.msg_date>=date_trunc('year',now()) GROUP BY ma.addr_id) ml,
  (select mail_id,
       split_part( substr(lines, position('\nUser-Agent: ' in lines)+13, 200), '\n', 1)
        AS ua
    FROM header where position('\nUser-Agent: ' in lines)>0
   UNION ALL
     select mail_id,
       split_part( substr(lines, position('\nX-Mailer: ' in lines)+11, 200), '\n', 1)
          AS ua
    FROM header where position('\nX-Mailer: ' in lines)>0
   ) h
WHERE
  ml.mail_id=h.mail_id
GROUP BY ua ORDER BY cnt DESC

Sample results

For the kde-devel mailing-list, unsurprisingly KMail comes out on top:

                 ua                      | cnt
-----------------------------------------+-----
 KMail/1.9.1                             | 132
 KMail/1.9.3                             |  69
 KMail/1.9.4                             |  35
 KMail/1.8.2                             |  14
 KMail/1.8.3                             |  13
 KMail/1.9                               |  11
 KMail/1.8                               |   7
 Mutt/1.5.9i                             |   6
 Mutt/1.5.11                             |   5
 Thunderbird 1.5 (Windows/20051201)      |   4
(total number of distinct MUAs: 93, total number of distinct senders: 606, total number of messages: 3522)

For the general mysql list, Outlook and Thunderbird under Windows seem to be the most popular among posters:

                 ua                      | cnt
-----------------------------------------+-----
 Microsoft Office Outlook 11             |  48
 Thunderbird 1.5.0.5 (Windows/20060719)  |  44
 Thunderbird 1.5 (Windows/20051201)      |  37
 Microsoft Outlook Express 6.00.2900.286 |  32
 Thunderbird 1.5.0.4 (Windows/20060516)  |  25
 Microsoft Office Outlook, Build 11.0.55 |  24
 Mozilla Thunderbird 1.0.7 (Windows/2005 |  20
 Apple Mail (2.752.2)                    |  18
 Thunderbird 1.5 (X11/20051201)          |  16
 Microsoft Outlook Express 6.00.2900.218 |  16
 Thunderbird 1.5.0.2 (Windows/20060308)  |  16
(total number of distinct MUAs: 294, total number of distinct senders: 1590, total number of messages: 8233)
Daniel Vérité, manitou-mail.org