User Tools

Site Tools


sql_analysis

Differences

This shows you the differences between two versions of the page.

Link to this comparison view

Next revision
Previous revision
Next revisionBoth sides next revision
sql_analysis [2010/08/12 22:52] – created danielsql_analysis [2016/07/21 04:00] – [Presence and count of specific mail headers] daniel
Line 1: Line 1:
- ====== Presence and count of certain headers ======+ ====== Presence and count of specific mail headers ====== 
 +This query extracts, for each of the header fields from a fixed list, the number of occurrences of all distinct values of the field.
  
 <code sql> <code sql>
Line 7: Line 8:
 FROM header, (values ('X-Priority'), ('Importance'), ('Precedence'), ('Priority'), ('X-MSMail-Priority'), ('X-MS-Priority')) as h(field) FROM header, (values ('X-Priority'), ('Importance'), ('Precedence'), ('Priority'), ('X-MSMail-Priority'), ('X-MS-Priority')) as h(field)
 WHERE position(E'\n'||field IN lines)>0 WHERE position(E'\n'||field IN lines)>0
-group by split_part(substr(lines, 1+position(E'\n'||field in lines), 200), E'\n', 1)+group by 1
 </code> </code>
 +
 +Sample output:
 +<code>
 +            ct             | count 
 +---------------------------+-------
 + Importance: high          |    32
 + Importance: High          |   130
 + Importance: low               1
 + Importance: Medium        |     3
 + Importance: normal        |    44
 + Importance: Normal        |   802
 + Precedence: bulk          | 19987
 + Precedence: bulk          |     1
 + Precedence: fm-user           3
 + Precedence: junk          |     3
 + Precedence: list          |  4026
 + Priority: non-urgent      |    11
 + Priority: Non-Urgent      |     1
 + Priority: normal          |   464
 + Priority: urgent          |   603
 + Priority: Urgent          |    12
 + X-MSMail-Priority: High     121
 + X-MSMail-Priority: Low    |     1
 + X-MSMail-Priority: Lowest |    11
 + X-MSMail-Priority: Medium |     3
 + X-MSMail-Priority: Middle |     8
 + X-MSMail-Priority: Normal |   923
 + X-Priority: 0                 1
 + X-Priority: 1               102
 + X-Priority: 1 (High)      |     2
 + X-Priority: 1 (Highest)     120
 + X-Priority: 2                 6
 + X-Priority: 2 (High)      |     1
 + X-Priority: 3              2892
 + X-Priority: 3 (Normal)    |   635
 + X-Priority: 5                14
 + X-Priority: Normal        |     5
 +</code>
 +
 +This alternative implementation uses regular expressions and differs in that it doesn't limit header values to 200 characters or any other fixed length.
 +<code sql>
 +select FIELD||':'||arr[1], count(*)
 + from (select FIELD, regexp_matches(lines, '(?:^|\n)' || FIELD || ':\s*([^\n]*)\n', 'g') as arr
 +       from header, 
 +         (VALUES ('X-Priority'), ('Importance'), ('Precedence'), ('Priority'),
 +             ('X-MSMail-Priority'), ('X-MS-Priority')) AS h(FIELD)
 +             where strpos(lines,FIELD)>0) l
 +GROUP BY 1 ORDER BY 1
 +</code>
 +
 +In this version, the '' strpos(lines,FIELD)>0 '' condition is not essential: it's introduced only as a first-pass filter to eliminate the headers that don't contain anywhere any of the searched fields.
 +====== Duplicate messages ======
 +This query finds each message that share the exact same headers than another message with a lower mail_id, which means that it's a duplicate.
 +<code sql>
 +select h1.mail_id from header h1, header h2 where h1.lines=h2.lines and h1.mail_id > h2.mail_id
 +</code>
 +
 +====== Messages with specific attachment types ======
 +To retrieve all messages containing pdf files or any image file:
 +<code sql>
 +select distinct mail_id FROM attachments WHERE content_type='application/pdf' OR content_type like 'image/%';
 +</code>
 +
 +====== Messages sent or received today  ======
 +<code>
 +select mail_id from mail where msg_date>=date_trunc('day',now());
 +</code>
 +
  
sql_analysis.txt · Last modified: 2019/01/24 17:55 by daniel