This shows you the differences between two versions of the page.
Next revision | Previous revision | ||
sql_analysis [2010/08/13 00:52] daniel created |
sql_analysis [2019/01/24 18:55] daniel [Presence and count of specific mail headers] |
||
---|---|---|---|
Line 1: | Line 1: | ||
- | ====== Presence and count of certain headers ====== | + | ====== Presence and count of specific mail headers ====== |
+ | This query extracts, for each of the header fields from a fixed list, the number of occurrences of all distinct values of the field. | ||
<code sql> | <code sql> | ||
Line 7: | Line 8: | ||
FROM header, (values ('X-Priority'), ('Importance'), ('Precedence'), ('Priority'), ('X-MSMail-Priority'), ('X-MS-Priority')) as h(field) | FROM header, (values ('X-Priority'), ('Importance'), ('Precedence'), ('Priority'), ('X-MSMail-Priority'), ('X-MS-Priority')) as h(field) | ||
WHERE position(E'\n'||field IN lines)>0 | WHERE position(E'\n'||field IN lines)>0 | ||
- | group by split_part(substr(lines, 1+position(E'\n'||field in lines), 200), E'\n', 1) | + | group by 1 |
</code> | </code> | ||
+ | |||
+ | Sample output: | ||
+ | <code> | ||
+ | ct | count | ||
+ | ---------------------------+------- | ||
+ | Importance: high | 32 | ||
+ | Importance: High | 130 | ||
+ | Importance: low | 1 | ||
+ | Importance: Medium | 3 | ||
+ | Importance: normal | 44 | ||
+ | Importance: Normal | 802 | ||
+ | Precedence: bulk | 19987 | ||
+ | Precedence: bulk | 1 | ||
+ | Precedence: fm-user | 3 | ||
+ | Precedence: junk | 3 | ||
+ | Precedence: list | 4026 | ||
+ | Priority: non-urgent | 11 | ||
+ | Priority: Non-Urgent | 1 | ||
+ | Priority: normal | 464 | ||
+ | Priority: urgent | 603 | ||
+ | Priority: Urgent | 12 | ||
+ | X-MSMail-Priority: High | 121 | ||
+ | X-MSMail-Priority: Low | 1 | ||
+ | X-MSMail-Priority: Lowest | 11 | ||
+ | X-MSMail-Priority: Medium | 3 | ||
+ | X-MSMail-Priority: Middle | 8 | ||
+ | X-MSMail-Priority: Normal | 923 | ||
+ | X-Priority: 0 | 1 | ||
+ | X-Priority: 1 | 102 | ||
+ | X-Priority: 1 (High) | 2 | ||
+ | X-Priority: 1 (Highest) | 120 | ||
+ | X-Priority: 2 | 6 | ||
+ | X-Priority: 2 (High) | 1 | ||
+ | X-Priority: 3 | 2892 | ||
+ | X-Priority: 3 (Normal) | 635 | ||
+ | X-Priority: 5 | 14 | ||
+ | X-Priority: Normal | 5 | ||
+ | </code> | ||
+ | |||
+ | This alternative implementation uses regular expressions and differs in that it doesn't limit header values to 200 characters or any other fixed length. | ||
+ | <code sql> | ||
+ | select FIELD||':'||arr[1], count(*) | ||
+ | from (select FIELD, regexp_matches(lines, '(?:^|\n)' || FIELD || ':\s*([^\n]*)', 'gi') as arr | ||
+ | from header, | ||
+ | (VALUES ('X-Priority'), ('Importance'), ('Precedence'), ('Priority'), | ||
+ | ('X-MSMail-Priority'), ('X-MS-Priority')) AS h(FIELD) | ||
+ | where strpos(lines,FIELD)>0) l | ||
+ | GROUP BY 1 ORDER BY 1 | ||
+ | </code> | ||
+ | |||
+ | In this version, the '' strpos(lines,FIELD)>0 '' condition is not essential: it's introduced only as a first-pass filter to eliminate the headers that don't contain anywhere any of the searched fields. | ||
+ | ====== Duplicate messages ====== | ||
+ | This query finds each message that share the exact same headers than another message with a lower mail_id, which means that it's a duplicate. | ||
+ | <code sql> | ||
+ | select h1.mail_id from header h1, header h2 where h1.lines=h2.lines and h1.mail_id > h2.mail_id | ||
+ | </code> | ||
+ | |||
+ | A stricter version, comparing the md5 hashes of bodies (text and html parts) in addition to the headers: | ||
+ | <code sql> | ||
+ | select b2.mail_id from body b1, body b2, header h1, header h2 | ||
+ | where b1.mail_id < b2.mail_id | ||
+ | and h1.mail_id = b1.mail_id | ||
+ | and h2.mail_id = b2.mail_id | ||
+ | and md5(h1.lines) = md5(h2.lines) | ||
+ | and md5(b1.bodytext) is not distinct from md5(b2.bodytext) | ||
+ | and md5(b1.bodyhtml) is not distinct from md5(b2.bodyhtml); | ||
+ | </code> | ||
+ | |||
+ | The ''IS NOT DISTINCT'' comparator behaves as expected when ''bodytext'' or ''bodyhtml'' is NULL, as opposed | ||
+ | to the simple equality operator, for which ''NULL=NULL'' is false | ||
+ | |||
+ | ====== Hierarchical view of tags ====== | ||
+ | |||
+ | Output a list of tags sorted by hierarchy level and names, | ||
+ | with an indentation to represent the hierarchy. | ||
+ | |||
+ | <code sql> | ||
+ | WITH RECURSIVE tagr(a,_tag_id,name,level) as ( | ||
+ | select | ||
+ | array[row_number() over (order by name)] as a, | ||
+ | tag_id, | ||
+ | name, | ||
+ | 1 as level | ||
+ | from tags where parent_id is null | ||
+ | UNION ALL | ||
+ | select | ||
+ | tagr.a || row_number() over (order by tags.name), | ||
+ | tag_id, | ||
+ | tags.name, | ||
+ | tagr.level+1 | ||
+ | FROM tags JOIN tagr ON tagr._tag_id=tags.parent_id | ||
+ | ) | ||
+ | select repeat(' ', level-1) || name from tagr order by a; | ||
+ | |||
+ | </code> | ||
+ | |||