====== Presence and count of specific mail headers ====== This query extracts, for each of the header fields from a fixed list, the number of occurrences of all distinct values of the field. SELECT split_part(substr(lines, 1+position(E'\n'||field in lines), 200), E'\n', 1) as ct, count(*) FROM header, (values ('X-Priority'), ('Importance'), ('Precedence'), ('Priority'), ('X-MSMail-Priority'), ('X-MS-Priority')) as h(field) WHERE position(E'\n'||field IN lines)>0 group by 1 Sample output: ct | count ---------------------------+------- Importance: high | 32 Importance: High | 130 Importance: low | 1 Importance: Medium | 3 Importance: normal | 44 Importance: Normal | 802 Precedence: bulk | 19987 Precedence: bulk | 1 Precedence: fm-user | 3 Precedence: junk | 3 Precedence: list | 4026 Priority: non-urgent | 11 Priority: Non-Urgent | 1 Priority: normal | 464 Priority: urgent | 603 Priority: Urgent | 12 X-MSMail-Priority: High | 121 X-MSMail-Priority: Low | 1 X-MSMail-Priority: Lowest | 11 X-MSMail-Priority: Medium | 3 X-MSMail-Priority: Middle | 8 X-MSMail-Priority: Normal | 923 X-Priority: 0 | 1 X-Priority: 1 | 102 X-Priority: 1 (High) | 2 X-Priority: 1 (Highest) | 120 X-Priority: 2 | 6 X-Priority: 2 (High) | 1 X-Priority: 3 | 2892 X-Priority: 3 (Normal) | 635 X-Priority: 5 | 14 X-Priority: Normal | 5 This alternative implementation uses regular expressions and differs in that it doesn't limit header values to 200 characters or any other fixed length. select FIELD||':'||arr[1], count(*) from (select FIELD, regexp_matches(lines, '(?:^|\n)' || FIELD || ':\s*([^\n]*)', 'gi') as arr from header, (VALUES ('X-Priority'), ('Importance'), ('Precedence'), ('Priority'), ('X-MSMail-Priority'), ('X-MS-Priority')) AS h(FIELD) where strpos(lines,FIELD)>0) l GROUP BY 1 ORDER BY 1 In this version, the '' strpos(lines,FIELD)>0 '' condition is not essential: it's introduced only as a first-pass filter to eliminate the headers that don't contain anywhere any of the searched fields. ====== Duplicate messages ====== This query finds each message that share the exact same headers than another message with a lower mail_id, which means that it's a duplicate. select h1.mail_id from header h1, header h2 where h1.lines=h2.lines and h1.mail_id > h2.mail_id A stricter version, comparing the md5 hashes of bodies (text and html parts) in addition to the headers: select b2.mail_id from body b1, body b2, header h1, header h2 where b1.mail_id < b2.mail_id and h1.mail_id = b1.mail_id and h2.mail_id = b2.mail_id and md5(h1.lines) = md5(h2.lines) and md5(b1.bodytext) is not distinct from md5(b2.bodytext) and md5(b1.bodyhtml) is not distinct from md5(b2.bodyhtml); The ''IS NOT DISTINCT'' comparator behaves as expected when ''bodytext'' or ''bodyhtml'' is NULL, as opposed to the simple equality operator, for which ''NULL=NULL'' is false ====== Hierarchical view of tags ====== Output a list of tags sorted by hierarchy level and names, with an indentation to represent the hierarchy. WITH RECURSIVE tagr(a,_tag_id,name,level) as ( select array[row_number() over (order by name)] as a, tag_id, name, 1 as level from tags where parent_id is null UNION ALL select tagr.a || row_number() over (order by tags.name), tag_id, tags.name, tagr.level+1 FROM tags JOIN tagr ON tagr._tag_id=tags.parent_id ) select repeat(' ', level-1) || name from tagr order by a;