Differences

This shows you the differences between two versions of the page.

--- sql_analysis [2010/09/25 01:30] – daniel
+++ sql_analysis [2016/10/04 16:17] – [Duplicate messages] daniel
@@ Line 8: / Line 8: @@
 FROM header, (values ('X-Priority'), ('Importance'), ('Precedence'), ('Priority'), ('X-MSMail-Priority'), ('X-MS-Priority')) as h(field)
 WHERE position(E'\n'||field IN lines)>0
-group by split_part(substr(lines, 1+position(E'\n'||field in lines), 200), E'\n', 1)
+group by 1
 </code>
@@ Line 48: / Line 48: @@
  X-Priority: Normal        |     5
 </code>
+This alternative implementation uses regular expressions and differs in that it doesn't limit header values to 200 characters or any other fixed length.
+<code sql>
+select FIELD||':'||arr[1], count(*)
+ from (select FIELD, regexp_matches(lines, '(?:^|\n)' || FIELD || ':\s*([^\n]*)\n', 'g') as arr
+       from header,
+         (VALUES ('X-Priority'), ('Importance'), ('Precedence'), ('Priority'),
+             ('X-MSMail-Priority'), ('X-MS-Priority')) AS h(FIELD)
+             where strpos(lines,FIELD)>0) l
+GROUP BY 1 ORDER BY 1
+</code>
+In this version, the '' strpos(lines,FIELD)>0 '' condition is not essential: it only serves as a first-pass filter that discards the headers containing none of the searched fields.
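+As an illustration, the pattern can be tried on a literal string instead of the ''header'' table. This is only a sketch: the sample header text is made up, and it assumes ''standard_conforming_strings'' is on (the default in recent PostgreSQL versions), so that ''\n'' and ''\s'' reach the regular expression engine unchanged.
+<code sql>
+-- Hypothetical sample header; E'...' turns \n into real newlines in the data.
+select arr[1] as value
+ from regexp_matches(
+        E'From: someone@example.org\nX-Priority: 3 (Normal)\nSubject: test\n',
+        '(?:^|\n)' || 'X-Priority' || ':\s*([^\n]*)\n', 'g') as arr;
+-- expected to return a single row containing: 3 (Normal)
+</code>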
 ====== Duplicate messages ======
 This query finds each message that shares exactly the same headers as another message with a lower mail_id, which means that it is a duplicate.
@@ Line 53: / Line 66: @@
 select h1.mail_id from header h1, header h2 where h1.lines=h2.lines and h1.mail_id > h2.mail_id
 </code>
+A stricter version, comparing the md5 hashes of bodies (text and html parts) in addition to the headers:
+<code sql>
+select b2.mail_id from body b1, body b2, header h1, header h2
+ where b1.mail_id < b2.mail_id
+ and h1.mail_id = b1.mail_id
+ and h2.mail_id = b2.mail_id
+ and md5(h1.lines) = md5(h2.lines)
+ and md5(b1.bodytext) is not distinct from md5(b2.bodytext)
+ and md5(b1.bodyhtml) is not distinct from md5(b2.bodyhtml);
+</code>
+The ''IS NOT DISTINCT FROM'' comparator treats two NULLs as equal, so it behaves as expected when ''bodytext'' or ''bodyhtml'' is NULL, as opposed
+to the simple equality operator, for which ''NULL=NULL'' yields NULL rather than true.
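+The difference can be checked with constants only, without touching any table. This is a minimal sketch; the column aliases are arbitrary names chosen for the example.
+<code sql>
+-- md5() of NULL is NULL, so both sides of the comparison can be NULL.
+-- Plain equality then yields NULL (the row would be rejected by WHERE),
+-- whereas IS NOT DISTINCT FROM yields true.
+select (md5(NULL::text) = md5(NULL::text)) as equals_result,
+       (md5(NULL::text) is not distinct from md5(NULL::text)) as not_distinct_result;
+</code>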
+====== Messages with specific attachment types ======
+To retrieve all messages containing PDF files or image files of any type:
+<code sql>
+select distinct mail_id FROM attachments WHERE content_type='application/pdf' OR content_type like 'image/%';
+</code>
+====== Messages sent or received today ======
+<code sql>
+select mail_id from mail where msg_date>=date_trunc('day',now());
+</code>
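+The same ''date_trunc'' approach can be bounded on both sides to select another day. As a sketch, assuming ''msg_date'' is a timestamp column as in the query above, this would retrieve the messages of the previous day:
+<code sql>
+-- From yesterday at midnight (inclusive) to today at midnight (exclusive).
+select mail_id from mail
+ where msg_date >= date_trunc('day', now()) - interval '1 day'
+   and msg_date <  date_trunc('day', now());
+</code>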