User Tools

Site Tools


sql_analysis_dedup_attachments

This is an old revision of the document!


Attachments deduplication

Attachments are stored into 3 tables: attachments, attachment_contents and the pg_largeobject system table.

At import time, the SHA1 fingerprint of attachments are computed and compared to contents already in the database. When a match is found, a new reference is created in attachment_contents to the already existing OID (large object reference) instead of importing again the same contents.

This can save significant amounts of disk space, especially when the mail database has accounts within the same organization where documents tend to be sent to multiple persons. Logo pictures in signatures are also good candidates for deduplication.

The following query shows how much total space in bytes is saved from attachments deduplication:

SELECT SUM(sz) FROM (SELECT fingerprint,content_size*(COUNT(*)-1) AS sz
 FROM attachment_contents ac JOIN attachments a USING(attachment_id)
 GROUP BY fingerprint,content_size
 HAVING COUNT(*)>1) s1;

This query retrieves, within the top 30 of the most deduplicated attachments (in number of occurrences), the count and sizes of attachments with file names and a significant size:

SELECT filename,COUNT(*),content_size
 FROM attachments a JOIN attachment_contents ac USING(attachment_id)
 JOIN (SELECT fingerprint,COUNT(*) AS cnt FROM attachment_contents
    GROUP BY fingerprint ORDER BY 2 DESC LIMIT 30) s1
     USING(fingerprint)
  WHERE content_size>50000
  GROUP BY filename,content_size;
sql_analysis_dedup_attachments.1409605294.txt.gz · Last modified: 2014/09/01 21:01 by daniel