Dealing with duplicate attachments

Date: Thu, 13 Dec 2007
Author: Daniel Vérité
Applies to: Manitou-Mail, 0.9.9

Starting from version 0.9.9, identical attachment contents are automatically shared within the database instead of being duplicated. Actually, the db schema up to now almost allowed to share those contents, since they're referred to by their large object id, but the code to identify duplicates wasn't there yet, as well as the fingerprint column in the attachment_contents table.

The purpose of this article is to show how to find and merge identical attachments in an already existing database (that is, populated prior to 0.9.9), and how much size can be gained with an example on a real production database.

Initial database size

Our database contains about 26,000 messages, the sizes of the main tables (indexes included) being:

-----------------------------------
addresses           :    23.04 MB
attachment_contents :    10.73 MB
attachments         :     4.52 MB
body                :    24.95 MB
header              :    37.27 MB
inverted_word_index :     0.02 MB
mail                :     2.88 MB
mail_addresses      :    20.21 MB
mail_tags           :     6.46 MB
pg_largeobject      :  2328.48 MB
raw_mail            :     0.02 MB
trashed_mail        :     6.66 MB
words               :     7.47 MB
-----------------------------------
Total database size :     2504 MB
Let's see how many attachments and large objects we have:
$ pql -d cfmail
cfmail=# select count(*) from attachments;
 count 
-------
 24598
(1 row)

cfmail=# select count(distinct loid) from pg_largeobject;
 count 
-------
 24563
(1 row)
The slight difference between both numbers is due to some attachments having no actual content (content_size=0).

Computing the fingerprints

That's done with the recently introduced manitou-mgr command, using the --action=hash-attachments option.
The effect of this action is to update the fingerprint column for all attachment_contents entries for which the fingerprint is unknown (null).

$ manitou-mgr --conf=mdx.conf --action=hash-attachments
(...lots of output...)

Merging identical attachments

Also done with manitou-mgr, this time with the --action=merge-attachments option.
The steps performed by this command are: for each set of attachments having identical fingerprints, delete all the associated large objects except the first one, and make all attachments of the set point to this one.
Note that no row from attachment_contents is deleted in the process, but the content column is updated.

$ manitou-mgr --conf=mdx.conf --action=merge-attachments
(...lots of output...)

Results

After that, if we want to know how much space has been freed, a vacuum full of the pg_largeobject table is in order, since the potentially massive deletion of these large objects leads to dead space that is not immediately freed. vacuum full will reclaim that space (at the expense of basically locking and rewriting the whole table). Also a reindex is useful, too. We run those commands on pg_largeobject and attachment_contents since the latter has been heavily updated.

$ psql -d cfmail
cfmail=# vacuum full analyze pg_largeobject;
VACUUM
cfmail=# reindex table pg_largeobject;
REINDEX
cfmail=# vacuum full analyze attachment_contents;
VACUUM
cfmail=# reindex table attachment_contents;
REINDEX
The size of pg_largeobject table is now:
$ psql -d cfmail
cfmail=# select pg_size_pretty(pg_total_relation_size('pg_largeobject'));
 pg_size_pretty 
----------------
 1290 MB
(1 row)
So that's 1.3 GB instead of the initial 2.3 GB. There was 1 GB of duplicate attachments in this database.

Looking at the duplicates

What are these identical attachments? For the curious, a way to find out is to fire up the user interface, choose "Selection/New Query" or hit F2, and enter in the SQL field:

SELECT a.mail_id
 FROM attachment_contents ac, attachments a
  WHERE a.attachment_id=ac.attachment_id
 AND fingerprint in
  (SELECT fingerprint from
   (SELECT fingerprint,count(*) AS cnt
     FROM attachment_contents
     GROUP BY fingerprint
      HAVING count(*)>1 ORDER BY cnt desc LIMIT 10) AS l)
That query will basically retrieve the messages that have the most shared attachments. The HAVING count(*)>1 and LIMIT 10 clauses may be tweaked if there are too much results, or not enough (once results are shown, use "Selection/Modify query" to make the selection dialog reappear).
Vcard files (.vcf), company logos, attached signatures appear to be good to be merged together, but there may also be occurrences of bigger files such as reference documents, photos, files that are sent over and over.