Starting from version 0.9.9, identical attachment contents are automatically shared within the database instead of being duplicated. Actually, the db schema up to now almost allowed to share those contents, since they're referred to by their large object id, but the code to identify duplicates wasn't there yet, as well as the fingerprint column in the attachment_contents table.
The purpose of this article is to show how to find and merge identical attachments in an already existing database (that is, populated prior to 0.9.9), and how much size can be gained with an example on a real production database.
Our database contains about 26,000 messages, the sizes of the main tables (indexes included) being:
----------------------------------- addresses : 23.04 MB attachment_contents : 10.73 MB attachments : 4.52 MB body : 24.95 MB header : 37.27 MB inverted_word_index : 0.02 MB mail : 2.88 MB mail_addresses : 20.21 MB mail_tags : 6.46 MB pg_largeobject : 2328.48 MB raw_mail : 0.02 MB trashed_mail : 6.66 MB words : 7.47 MB ----------------------------------- Total database size : 2504 MBLet's see how many attachments and large objects we have:
$ pql -d cfmail cfmail=# select count(*) from attachments; count ------- 24598 (1 row) cfmail=# select count(distinct loid) from pg_largeobject; count ------- 24563 (1 row)The slight difference between both numbers is due to some attachments having no actual content (content_size=0).
That's done with the recently introduced manitou-mgr command, using the --action=hash-attachments option.
The effect of this action is to update the fingerprint column for all attachment_contents entries for which the fingerprint is unknown (null).
$ manitou-mgr --conf=mdx.conf --action=hash-attachments (...lots of output...)
Also done with manitou-mgr, this time with the --action=merge-attachments option.
The steps performed by this command are: for each set of attachments having
identical fingerprints, delete all the associated large objects except the first
one, and make all attachments of the set point to this one.
Note that no row from attachment_contents is deleted in the
process, but the content column is updated.
$ manitou-mgr --conf=mdx.conf --action=merge-attachments (...lots of output...)
After that, if we want to know how much space has been freed, a vacuum full of the pg_largeobject table is in order, since the potentially massive deletion of these large objects leads to dead space that is not immediately freed. vacuum full will reclaim that space (at the expense of basically locking and rewriting the whole table). Also a reindex is useful, too. We run those commands on pg_largeobject and attachment_contents since the latter has been heavily updated.
$ psql -d cfmail cfmail=# vacuum full analyze pg_largeobject; VACUUM cfmail=# reindex table pg_largeobject; REINDEX cfmail=# vacuum full analyze attachment_contents; VACUUM cfmail=# reindex table attachment_contents; REINDEXThe size of pg_largeobject table is now:
$ psql -d cfmail
cfmail=# select pg_size_pretty(pg_total_relation_size('pg_largeobject'));
pg_size_pretty
----------------
1290 MB
(1 row)
So that's 1.3 GB instead of the initial 2.3 GB. There was 1 GB of
duplicate attachments in this database.
What are these identical attachments? For the curious, a way to find out is to fire up the user interface, choose "Selection/New Query" or hit F2, and enter in the SQL field:
SELECT a.mail_id
FROM attachment_contents ac, attachments a
WHERE a.attachment_id=ac.attachment_id
AND fingerprint in
(SELECT fingerprint from
(SELECT fingerprint,count(*) AS cnt
FROM attachment_contents
GROUP BY fingerprint
HAVING count(*)>1 ORDER BY cnt desc LIMIT 10) AS l)
That query will basically retrieve the messages that have the most shared attachments.
The HAVING count(*)>1 and LIMIT 10 clauses may be tweaked if there are too much results, or not enough (once results are shown, use "Selection/Modify query" to make
the selection dialog reappear).