De-duping attachments

Wed Sep 15 07:52:08 EDT 2010

  Great thread.  Here are some real-world numbers based on our spools 
here at BU.

One of our masters has 4,800 users, 22,000 mailboxes, and is using about 
374G of disk.

Based on the md5 files for these users there are 6,046,363 messages.  If 
I take the first md5 value (the md5 of the whole message, if I understand 
this correctly), then sort and uniq, I get 5,891,974 distinct messages.  
So if we deduped all of those, we would shrink to 97.4% of the original 
number of messages.  Assuming an even distribution of message sizes, 
374G would drop to about 364.45G, a saving of a bit under 10G.  
Unfortunately not an obvious huge win.
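
For anyone who wants to plug in their own spool numbers, here is the 
same estimate as a small Python sketch.  The variable names are made up 
for illustration, and it rests on the same assumption as above: message 
sizes are evenly distributed, so disk usage scales with message count.

```python
# Figures quoted above for one master.
total_msgs = 6_046_363    # messages counted from the md5 files
unique_msgs = 5_891_974   # distinct first-md5 values after sort | uniq
spool_gb = 374.0          # current disk usage in G

# Fraction of messages left if every duplicate were collapsed.
ratio = unique_msgs / total_msgs

# Assumption: even distribution of message sizes, so bytes scale with count.
estimated_gb = spool_gb * ratio
saved_gb = spool_gb - estimated_gb

print(f"unique fraction: {ratio:.1%}")
print(f"estimated size:  {estimated_gb:.2f}G")
print(f"saved:           {saved_gb:.2f}G")
```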

But I think the md5 of the message file includes the headers, which are 
more likely to be unique than the body content.  (Due to legacy support 
for UW IMAP, we often end up routing things differently for users on the 
same master, so the headers for the same message sent to 2 people could 
be different.)
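
To illustrate the header problem, here is a rough Python sketch that 
hashes only the body of a message, so two copies that took different 
routes still compare equal.  The helper names split_message and 
body_md5 are hypothetical, and this is not how the existing md5 files 
are produced; it simply splits at the first blank line, which is where 
headers end in an RFC 5322 message.

```python
import hashlib


def split_message(raw):
    """Split raw message bytes into (headers, body) at the first blank line."""
    hits = [(raw.find(sep), sep) for sep in (b"\r\n\r\n", b"\n\n")]
    hits = [(pos, sep) for pos, sep in hits if pos != -1]
    if not hits:
        # No blank line at all: treat everything as headers.
        return raw, b""
    pos, sep = min(hits)  # earliest separator wins
    return raw[:pos], raw[pos + len(sep):]


def body_md5(raw):
    """Hex MD5 of the message body only, ignoring the headers."""
    _headers, body = split_message(raw)
    return hashlib.md5(body).hexdigest()


# Two copies of the same message, delivered via different (made-up) hosts.
copy_a = b"Received: from hostA\r\nSubject: hi\r\n\r\nSame body text\r\n"
copy_b = b"Received: from hostB\r\nSubject: hi\r\n\r\nSame body text\r\n"

whole_differs = hashlib.md5(copy_a).hexdigest() != hashlib.md5(copy_b).hexdigest()
body_matches = body_md5(copy_a) == body_md5(copy_b)
print(whole_differs, body_matches)
```

Running the same sort and uniq over body digests like these would show 
how much of the duplication the whole-file md5 is hiding.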

Isn't the easy hack for dedup just looking at the above md5 files and 
then hard-linking the messages whose digests match?  This could be done 
by a nightly trawl of the spool space.  A bigger win would be to 
separate the headers from the message bodies, but that's a lot more work.
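
A minimal Python sketch of what such a nightly trawl might look like 
follows.  Several things here are assumptions rather than anything 
taken from our setup: it computes its own MD5 digests instead of 
reading the existing md5 files (whose format I am not relying on), the 
function names are made up, it treats every regular file under the 
given root as a candidate, and it assumes the IMAP server is happy with 
message files sharing an inode (which also means sharing mtime and 
permissions).  It defaults to a dry run.

```python
import filecmp
import hashlib
import os
from collections import defaultdict


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def dedup_tree(root, dry_run=True):
    """Hard-link identical files under root; return bytes (potentially) saved."""
    by_digest = defaultdict(list)
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            by_digest[file_md5(path)].append(path)

    saved = 0
    for paths in by_digest.values():
        keeper = paths[0]
        keep = os.stat(keeper)
        for dup in paths[1:]:
            st = os.stat(dup)
            # Hard links cannot cross filesystems.
            if st.st_dev != keep.st_dev:
                continue
            # Already linked to the keeper: nothing to do.
            if st.st_ino == keep.st_ino:
                continue
            # Guard against digest collisions with a full byte comparison.
            if not filecmp.cmp(keeper, dup, shallow=False):
                continue
            # Upper bound: no space is freed if dup has other links.
            saved += st.st_size
            if not dry_run:
                tmp = dup + ".dedup-tmp"
                os.link(keeper, tmp)   # second name for the keeper's inode
                os.replace(tmp, dup)   # atomically swap it in for the duplicate
    return saved
```

Calling dedup_tree("/path/to/spool") only reports what could be 
reclaimed; passing dry_run=False does the linking.  Going through a 
temporary name and os.replace means a reader never sees the message 
file missing partway through.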

-nik