De-duping attachments
Nik Conwell
nik at bu.edu
Wed Sep 15 07:52:08 EDT 2010
Great thread. Here are some real-world numbers based on our spools
here at BU.
One of our masters has 4,800 users, 22,000 mailboxes, and is using about
374G of disk.
Based on the md5 files for these users there are 6,046,363 messages. If
I take the first md5 value (the md5 of the whole message, if I
understand this correctly) and run sort and uniq, I get 5,891,974
distinct messages, so deduping all of those would shrink the count to
97.4% of the original. Assuming an even distribution of message sizes,
374G would drop to about 364.5G, a saving of roughly 9.5G.
Unfortunately not an obvious huge win.
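
For anyone who wants to rerun the estimate with their own counts, here
is a minimal Python sketch of the arithmetic. The three input figures
are the ones quoted above; the variable names are only illustrative.
Carrying the unrounded ratio through gives a slightly different result
than multiplying by a rounded 97%.

```python
# Figures quoted in the post above.
total_messages = 6_046_363
unique_messages = 5_891_974
spool_gb = 374.0

# Fraction of messages that would remain after de-duplication.
remaining = unique_messages / total_messages

# Assumes message sizes are evenly distributed, as the post does.
estimated_gb = spool_gb * remaining
saved_gb = spool_gb - estimated_gb

print(f"remaining: {remaining:.1%}")
print(f"estimated spool: {estimated_gb:.2f}G (saves {saved_gb:.2f}G)")
```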
But I think the md5 of the message file includes the headers, which are
more likely to be unique than the body content. (Due to legacy support
for UW IMAP, we often end up routing things differently for users on the
same master, so the headers for the same message sent to two people
could be different.)
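
One way to check how much the headers matter would be to hash only the
body. A rough Python sketch follows, assuming the body is everything
after the first blank line of the raw message; body_md5 is a
hypothetical helper, not anything Cyrus provides.

```python
import hashlib


def body_md5(raw_message: bytes) -> str:
    """MD5 hex digest of the bytes after the first blank line."""
    # Find the earliest header/body separator, CRLF or bare LF style.
    hits = []
    for sep in (b"\r\n\r\n", b"\n\n"):
        pos = raw_message.find(sep)
        if pos != -1:
            hits.append((pos, len(sep)))
    if not hits:
        # No blank line at all: treat the message as having an empty body.
        return hashlib.md5(b"").hexdigest()
    pos, sep_len = min(hits)
    return hashlib.md5(raw_message[pos + sep_len:]).hexdigest()


# Same body, different routing headers (made-up example data).
msg_a = b"Received: from relay1\r\nSubject: hi\r\n\r\nSame body\r\n"
msg_b = b"Received: from relay2\r\nSubject: hi\r\n\r\nSame body\r\n"

print(hashlib.md5(msg_a).hexdigest() == hashlib.md5(msg_b).hexdigest())
print(body_md5(msg_a) == body_md5(msg_b))
```

Counting distinct body digests instead of whole-message digests would
show whether the 97.4% figure understates the real duplication.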
Isn't the easy hack for dedup just looking at the above md5 files and
then doing appropriate hard links? This could be done by a nightly
trawl of the spool space. A bigger win would be to separate the headers
from the messages but that's a lot more work.
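
A rough Python sketch of what such a nightly trawl might look like.
Assumptions, since none of this is spelled out above: it hashes the
files itself rather than reading the md5 files (whose format is not
shown here), it expects the spool to be quiet while it runs, and it
relies on message files never being modified in place once they share
an inode. The function names and the ".dedup-tmp" suffix are made up
for illustration.

```python
import filecmp
import hashlib
import os
from collections import defaultdict


def file_md5(path, chunk_size=1 << 20):
    """MD5 hex digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def hardlink_duplicates(spool_root, dry_run=True):
    """Hard-link identical files under spool_root; return bytes reclaimed."""
    by_digest = defaultdict(list)
    for dirpath, _dirs, files in os.walk(spool_root):
        for name in files:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and not os.path.islink(path):
                key = (os.path.getsize(path), file_md5(path))
                by_digest[key].append(path)

    reclaimed = 0
    for paths in by_digest.values():
        keeper = paths[0]
        keeper_stat = os.stat(keeper)
        for dup in paths[1:]:
            dup_stat = os.stat(dup)
            if dup_stat.st_dev != keeper_stat.st_dev:
                continue  # hard links cannot cross filesystems
            if dup_stat.st_ino == keeper_stat.st_ino:
                continue  # already linked to the keeper
            if not filecmp.cmp(keeper, dup, shallow=False):
                continue  # digest matched but bytes differ; leave alone
            if dup_stat.st_nlink == 1:
                reclaimed += dup_stat.st_size  # last name for these blocks
            if not dry_run:
                # Link under a temporary name, then rename over the
                # duplicate so the path never disappears.
                tmp = dup + ".dedup-tmp"
                os.link(keeper, tmp)
                os.replace(tmp, dup)
    return reclaimed
```

Running it with dry_run=True first gives an estimate of the saving
without touching anything on disk.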
-nik