De-duping attachments

Shuvam Misra shuvam.misra at merceworld.com
Tue Sep 14 21:40:41 EDT 2010


How difficult or easy would it be to modify Cyrus to strip all
attachments from emails and store them separately in files? In the
message file, replace the attachment with a special tag which will point
to the attachment file. Whenever the message is fetched for any reason,
the original MIME-encoded message will be re-constructed and delivered.
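
To make the strip-and-rebuild idea concrete, here is a minimal sketch in
Python (not Cyrus code). The tag format X-ATTACHMENT-REF: and the function
names are my own inventions for illustration, and a plain dict stands in
for the separate attachment files. It keeps each attachment in its
transfer-encoded form (base64 or quoted-printable text), so the body can
be put back without re-encoding.

```python
import hashlib
from email import message_from_bytes
from email.mime.application import MIMEApplication
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

TAG = "X-ATTACHMENT-REF:"  # hypothetical tag format, not a Cyrus convention


def is_attachment(part):
    return (not part.is_multipart()
            and part.get_content_disposition() == "attachment")


def strip_attachments(raw, store):
    """Replace each attachment body with a tag; keep the body in `store`."""
    msg = message_from_bytes(raw)
    for part in msg.walk():
        if not is_attachment(part):
            continue
        encoded = part.get_payload()  # still base64/QP text, not decoded
        key = hashlib.md5(
            encoded.encode("utf-8", "surrogateescape")).hexdigest()
        store.setdefault(key, encoded)  # an existing copy is reused
        part.set_payload(TAG + key + "\n")
    return msg.as_bytes()


def restore_attachments(stripped, store):
    """Put the stored bodies back in place of the tags."""
    msg = message_from_bytes(stripped)
    for part in msg.walk():
        if not is_attachment(part):
            continue
        body = part.get_payload()
        if body.startswith(TAG):
            part.set_payload(store[body[len(TAG):].strip()])
    return msg.as_bytes()


def attachment_contents(raw):
    """Decoded bytes of every attachment in a raw message."""
    msg = message_from_bytes(raw)
    return [p.get_payload(decode=True)
            for p in msg.walk() if is_attachment(p)]


# Small demonstration with a made-up message.
photo = b"pretend these are JPEG bytes " * 40
original = MIMEMultipart()
original.attach(MIMEText("See attached."))
attachment = MIMEApplication(photo)
attachment.add_header("Content-Disposition", "attachment",
                      filename="kids.jpg")
original.attach(attachment)

store = {}
raw = original.as_bytes()
stripped = strip_attachments(raw, store)
restored = restore_attachments(stripped, store)
```

This sketch only shows that the attachment content survives the round
trip. It does not promise a byte-for-byte identical message, which a real
implementation would probably want for the sake of cached sizes and
offsets.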

If this can be implemented, then the pointer in the message body could
be the attachment's MD5 sum or something similar. This would ensure
automatic de-dup --- if a file with the same MD5 already exists, I won't
store a second copy --- I'll just point to the existing file.
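
A rough sketch of such a content-addressed store, again in Python and
again only an illustration: store_attachment and load_attachment are
hypothetical names, and the flat directory layout is an assumption. MD5
is used because it is what I suggested above; a stronger hash such as
SHA-256 would slot in the same way if deliberate collisions are a worry.

```python
import hashlib
import os
import tempfile


def store_attachment(store_dir, data):
    """Store data under its MD5 hex digest; return the digest."""
    digest = hashlib.md5(data).hexdigest()
    final_path = os.path.join(store_dir, digest)
    if os.path.exists(final_path):
        return digest  # same content already stored: nothing to write
    # Write under a temporary name first, then rename into place, so a
    # reader never sees a half-written attachment file.
    fd, tmp_path = tempfile.mkstemp(dir=store_dir)
    with os.fdopen(fd, "wb") as f:
        f.write(data)
    os.replace(tmp_path, final_path)
    return digest


def load_attachment(store_dir, digest):
    """Fetch the attachment bytes that a digest points to."""
    with open(os.path.join(store_dir, digest), "rb") as f:
        return f.read()


# Storing the same bytes twice leaves a single file on disk.
with tempfile.TemporaryDirectory() as store_dir:
    first = store_attachment(store_dir, b"same JPEG bytes")
    second = store_attachment(store_dir, b"same JPEG bytes")
    files_on_disk = os.listdir(store_dir)
    fetched = load_attachment(store_dir, first)
```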

Today's de-duping of entire messages, based on message-ID, is a
wonderful facility. But this measure stops halfway -- it does not
avoid the enormous duplication when the same JPEG image of Sandra and
the kids, Word doc with sales forecasts, or PDF file is forwarded by
20 people in 20 separate messages to their friends and relatives
ad infinitum.

At the IMAP or POP protocol levels, no clients would see any change. But
on the server side, the server's disk space usage would drop sharply
and CPU usage would rise somewhat.

One problem I can see is tracking of reference counts to attachment
files. This intelligence would have to be built into the
attachment-stripping layer, and the reference count would have to be
decremented each time a message file is unlink()ed internally by imapd,
cyr_expire, etc. One simple way out of this would be to use the file
system itself --- create a separate name for each reference to an
attachment file, and hard-link these names to the single instance. Each
message file which refers to an existing attachment file will have its
own unique reference name for the attachment. When Cyrus deletes a
message file for any reason, it will also look through all the reference
names embedded in that file and delete those hardlinks too. Since hard
links cannot cross file systems, this means that if a Cyrus message
store is spread across multiple partitions, one physical copy of each
attachment file will (potentially) have to be stored in each partition,
to allow hardlinking from message references.
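
The hard-link bookkeeping could look roughly like the Python sketch
below. It assumes, on top of what I described, that the store keeps one
canonical name per attachment (so it can be found by checksum) and that
a periodic sweep deletes canonical names whose link count has fallen
back to 1. All function, directory and file names here are hypothetical.

```python
import os
import tempfile


def add_reference(blob_path, ref_path):
    """Give one message its own name for a shared attachment file."""
    # os.link raises OSError (EXDEV) if the two paths are on different
    # file systems, which is the multiple-partition problem noted above.
    os.link(blob_path, ref_path)


def drop_reference(ref_path):
    """Called when the message owning this reference is deleted."""
    os.unlink(ref_path)


def sweep_unreferenced(store_dir):
    """Remove attachments that only the store itself still names."""
    removed = []
    for name in os.listdir(store_dir):
        path = os.path.join(store_dir, name)
        if os.stat(path).st_nlink == 1:  # no message links remain
            os.unlink(path)
            removed.append(name)
    return removed


# Two messages share one attachment, then both are deleted.
with tempfile.TemporaryDirectory() as root:
    store_dir = os.path.join(root, "attachments")
    refs_dir = os.path.join(root, "refs")
    os.mkdir(store_dir)
    os.mkdir(refs_dir)
    blob = os.path.join(store_dir, "0123abcd")  # stand-in for an MD5 name
    with open(blob, "wb") as f:
        f.write(b"attachment bytes")

    ref_one = os.path.join(refs_dir, "msg1-att1")
    ref_two = os.path.join(refs_dir, "msg2-att1")
    add_reference(blob, ref_one)
    add_reference(blob, ref_two)
    links_with_two_refs = os.stat(blob).st_nlink

    drop_reference(ref_one)
    swept_early = sweep_unreferenced(store_dir)  # one message remains
    drop_reference(ref_two)
    swept_late = sweep_unreferenced(store_dir)   # nothing refers to it
```

A real sweep would also need some locking against a delivery that is
about to link to the same attachment; the sketch ignores that race.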

Shuvam


More information about the Info-cyrus mailing list