De-duping attachments

Tue Sep 14 23:03:16 EDT 2010

On Wed, Sep 15, 2010 at 12:13:03PM +1000, Rob Mueller wrote:
> 
> > How difficult or easy would it be to modify Cyrus to strip all
> > attachments from emails and store them separately in files? In the
> > message file, replace the attachment with a special tag which will point
> > to the attachment file. Whenever the message is fetched for any reason,
> > the original MIME-encoded message will be re-constructed and delivered.

http://www.newegg.com/Product/Product.aspx?Item=N82E16822148413

2TB - US $109.

> Like anything, doable, but quite a lot of work.

Now de-duping messages on copy is valuable, not so much because of
the space it saves, but because of the IO it saves.  Copying the file
around is expensive.
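
To illustrate the IO point, here is a minimal sketch of a copy that
avoids rewriting the message bytes by hard-linking the file.  It is not a
description of Cyrus internals; copy_message is a hypothetical helper, and
it assumes source and destination are on the same filesystem.

```python
import os
import shutil


def copy_message(src: str, dst: str) -> str:
    """Copy a message file, preferring a hard link over a byte copy.

    Hypothetical helper for illustration only.
    """
    try:
        # A hard link adds a directory entry; no message data is rewritten.
        os.link(src, dst)
        return "linked"
    except OSError:
        # Different filesystem, or links unsupported: fall back to a real copy.
        shutil.copy2(src, dst)
        return "copied"
```

When the link succeeds, both names point at the same file on disk, so the
cost of the copy does not grow with the size of the message.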

De-duping components of messages and then reconstructing?  Not so much.
You'll be causing MORE IO in general: looking up the message, then finding
the parts.

The only real benefit I can see is something like replication, or a
client that's downloading several of these large messages and wants
to save network bandwidth.

Except - there's no protocol to support this for clients, so only
replication could gain.

> cyrus likes to mmap the whole file so it can just offset into it to extract 
> which ever part is requested. In IMAP, you can request any arbitrary byte 
> range from the raw RFC822 message using the body[]<start.length> construct, 
> so you have to be able to byte accurately reconstruct the original email if 
> you remove attachments.
> 
> Consider the problem of transfer encoding. Say you have a base64 encoded 
> attachment (which basically all are). When storing and deduping, you'd want 
> to base64 decode it to get the underlying binary data. But depending on the 
> line length of the base64 encoded data, the same file can be encoded in a 
> large number of different ways. When you reconstruct the base64 data, you 
> have to be byte accurate in your reconstruction so your offsets are correct, 
> and so any signing of the message (eg DKIM) isn't broken.
> 
> Once you've solved those problems, the rest is pretty straight forward :)

Yeah, they really aren't so hard to solve.  I haven't actually done the
research, but I have an idea of what to do.  Find a big corpus of emails
(e.g. FastMail's one!) and figure out the 10-20 most common base64 line
widths and surrounding layouts.  If an attachment matches one of those
exactly, store just a single "it's this layout" marker.  If none of them
match exactly, store a binary diff from the closest one as well; it
probably won't be very big.
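
A rough sketch of the exact-match case, to make the idea concrete.  The
function names and the (width, newline, trailing) tuple are my own
assumptions, not anything Cyrus has; the "surrounding layout" here is
reduced to line width, line ending and whether the data ends with a line
break.  The binary-diff fallback is not shown.

```python
import base64


def rebuild(raw: bytes, width: int, newline: bytes, trailing: bool) -> bytes:
    """Re-encode raw bytes as base64 using the given layout."""
    flat = base64.b64encode(raw)
    lines = [flat[i:i + width] for i in range(0, len(flat), width)]
    out = newline.join(lines)
    return out + newline if trailing else out


def detect_layout(encoded: bytes):
    """Return (width, newline, trailing) if that layout reproduces
    `encoded` byte for byte, otherwise None."""
    newline = b"\r\n" if b"\r\n" in encoded else b"\n"
    width = len(encoded.split(newline)[0])
    if width == 0:
        return None
    trailing = encoded.endswith(newline)
    try:
        # Line breaks are outside the base64 alphabet and are skipped.
        raw = base64.b64decode(encoded)
    except ValueError:
        return None
    # Only accept the layout if the reconstruction is byte accurate.
    if rebuild(raw, width, newline, trailing) == encoded:
        return (width, newline, trailing)
    return None
```

If detect_layout returns a layout, the decoded bytes plus that small tuple
are enough to regenerate the original encoded part exactly, so offsets and
signatures still line up.  If it returns None, you'd need the diff.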

But in general, I'd say you're optimising the wrong problem.  It's just not
worth it: the savings are minimal and the added complexity is high.  Disk
space is now cheap, and fast access via a cached copy of the email will
beat re-creating the original file from MIME parts hands down.

Bron.