De-duping attachments

Rob Mueller robm at fastmail.fm
Wed Sep 15 00:55:03 EDT 2010


> A 500-user company can easily acquire an email archive of 2-5TB. I don't
> care how much the IO load of that archive server increases, but I'd like
> to reduce disk space utilisation. If the customer can stick to 2TB of

It would be interesting to measure the amount of duplication that is going 
on with attachments in emails.
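
As a rough sketch of how one might measure it (Python, standard library
only): hash the decoded content of every attachment and compare total bytes
against unique bytes. The function names are made up for illustration, and
"attachment" here is assumed to mean any non-multipart MIME part that
carries a filename, which will miss some inline content.

```python
import hashlib
from email import message_from_bytes
from email.message import EmailMessage


def attachment_payloads(raw_message: bytes):
    """Yield decoded bytes of each non-multipart part that has a filename."""
    msg = message_from_bytes(raw_message)
    for part in msg.walk():
        if part.is_multipart() or part.get_filename() is None:
            continue
        payload = part.get_payload(decode=True)
        if payload:
            yield payload


def duplication_stats(raw_messages):
    """Return (total_bytes, unique_bytes) over all attachments seen."""
    seen = set()
    total = unique = 0
    for raw in raw_messages:
        for payload in attachment_payloads(raw):
            total += len(payload)
            digest = hashlib.sha256(payload).digest()
            if digest not in seen:
                seen.add(digest)
                unique += len(payload)
    return total, unique


def build_message(blob: bytes, filename: str) -> bytes:
    """Hypothetical helper: build a small test message with one attachment."""
    m = EmailMessage()
    m["Subject"] = "test"
    m.set_content("see attached")
    m.add_attachment(blob, maintype="application",
                     subtype="octet-stream", filename=filename)
    return m.as_bytes()
```

The ratio unique/total over a real spool would be the number of interest.
Note this only catches exact duplicates of the decoded content.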

While we could do that with Fastmail data, I think that because of our broad 
range of users we'd be getting just one data point, which might be quite 
different to a data point from inside one company. Eg an architectural firm 
might end up sending big blueprint documents back and forth internally a 
lot, so they'd gain a lot from deduplication.

Also, even within deduplication there are some interesting ideas. For 
instance, if you know the same file is being sent back and forth a lot with 
minor changes, you might want to store the most recent version in full, and 
store binary diffs between it and the older versions (eg xdelta). Yes, 
accessing the older versions would be much slower (you have to get the most 
recent version and apply N deltas), but the space savings could be huge.
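
A minimal sketch of that "latest in full, older versions as deltas" layout.
This is not xdelta: as a stand-in available in the Python standard library,
it compresses each older version with zlib using the next newer version as a
preset dictionary. zlib only looks at the last 32KB of the dictionary, so
for real attachments a proper binary diff tool would be needed. The class
and function names are made up for illustration.

```python
import zlib


def make_delta(base: bytes, target: bytes) -> bytes:
    """Encode `target` relative to `base` (zlib preset dictionary, not xdelta)."""
    c = zlib.compressobj(level=9, zdict=base)
    return c.compress(target) + c.flush()


def apply_delta(base: bytes, delta: bytes) -> bytes:
    """Rebuild the target from `base` and a delta made by make_delta()."""
    d = zlib.decompressobj(zdict=base)
    return d.decompress(delta) + d.flush()


class ReverseDeltaStore:
    """Keeps the newest version whole; older versions are stored as deltas."""

    def __init__(self):
        self.latest = None
        self.deltas = []  # deltas[i] rebuilds version i from version i + 1

    def add_version(self, data: bytes) -> None:
        if self.latest is not None:
            # The old "latest" becomes a delta against the new version.
            self.deltas.append(make_delta(data, self.latest))
        self.latest = data

    def get(self, version: int) -> bytes:
        if self.latest is None or not 0 <= version <= len(self.deltas):
            raise IndexError(version)
        data = self.latest
        # Walk backwards from the newest version, one delta per step.
        for delta in reversed(self.deltas[version:]):
            data = apply_delta(data, delta)
        return data
```

Fetching the newest version costs nothing extra, while fetching version 0
applies every stored delta, which is the slow path described above.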

> Can we just brainstorm with you and others in this thread...  how do we
> re-create a byte-identical attachment from a disk file?

One overall implementation issue. With the message file, do you:

1. Completely rewrite the message file, removing the attachments and adding 
any extra metadata you want in their place
2. Leave the message file as exactly the same size, just don't write out the 
attachment content and assume your filesystem supports sparse files 
(http://en.wikipedia.org/wiki/Sparse_file)

The advantage of 2 is that it leaves the message file size correct, and all 
the offsets in the file are still correct. The downsides are that you must 
ensure your FS supports sparse files well, and there's the question of where 
do you actually store the information that links to the external file?
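
To illustrate option 2, here is a small Python sketch that writes a message
file but seeks over the attachment bytes instead of writing them. The
function names and the start/end offsets are assumptions for the example;
whether the skipped range really becomes a hole depends on the filesystem,
and st_blocks is assumed to be in 512-byte units, as it commonly is on
Linux.

```python
import os


def write_with_hole(path: str, message: bytes, start: int, end: int) -> None:
    """Write `message` to `path`, leaving bytes [start, end) unwritten."""
    with open(path, "wb") as f:
        f.write(message[:start])
        f.seek(end)                 # skip the attachment body, writing nothing
        f.write(message[end:])
        f.truncate(len(message))    # keeps the size right if the hole is last


def allocated_bytes(path: str):
    """Approximate on-disk usage, or None where st_blocks is unavailable."""
    blocks = getattr(os.stat(path), "st_blocks", None)
    return None if blocks is None else blocks * 512
```

The file size and every offset outside the hole stay as they were, and
reading the hole returns zero bytes. Comparing os.path.getsize() with
allocated_bytes() shows whether the filesystem actually saved the space.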

>  - file name/reference
>  - full MIME header of the attachment block

I'd leave these intact in the actual message, and just add an extra 
X-Detached-File header, or something like that, which includes some external 
reference to the file. Hmmm, that'll break signing though. Not so easy...

>  - separator string (this will be retained in the message body anyway)
>  - transfer encoding
>  - if encoding = base64 then
>        base64 line length

Remember every line can actually be a different length! In most cases they 
will all be the same length, but you can't assume it. You do see messages 
that have lines in repeating groups like 76, 76, 76, 76, 74, 76, 76, 76, 76, 
74, ... repeat ..., and cases like that are a pain to deal with.
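
One way to cope, sketched in Python: record the length of every line as
run-length pairs rather than a single number, and only detach the attachment
if re-encoding from those pairs gives back exactly the original bytes. The
function names are made up, and the sketch assumes one consistent line
ending (CRLF by default) throughout the encoded body.

```python
import base64
from itertools import groupby


def describe(encoded: bytes, eol: bytes = b"\r\n"):
    """Return (raw_bytes, runs, trailing_eol) for a base64 body.

    `runs` is a list of (line_length, repeat_count) pairs.
    """
    lines = encoded.split(eol)
    trailing = lines[-1] == b""
    if trailing:
        lines.pop()
    raw = base64.b64decode(b"".join(lines), validate=True)
    runs = [(n, len(list(g))) for n, g in groupby(len(l) for l in lines)]
    return raw, runs, trailing


def rebuild(raw: bytes, runs, trailing: bool, eol: bytes = b"\r\n") -> bytes:
    """Re-encode `raw` and re-wrap it using the recorded line lengths."""
    text = base64.b64encode(raw)
    out, pos = [], 0
    for length, count in runs:
        for _ in range(count):
            out.append(text[pos:pos + length])
            pos += length
    body = eol.join(out)
    return body + eol if trailing else body


def detach(encoded: bytes, eol: bytes = b"\r\n"):
    """Return describe()'s result, or None if the round trip is not exact."""
    try:
        raw, runs, trailing = describe(encoded, eol)
    except ValueError:          # binascii.Error is a ValueError subclass
        return None
    if rebuild(raw, runs, trailing, eol) != encoded:
        return None             # keep the original encoded text instead
    return raw, runs, trailing
```

A body wrapped at 76, 76, 76, 76, 74 repeating costs two pairs per group
to describe, and anything odd (padding in the middle, non-canonical padding
bits, mixed line endings) makes detach() return None, so that message is
simply left alone.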

>  - checksum of encoded attachment (as a sanity check in case the
>    re-encoding fails to recreate exactly the same image as the original)

This is starting to look a bit trickier...

Rob



More information about the Info-cyrus mailing list