De-duping attachments
Rob Mueller
robm at fastmail.fm
Wed Sep 15 00:55:03 EDT 2010
> A 500-user company can easily acquire an email archive of 2-5TB. I don't
> care how much the IO load of that archive server increases, but I'd like
> to reduce disk space utilisation. If the customer can stick to 2TB of
It would be interesting to measure the amount of duplication that is going
on with attachments in emails.
While we could do that with Fastmail data, I think because of the broad
range of users, we'd be getting one data point, which might be quite
different to a data point inside one company. E.g. an architectural firm
might end up sending big blueprint documents back and forth between each
other a lot, so they'd gain a lot from deduplication.
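As a rough idea of how that measurement could be done, here is a minimal sketch that hashes the decoded content of every attachment and compares total bytes against unique bytes. It assumes the messages are available as raw RFC 822 bytes; attachment_duplication and build_sample are made-up names for illustration, and treating only parts with "Content-Disposition: attachment" as attachments is a simplifying assumption.

```python
import hashlib
from email import message_from_bytes, policy
from email.message import EmailMessage


def attachment_duplication(raw_messages):
    """Return (total_bytes, unique_bytes) over decoded attachment payloads."""
    sizes = {}   # sha256 of decoded content -> decoded size
    total = 0
    for raw in raw_messages:
        msg = message_from_bytes(raw, policy=policy.default)
        for part in msg.walk():
            # Assumption: only explicit attachments count, not inline parts.
            if part.get_content_disposition() != "attachment":
                continue
            data = part.get_payload(decode=True) or b""
            total += len(data)
            sizes[hashlib.sha256(data).hexdigest()] = len(data)
    return total, sum(sizes.values())


def build_sample(payload):
    """Hypothetical helper: build one raw message carrying `payload`."""
    msg = EmailMessage()
    msg["Subject"] = "sample"
    msg.set_content("see attached")
    msg.add_attachment(payload, maintype="application",
                       subtype="octet-stream", filename="blob.bin")
    return msg.as_bytes()
```

The ratio of unique_bytes to total_bytes is an upper bound on what whole-file deduplication could save for that mail store; it says nothing about near-duplicates.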
Even within deduplication, there are some interesting ideas. For
instance, if you know the same file is being sent back and forth a lot with
minor changes, you might want to store the most "recent" version, and store
binary diffs between the most recent and old versions (e.g. xdelta). Yes,
accessing the older versions would be much slower (have to get most recent +
apply N deltas), but the space savings could be huge.
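To make the "most recent + N deltas" idea concrete, here is a small sketch. It does not use xdelta; difflib.SequenceMatcher from the Python standard library stands in as the delta engine, which is far too slow for real attachments but shows the shape. VersionChain, make_delta and apply_delta are hypothetical names.

```python
from difflib import SequenceMatcher


def make_delta(source, target):
    """Describe `target` as copies out of `source` plus literal bytes."""
    ops = []
    matcher = SequenceMatcher(None, source, target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))
        elif j2 > j1:                      # "replace" or "insert"
            ops.append(("data", target[j1:j2]))
        # "delete" adds nothing to the target
    return ops


def apply_delta(source, ops):
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            out += source[op[1]:op[2]]
        else:
            out += op[1]
    return bytes(out)


class VersionChain:
    """Newest version stored whole; older ones as deltas from the next newer."""

    def __init__(self, first):
        self.newest = first
        self.deltas = []   # deltas[-1] rebuilds the version just before newest

    def add(self, new):
        self.deltas.append(make_delta(new, self.newest))
        self.newest = new

    def get(self, steps_back):
        data = self.newest
        start = len(self.deltas) - steps_back
        for ops in reversed(self.deltas[start:]):
            data = apply_delta(data, ops)
        return data
```

Reading the newest version costs nothing extra, while get(n) applies n deltas in turn, which is the slow path described above.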
> Can we just brainstorm with you and others in this thread... how do we
> re-create a byte-identical attachment from a disk file?
One overall implementation issue. With the message file, do you:
1. Completely rewrite the message file removing the attachments and adding
any extra metadata you want in its place
2. Leave the message file as exactly the same size, just don't write out the
attachment content and assume your filesystem supports sparse files
(http://en.wikipedia.org/wiki/Sparse_file)
The advantage of 2 is that it leaves the message file size correct, and all
the offsets in the file are still correct. The downsides are that you must
ensure your FS supports sparse files well, and there's the question of where
do you actually store the information that links to the external file?
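For option 2, a minimal sketch of what writing the message with a hole might look like: write the bytes before the attachment, seek past the attachment region, and write the rest. Whether the skipped region really takes no disk space depends on the filesystem. write_with_hole and allocated_bytes are hypothetical names, and the 512-byte unit for st_blocks is an assumption that holds on Linux but is not guaranteed everywhere.

```python
import os
import tempfile


def write_with_hole(path, message, start, end):
    """Write `message` to `path`, leaving bytes [start, end) unwritten."""
    with open(path, "wb") as f:
        f.write(message[:start])
        f.seek(end)                  # skip over the attachment content
        f.write(message[end:])
        f.truncate(len(message))     # keeps the size right if the hole hits EOF


def allocated_bytes(path):
    """Disk space actually used, where the platform reports it (else None)."""
    blocks = getattr(os.stat(path), "st_blocks", None)
    # Assumption: st_blocks counts 512-byte units, as on Linux.
    return None if blocks is None else blocks * 512


# Usage with a made-up message: blank out the attachment body only.
head = b"Content-Type: application/pdf\r\n\r\n"
body = b"A" * (1 << 20)              # stand-in for the attachment content
tail = b"\r\n--boundary--\r\n"
message = head + body + tail
path = os.path.join(tempfile.mkdtemp(), "msg")
write_with_hole(path, message, len(head), len(head) + len(body))
print(os.path.getsize(path), allocated_bytes(path))
```

The logical size and every offset after the hole match the original message, but nothing in the file says where the attachment content now lives, which is exactly the open question about where to keep the link.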
> - file name/reference
> - full MIME header of the attachment block
I'd leave these intact in the actual message, and just add an extra
X-Detached-File header or something like that which includes some external
reference to the file. Hmmm, that'll break signing though. Not so easy...
> - separator string (this will be retained in the message body anyway)
> - transfer encoding
> - if encoding = base64 then
> base64 line length
Remember every line can actually be a different length! In most cases they
will be the same length, but you can't assume it. And you do see messages
that have lines in repeating groups like 76, 76, 76, 76, 74, 76, 76, 76, 76,
74, ... repeat ... or cases like that, which are a pain to deal with.
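One way to cope with varying line lengths, sketched under some assumptions: record the length of every line (run-length encoded so the common all-the-same case stays tiny), store the decoded bytes once, and on the way back out re-encode and cut the base64 text to the recorded layout. The sketch assumes CRLF line endings and a single continuous base64 stream; detach and reattach are hypothetical names. The checksum of the encoded form is what catches the cases this does not handle.

```python
import base64
import hashlib
from itertools import groupby


def detach(encoded):
    """Split a base64 body into raw bytes plus what is needed to rebuild it."""
    lines = encoded.split(b"\r\n")
    raw = base64.b64decode(b"".join(lines))
    # Run-length encode the line lengths: [(length, count), ...]
    lengths = (len(line) for line in lines)
    layout = [(n, len(list(group))) for n, group in groupby(lengths)]
    return raw, layout, hashlib.sha256(encoded).hexdigest()


def reattach(raw, layout, checksum):
    """Re-encode `raw` using the recorded layout and verify the result."""
    text = base64.b64encode(raw)
    lines, pos = [], 0
    for length, count in layout:
        for _ in range(count):
            lines.append(text[pos:pos + length])
            pos += length
    encoded = b"\r\n".join(lines)
    if hashlib.sha256(encoded).hexdigest() != checksum:
        raise ValueError("re-encoding did not reproduce the original bytes")
    return encoded
```

If reattach raises for a message (padding in the middle of the stream, mixed line endings, stray whitespace), the safe fallback is to not detach that attachment at all and keep the original message untouched.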
> - checksum of encoded attachment (as a sanity check in case the
> re-encoding fails to recreate exactly the same image as the original)
This is seeming a bit more tricky...
Rob