De-duping attachments

Wed Sep 15 02:30:28 EDT 2010

On Wed, Sep 15, 2010 at 08:40:59AM +0530, Shuvam Misra wrote:
> Dear Rob,
> 
> I had reservations about some of these things too. :( In particular,
> I was wondering about having to remember and recreate the exact
> transfer-encoding. If both of us forward the same attachment in two
> emails, and one encodes in quoted-printable, the other in base64, Cyrus
> had better be able to recreate them exactly or have some other
> workarounds.
> 
> I wasn't aware of the mmap() usage and the direct seeking into the middle
> of the message body. But the bigger problem is what you've described about
> reproducing the message byte-identically. If that can be solved, then we
> can make Cyrus re-create the message while loading from disk and stick it
> into RAM.

There's not actually THAT much parsing of the message body.  I would
guess it's about 9 places:

imap/cyrdump.c
250:	r = mailbox_map_message(state->mailbox, uids[i], &base, &len);

imap/index.c
1013:        if (mailbox_map_message(mailbox, im->record.uid,
1535:    if (mailbox_map_message(mailbox, im->record.uid, &msg_base, &msg_size)) 
2441:	if (mailbox_map_message(mailbox, im->record.uid, &msg_base, &msg_size)) {
2716:    if (mailbox_map_message(mailbox, im->record.uid, &msg_base, &msg_size))
3152:	    if (mailbox_map_message(mailbox, im->record.uid,
3337:  if (mailbox_map_message(mailbox, uid, &msgfile.base, &msgfile.size)) {
5112:	if (mailbox_map_message(mailbox, im->record.uid, &msg_base, &msg_size))

(those 8 plus one in imap/message.c where it gets parsed originally)

> Can we just brainstorm with you and others in this thread...  how do we
> re-create a byte-identical attachment from a disk file?  What is the list
> of attributes we will need to store per stripped attachment to allow an
> exact re-creation?

I did a bunch of work on this a while back.  Basically, for the
byte-identical reverse, as I said: keep a list of the most common mapping
functions and try to figure out algorithmically which one was used.
In theory we can work out what the common ones are pretty fast.
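
A minimal sketch of the "try the common mapping functions" idea, in Python
for brevity rather than Cyrus's C.  The candidate list and the names
encode_base64 and guess_encoder are hypothetical, chosen for illustration;
a real list would come from surveying actual mail.

```python
import base64

def encode_base64(raw: bytes, width: int, eol: bytes) -> bytes:
    """Base64-encode raw, wrapping at width chars, with eol after every line."""
    text = base64.b64encode(raw)
    lines = [text[i:i + width] for i in range(0, len(text), width)]
    return b"".join(line + eol for line in lines)

# Hypothetical candidates: (line width, line ending) pairs, tried in order.
CANDIDATES = [
    (76, b"\r\n"), (76, b"\n"),
    (72, b"\r\n"), (72, b"\n"),
    (64, b"\r\n"), (60, b"\r\n"),
]

def guess_encoder(encoded: bytes):
    """Return the (width, eol) pair that reproduces encoded exactly, or None."""
    try:
        # validate=False skips line breaks and other non-alphabet bytes.
        raw = base64.b64decode(encoded, validate=False)
    except ValueError:
        return None
    for width, eol in CANDIDATES:
        if encode_base64(raw, width, eol) == encoded:
            return (width, eol)
    return None
```

If nothing matches, the attachment would simply be left in place.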

>   - file name/reference
> 
>   - full MIME header of the attachment block
> 
>   - separator string (this will be retained in the message body anyway)
> 
>   - transfer encoding

All this stuff I'd keep as a binary diff from the "nearly right"
re-encoding.
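
As a rough illustration of the binary-diff idea (not a proposed on-disk
format): store only the spans where the original differs from the "nearly
right" re-encoding, then patch them back in on read.  make_patch and
apply_patch are hypothetical names, and difflib is used only because it
ships with Python; it would be far too slow for large attachments.

```python
from difflib import SequenceMatcher

def make_patch(guess: bytes, original: bytes):
    """List (start, end, replacement) spans where original differs from guess."""
    matcher = SequenceMatcher(None, guess, original, autojunk=False)
    return [(i1, i2, original[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()
            if tag != "equal"]

def apply_patch(guess: bytes, patch) -> bytes:
    """Rebuild the original bytes from the re-encoding plus the stored spans."""
    out = bytearray()
    pos = 0
    for start, end, replacement in patch:
        out += guess[pos:start]   # unchanged bytes before this span
        out += replacement        # bytes taken from the original
        pos = end
    out += guess[pos:]            # unchanged tail
    return bytes(out)
```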

>   - if encoding = base64 then
>         base64 line length

Yeah, that's an interesting one.  Assuming it's not totally pathological
there will be some base64 pattern you can find quickly.
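
One way to find the pattern, sketched under the assumption that a sane
encoder uses a single line ending and a fixed width for every line except
possibly the last.  base64_line_pattern is a hypothetical helper name.

```python
def base64_line_pattern(encoded: bytes):
    """Return (width, eol) if all lines but the last share one width, else None."""
    eol = b"\r\n" if b"\r\n" in encoded else b"\n"
    lines = encoded.split(eol)
    if lines and lines[-1] == b"":
        lines.pop()               # drop the empty piece after the final eol
    if not lines or len(lines[0]) == 0:
        return None
    width = len(lines[0])
    if any(len(line) != width for line in lines[:-1]):
        return None               # irregular wrapping: treat as pathological
    if len(lines[-1]) > width:
        return None
    return (width, eol)
```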

>   - checksum of encoded attachment (as a sanity check in case the re-encoding
>     fails to recreate exactly the same image as the original)

We like sha1s.
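
The sanity check itself is small.  A sketch, with verify_reencoding as a
hypothetical name: hash the encoded attachment when stripping it, then
refuse to serve a rebuilt copy whose hash differs.

```python
import hashlib

def sha1_hex(data: bytes) -> str:
    """Hex SHA-1 of the encoded attachment bytes."""
    return hashlib.sha1(data).hexdigest()

def verify_reencoding(rebuilt: bytes, stored_sha1: str) -> bool:
    """True only if the rebuilt bytes hash to the value stored at strip time."""
    return sha1_hex(rebuilt) == stored_sha1
```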

> If encoding = quoted-printable or uuencode, then don't strip the
> attachment at all.

Makes sense.  There might be some size based logic here too - only
bother applying this on messages over 20k, and where the attachment
is at least 20k in size.  Anything smaller than that is pretty
pointless.
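
Put together, the bypass rules so far might look like the sketch below.
should_strip is a hypothetical name, and reading "20k" as 20 * 1024 bytes
is an assumption.

```python
MIN_SIZE = 20 * 1024  # assumption: "20k" means 20 KiB

# Encodings left alone, per the suggestion quoted above.
SKIP_ENCODINGS = {"quoted-printable", "uuencode", "x-uuencode"}

def should_strip(message_size: int, attachment_size: int, encoding: str) -> bool:
    """Decide whether an attachment is worth de-duplicating."""
    if encoding.strip().lower() in SKIP_ENCODINGS:
        return False
    # Message over 20k, attachment at least 20k.
    return message_size > MIN_SIZE and attachment_size >= MIN_SIZE
```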

> What other conditions may we need to look for to bypass attachment
> stripping?
> 
> Can we just tap into all of you to get the ideas on paper, even if
> it's not being implemented by anyone right now?  It'll at least help us
> understand the system's internals better.

Sure.  Ideas are good :)  I don't think I'm sold on the value though,
especially given that Rob is actually the one who argued me down from
implementing this years ago ;)  But maybe our use case isn't the
same as yours.

Bron.