De-duping attachments

Shuvam Misra shuvam.misra at merceworld.com
Tue Sep 14 23:10:59 EDT 2010


Dear Rob,

I had reservations about some of these things too. :( In particular,
I was wondering about having to remember and recreate the exact
transfer-encoding. If both of us forward the same attachment in two
emails, and one of us encodes it in quoted-printable and the other in
base64, Cyrus had better be able to recreate each copy exactly, or have
some other workaround.
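
To make that concern concrete, here is a minimal Python sketch (not Cyrus
code; the sample payload is made up for illustration). It shows one payload
producing two different on-the-wire bodies that still decode to the same
bytes, which is why the decoded content alone is not enough to rebuild
either message.

```python
import base64
import quopri

# Made-up single-line payload with a non-ASCII byte pair and an "=" sign,
# both of which quoted-printable has to escape.
payload = b"caf\xc3\xa9 menu, price = 5"

as_base64 = base64.b64encode(payload)     # what one sender might transmit
as_qp = quopri.encodestring(payload)      # what another sender might transmit

# Both bodies decode to the same content, so they are dedup candidates...
same_content = base64.b64decode(as_base64) == quopri.decodestring(as_qp) == payload

# ...but the stored message bytes differ, so each must be rebuilt separately.
same_bytes_on_disk = as_base64 == as_qp
```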

I wasn't aware of the mmap() usage and the direct seeking into the middle
of the message body. But the bigger problem is what you've described about
reproducing the message byte-identically. If that can be solved, then we
can make Cyrus re-create the message while loading from disk and stick it
into RAM.
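
As I understand the mmap() point, serving a byte range is just a slice of
the mapped file. The sketch below is Python rather than Cyrus's C, and
fetch_range() is a hypothetical helper name, but it shows why every offset
in the stored file has to match the original message.

```python
import mmap
import os
import tempfile

def fetch_range(path, start, length):
    """Return `length` bytes at offset `start`, like BODY[]<start.length>."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            return m[start:start + length]

# A tiny made-up message written to a temporary file.
raw = b"Subject: hi\r\n\r\nbody text here\r\n"
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(raw)
    path = tmp.name

# The body starts right after the blank line that ends the headers.
body_start = raw.index(b"\r\n\r\n") + 4
chunk = fetch_range(path, body_start, 4)
os.remove(path)
```

If an attachment were stripped from the file, any offset past it would point
at the wrong bytes unless the message is rebuilt exactly first.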

Can we brainstorm with you and others in this thread: how do we
re-create a byte-identical attachment from a disk file?  What is the list
of attributes we will need to store per stripped attachment to allow an
exact re-creation?

  - file name/reference

  - full MIME header of the attachment block

  - separator string, i.e. the MIME boundary (this will be retained in
    the message body anyway)

  - transfer encoding

  - if encoding = base64 then
        base64 line length

  - checksum of encoded attachment (as a sanity check in case the re-encoding
    fails to recreate exactly the same image as the original)

If encoding = quoted-printable or uuencode, then don't strip the
attachment at all.
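
A rough Python sketch of the list above, only to get the idea on paper. The
names StrippedPart, try_strip() and restore(), the dict used as a blob store,
and the SHA-256 choice are all my assumptions, not anything in Cyrus. It also
assumes one extra bypass rule: if a trial re-encoding does not reproduce the
original bytes, leave the attachment alone.

```python
import base64
import hashlib
from dataclasses import dataclass

@dataclass
class StrippedPart:
    blob_ref: str       # hypothetical key into a dedup store
    mime_header: bytes  # full MIME header of the attachment block
    encoding: str       # transfer encoding (only base64 is stripped here)
    line_length: int    # base64 line length used by the sender
    newline: bytes      # line ending used inside the encoded body
    checksum: str       # SHA-256 of the encoded body, as a sanity check

def encode_b64(data, line_length, newline):
    """Base64-encode `data`, wrapped at `line_length` columns."""
    raw = base64.b64encode(data)
    lines = [raw[i:i + line_length] for i in range(0, len(raw), line_length)]
    return b"".join(line + newline for line in lines)

def try_strip(mime_header, encoding, encoded_body, store):
    """Return a StrippedPart, or None if the part must stay in the message."""
    if encoding.lower() != "base64":
        return None  # quoted-printable, uuencode, etc.: do not strip
    newline = b"\r\n" if b"\r\n" in encoded_body else b"\n"
    first_line = encoded_body.split(newline, 1)[0]
    if not first_line:
        return None
    try:
        data = base64.b64decode(encoded_body)
    except ValueError:
        return None  # malformed base64: do not strip
    # Bypass stripping unless a trial re-encoding is byte-identical.
    if encode_b64(data, len(first_line), newline) != encoded_body:
        return None
    blob_ref = hashlib.sha256(data).hexdigest()
    store[blob_ref] = data  # identical content lands under the same key
    return StrippedPart(blob_ref, mime_header, "base64", len(first_line),
                        newline, hashlib.sha256(encoded_body).hexdigest())

def restore(part, store):
    """Rebuild the encoded body and verify it against the stored checksum."""
    body = encode_b64(store[part.blob_ref], part.line_length, part.newline)
    if hashlib.sha256(body).hexdigest() != part.checksum:
        raise ValueError("re-encoding did not reproduce the original bytes")
    return body
```

With this shape, two messages carrying the same file wrapped at different
line lengths would share one stored blob but keep separate records, and
anything with irregular wrapping or a missing final line ending would simply
not be stripped.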

What other conditions may we need to look for to bypass attachment
stripping?

Can we just tap into all of you to get the ideas on paper, even if
it's not being implemented by anyone right now?  It'll at least help us
understand the system's internals better.

thanks a lot, and regards,
Shuvam

> Cyrus likes to mmap the whole file so it can just offset into it to
> extract whichever part is requested. In IMAP, you can request any
> arbitrary byte range from the raw RFC822 message using the
> body[]<start.length> construct, so you have to be able to byte
> accurately reconstruct the original email if you remove attachments.
> 
> Consider the problem of transfer encoding. Say you have a base64
> encoded attachment (which basically all are). When storing and
> deduping, you'd want to base64 decode it to get the underlying
> binary data. But depending on the line length of the base64 encoded
> data, the same file can be encoded in a large number of different
> ways. When you reconstruct the base64 data, you have to be byte
> accurate in your reconstruction so your offsets are correct, and so
> any signing of the message (eg DKIM) isn't broken.
> 
> Once you've solved those problems, the rest is pretty straightforward :)
> 
> Rob
> 

