Potential replica message file corruption/replacement

Bron Gondwana brong at fastmail.fm
Thu Feb 15 21:55:06 EST 2007


Looks innocent, doesn't it...

On Thu, Feb 15, 2007 at 09:57:01AM -0500, murch at andrew.cmu.edu wrote:
> Update of /afs/andrew/system/cvs/src/cyrus/imap
> In directory unix36.andrew.cmu.edu:/var/tmp/cvs-serv18314
> 
> Modified Files:
> 	sync_support.c 
> Log Message:
> unlink() the file in the stage dir before trying to create a new one (Bron Gondwana)
> 
> 
> --- links to diffs follow ---
> http://bugzilla.andrew.cmu.edu/cgi-bin/cvsweb.cgi/src/cyrus/imap/sync_support.c.diff?r1=1.2&r2=1.3

... here's what our cyrus patches page has to say about it:

Our automated check scripts compare message bodies between
master and replica for a small random sample of messages.
Imagine our shock when they found two messages with exactly
the same index metadata but totally different message bodies
on disk, and the one on the replica didn't even belong to
the same user.

A more thorough audit turned up many more of these files,
and we scratched our heads for days until we realised:

a) we used to have lots of sync_client crashes during the
   early replication rollout.

b) after a crash, the sync_server left temporary files
   sitting around.

c) these files were hardlinked in some cases to files in
   users' folders - depending where the crash happened.

If (and believe me, with a dozen or so separate cyrus 
instances per machine each with many thousands of users,
it happens) another sync_server happened to reuse the same
PID at some point in the future - say a month later - it
would proceed to open those files in mode 'w+', writing
the new file not only to the staging directory BUT ALSO
OVERWRITING A FILE PUT INTO A MAILBOX A MONTH AGO. No
wonder we couldn't see any pattern in the causes of this
problem.
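To make the failure mode concrete, here is a minimal sketch of
the hardlink overwrite. It is not Cyrus code: the file names
are made up for illustration, and it only assumes a filesystem
that supports hard links.

```python
import os
import tempfile


def demo_hardlink_overwrite():
    """Return what ends up in the 'mailbox' file after a stale
    stage file is reopened in mode 'w+'."""
    with tempfile.TemporaryDirectory() as root:
        # Hypothetical names: a PID-based stage file and a mailbox file.
        stage = os.path.join(root, "stage-12345")
        mailbox = os.path.join(root, "user-a-msg")

        # First sync_server: write the message, link it into the
        # mailbox, then "crash" before removing the stage file.
        with open(stage, "w") as f:
            f.write("message for user A\n")
        os.link(stage, mailbox)

        # Much later: another process reuses the same stage name.
        # 'w+' truncates the existing inode instead of creating a
        # new one, so both names now see the new content.
        with open(stage, "w+") as f:
            f.write("message for user B\n")

        with open(mailbox) as f:
            return f.read()


if __name__ == "__main__":
    print(demo_hardlink_overwrite(), end="")
```

The mailbox file was never opened for writing, yet its content
changes, because both names point at the same inode.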

The attached patch attempts to unlink the stage file first,
and aborts before writing if the unlink fails and the file
still exists.
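The idea behind the patch can be sketched roughly as follows.
This is an illustration in Python, not the actual change to
sync_support.c; the function name open_stage_file is
hypothetical, and the O_EXCL flag is an extra guard that the
description above does not mention.

```python
import os


def open_stage_file(path):
    """Remove any stale stage file, then create a fresh one.

    Raises OSError (and writes nothing) if a stale file exists
    but cannot be unlinked.
    """
    try:
        os.unlink(path)
    except FileNotFoundError:
        pass  # no stale file, nothing to clean up
    # Any other OSError propagates: the file is there and we could
    # not remove it, so we must not write through it.

    # Assumption: O_EXCL as a second line of defence, so we never
    # open an existing inode by accident.
    fd = os.open(path, os.O_RDWR | os.O_CREAT | os.O_EXCL, 0o600)
    return os.fdopen(fd, "w+")
```

Unlinking removes only the stage directory's name for the old
inode, so a mailbox file hardlinked to it keeps its content and
the new stage file gets an inode of its own.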

----

I would advise anyone who has been using replication for
any length of time to undertake an audit of the files on
their replicas to ensure that none of them have been
replaced by this, because if you need to "fail over" you
could present users with emails that are not their own.
A simple size check will find almost all cases: compare
what the imapd returns for rfc822.size with the size of
the file on disk.  If you want to get fancy, compute the
sha1 or similar of the file at each end and compare that.
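The two checks might look something like the sketch below.
The helper names are hypothetical, and how you obtain the
rfc822.size value from imapd and map a message to its path on
disk is left out, since that depends on your setup.

```python
import hashlib
import os


def size_mismatch(path, reported_size):
    """True if the file on disk is not the size imapd reported
    (reported_size is the rfc822.size value, in bytes)."""
    return os.path.getsize(path) != reported_size


def file_sha1(path, chunk_size=1 << 16):
    """Hex SHA-1 of a file, read in chunks to keep memory use flat."""
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Run file_sha1 on the same message file on master and replica
and compare the two hex strings; any difference means the
replica copy needs a closer look.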

NOTE: a reconstruct of the folder may have caused the
index to have the wrong value, in which case you're left
with content analysis, which is never fun.

Bron.

