giving sync_client the ability to back off and retry locked mailboxes

Wed Jun 29 10:51:56 EDT 2016

On Wed, 2016-06-29 at 10:20 -0400, Giles Malet via Cyrus-devel wrote:
> On Wed, 22 Jun 2016 12:25:28 +1000
> ellie timoney via Cyrus-devel <cyrus-devel at lists.andrew.cmu.edu> wrote:
> > D) Don't add the new sync_action_list.  If any operation returns
> > IMAP_MAILBOX_LOCKED, just sync_log() that operation and continue, and
> > let the next run deal with it.
> 
> I meant to comment on this a while ago, and your latest message just reminded me.
> 
> The current sync_client has an awful habit of just quitting when things go wrong ("Bailing out!"). This is not ideal for a system that is trying hard to keep the replica in sync. So we have a script that watches for this happening, and restarts it. A problem though with simply restarting is that whatever caused the bailing is still there, and it will happen again. So we move the old log out of the way, do one more try on that log, then discard it.
> 
> This way at least most stuff is kept in sync, and replication is still running. We might lose a small amount of changes, but that is preferable to losing a large amount of changes when the client dies.
> 
> This is where we are now. It's not ideal, but mostly works. Separately we have to notice that there was a problem and reconstruct or whatever, and perhaps sync the problematic client.
> 
> Anyhow, hopefully this is something to keep in mind with your latest changes: don't get stuck in a loop if something is corrupted, which does happen sometimes; & don't just quit and lose all changes!

This is something I've been crowing about for a while.  The current
replication logic is such that if one mailbox has a problem replicating,
sync_client bails out, and every other mailbox on the server stops
replicating too.  Without some system of notification in place and manual
intervention to fix things, you could end up with your replica server
missing months of data, in which case you'd need to restore from tape
anyway if the primary server dies.

My concern is that there are a lot of people who depend on replication
and don't have any monitoring set up to confirm that it's actually
working.

I had suggested a couple of potential improvements:

- Have the replication code automatically reconstruct if a mailbox is
broken.

- Skip the failed mailbox and log it to a "failed" log so a system
administrator can fix it but replication will continue to work.
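
To make the second suggestion concrete, here is a rough sketch in Python.
It is not Cyrus code: replicate_all, sync_one, MailboxSyncError and the
in-memory failed_log list are all hypothetical names made up for
illustration, and a real version would presumably write to a log file
rather than a list.

```python
from typing import Callable, Iterable, List


class MailboxSyncError(Exception):
    """Hypothetical error raised when one mailbox fails to replicate."""


def replicate_all(
    mailboxes: Iterable[str],
    sync_one: Callable[[str], None],
    failed_log: List[str],
) -> int:
    """Sync each mailbox; record failures instead of bailing out.

    Returns the number of mailboxes that synced successfully.
    """
    synced = 0
    for name in mailboxes:
        try:
            sync_one(name)
        except MailboxSyncError:
            # Set the broken mailbox aside for an administrator and keep going.
            failed_log.append(name)
            continue
        synced += 1
    return synced


def demo_sync(name: str) -> None:
    # Stand-in for the real replication step: pretend one mailbox is corrupt.
    if name == "user.broken":
        raise MailboxSyncError(name)


failed: List[str] = []
count = replicate_all(["user.alice", "user.broken", "user.bob"], demo_sync, failed)
```

The point is only that one bad mailbox ends up in the "failed" list while
the mailboxes before and after it still get replicated.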

Of course, it's easy to complain from the sidelines when I could have
just invested this energy into proposing a patch...

--
Dave