giving sync_client the ability to back off and retry locked mailboxes

Thu Jun 30 21:36:31 EDT 2016

This is veering off topic -- as per my original email, I'm talking
specifically about locked replication targets (not corrupted, desynced,
or otherwise broken), and I'm talking specifically about replication to
a backupd (since no other part of Cyrus replication produces a locked
mailbox response).

I'll comment briefly since it's come up here, but if you'd like to
continue discussing general replication issues, please start a new
thread (or raise a github issue), to avoid clutter/confusion in this
thread.

On Thu, Jun 30, 2016, at 12:20 AM, Giles Malet wrote:
> On Wed, 22 Jun 2016 12:25:28 +1000
> ellie timoney via Cyrus-devel <cyrus-devel at lists.andrew.cmu.edu> wrote:
> > D) Don't add the new sync_action_list.  If any operation returns
> > IMAP_MAILBOX_LOCKED, just sync_log() that operation and continue, and
> > let the next run deal with it.
> 
> I meant to comment on this a while ago, and your latest message just
> reminded me.
> 
> The current sync_client has an awful habit of just quitting when things
> go wrong ("Bailing out!"). 

It actually makes a reasonable go at working things out itself, and only
bails out when all its options are exhausted.

> This is not ideal for a system that is trying
> hard to keep the replica in sync. So we have a script that watches for
> this happening, and restarts it. A problem though with simply restarting
> is that whatever caused the bailing is still there, and it will happen
> again. So we move the old log out of the way, do one more try on that
> log, then discard it. 
> 
> This way at least most stuff is kept in sync, and replication is still
> running. We might lose a small amount of changes, but that is preferable
> to losing a large amount of changes when the client dies.
> 
> This is where we are now. It's not ideal, but mostly works. Separately we
> have to notice that there was a problem and reconstruct or whatever, and
> perhaps sync the problematic client.

The thing is, there will pretty much always be certain classes of
problems that cannot be resolved automatically and require administrator
intervention, so there will always be cases where the only thing it can
safely do is bail out.  And therefore administrators will always need to
monitor for that in some way, and react appropriately.

We could hypothetically narrow this list of cases (that's a discussion
for elsewhere), but the need to notice and react when it does fail for
some reason won't go away.

> Anyhow, hopefully this is something to keep in mind with your latest
> changes: don't get stuck in a loop if something is corrupted, which does
> happen sometimes; & don't just quit and lose all changes!

The proposed deferral behaviour only affects replications to backups
that are locked.  If a backup is corrupted or otherwise broken, in a way
that sync_client/backupd can't manage to correct automatically, it will
bail out in exactly the same way as it does for ordinary replications. 
Deferrals will only occur in cases where we can reasonably expect the
operation to succeed "in a little while, just not right now".
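
To make the distinction concrete, here is a rough sketch of that control
flow.  It is Python rather than the real C, and every name in it
(run_sync, do_action, MAILBOX_LOCKED, max_deferrals, BailOut) is made up
for illustration; none of it is actual sync_client code.  It also omits
the back-off delay a real implementation would want between retries.

```python
from collections import deque

# Hypothetical stand-ins for illustration only; not the real Cyrus codes.
OK = 0
MAILBOX_LOCKED = 1   # plays the role of IMAP_MAILBOX_LOCKED
FATAL = 2            # any error that cannot be corrected automatically


class BailOut(Exception):
    """Raised when the only safe option left is to stop and alert an admin."""


def run_sync(actions, do_action, max_deferrals=3):
    """Apply each action; defer locked ones, bail out on any other error.

    Returns the actions that were still locked after max_deferrals
    retries, which the caller would log for the next run to pick up.
    """
    queue = deque((action, 0) for action in actions)
    still_locked = []
    while queue:
        action, deferrals = queue.popleft()
        result = do_action(action)
        if result == OK:
            continue
        if result == MAILBOX_LOCKED:
            if deferrals < max_deferrals:
                # Expected to succeed "in a little while": retry after the rest.
                queue.append((action, deferrals + 1))
            else:
                still_locked.append(action)
            continue
        # Anything else is handled exactly as before: give up.
        raise BailOut(f"unrecoverable error {result} on {action!r}")
    return still_locked
```

Only the locked case is requeued; every other failure takes the same
bail-out path it always did.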

> 
> Thanks for your work.
> g

Cheers