giving sync_client the ability to back off and retry locked mailboxes

Thu Jun 30 23:16:30 EDT 2016

I'm on a train with limited power and internet, so briefly: I strongly disagree with any partial repair that leaves some mailboxes unsynced.

I'm happy with a "failing mailbox gets copied to the next sync log file" approach, so long as sync_client can tell that no mailbox in the log file succeeded and back off rather than busy-spinning on the same mailbox.
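
To make that concrete, here is a rough sketch of the behaviour I mean. It is illustrative Python, not the sync_client code: the names run_log, next_delay and replicate, the sync_one callback, and the delay values are all made up for the example. It assumes sync_one returns True when a mailbox synced and False when the target reported it as locked; any other failure is assumed to raise and bail out as it does today.

```python
import time
from typing import Callable, List, Tuple


def run_log(entries: List[str],
            sync_one: Callable[[str], bool]) -> Tuple[List[str], int]:
    """Process one log file's worth of mailboxes.

    Returns the mailboxes to copy into the next log (the locked ones)
    and a count of how many synced successfully.
    """
    deferred = []
    succeeded = 0
    for mailbox in entries:
        if sync_one(mailbox):
            succeeded += 1
        else:
            # Locked: carry it forward rather than dropping it.
            deferred.append(mailbox)
    return deferred, succeeded


def next_delay(current: float, succeeded: int, deferred: List[str],
               base: float = 1.0, cap: float = 300.0) -> float:
    """Double the delay only when a whole log made no progress."""
    if deferred and succeeded == 0:
        return min(cap, max(base, current * 2))
    return base


def replicate(entries: List[str],
              sync_one: Callable[[str], bool],
              sleep: Callable[[float], None] = time.sleep,
              max_rounds: int = 10) -> List[str]:
    """Keep retrying deferred mailboxes, backing off when nothing succeeds.

    Returns whatever is still pending after max_rounds.
    """
    pending = list(entries)
    delay = 1.0
    for _ in range(max_rounds):
        if not pending:
            break
        pending, succeeded = run_log(pending, sync_one)
        delay = next_delay(delay, succeeded, pending)
        if pending:
            sleep(delay)
    return pending
```

The point is only that the "did anything in this log succeed?" check is what separates a harmless deferral from a busy loop on a single locked mailbox, and that nothing is silently discarded along the way.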

On Fri, Jul 1, 2016, at 11:36, ellie timoney via Cyrus-devel wrote:
> This is veering off topic -- as per my original email, I'm talking
> specifically about locked replication targets (not corrupted, desynced,
> or otherwise broken), and I'm talking specifically about replication to
> a backupd (since no other part of Cyrus replication produces a locked
> mailbox response).
> 
> I'll comment briefly since it's come up here, but if you'd like to
> continue discussing general replication issues, please start a new
> thread (or raise a github issue), to avoid clutter/confusion in this
> thread.
> 
> On Thu, Jun 30, 2016, at 12:20 AM, Giles Malet wrote:
> > On Wed, 22 Jun 2016 12:25:28 +1000
> > ellie timoney via Cyrus-devel <cyrus-devel at lists.andrew.cmu.edu> wrote:
> > > D) Don't add the new sync_action_list.  If any operation returns
> > > IMAP_MAILBOX_LOCKED, just sync_log() that operation and continue, and
> > > let the next run deal with it.
> > 
> > I meant to comment on this a while ago, and your latest message just
> > reminded me.
> > 
> > The current sync_client has an awful habit of just quitting when things
> > go wrong ("Bailing out!"). 
> 
> It actually makes a reasonable go at working things out itself, and only
> bails out when all its options are exhausted.
> 
> > This is not ideal for a system that is trying
> > hard to keep the replica in sync. So we have a script that watches for
> > this happening, and restarts it. A problem though with simply restarting
> > is that whatever caused the bailing is still there, and it will happen
> > again. So we move the old log out of the way, do one more try on that
> > log, then discard it. 
> > 
> > This way at least most stuff is kept in sync, and replication is still
> > running. We might lose a small amount of changes, but that is preferable
> > to losing a large amount of changes when the client dies.
> > 
> > This is where we are now. It's not ideal, but mostly works. Separately we
> > have to notice that there was a problem and reconstruct or whatever, and
> > perhaps sync the problematic client.
> 
> The thing is, there will pretty much always be certain classes of
> problems that cannot be resolved automatically and require administrator
> intervention, so there will always be cases where the only thing it can
> safely do is bail out.  And therefore administrators will always need to
> monitor for that in some way, and react appropriately.
> 
> We could hypothetically narrow this list of cases (that's a discussion
> for elsewhere), but the need to notice and react when it does fail for
> some reason won't go away.
> 
> > Anyhow, hopefully this is something to keep in mind with your latest
> > changes: don't get stuck in a loop if something is corrupted, which does
> > happen sometimes; & don't just quit and lose all changes!
> 
> The proposed deferral behaviour only affects replications to backups
> that are locked.  If a backup is corrupted or otherwise broken, in a way
> that sync_client/backupd can't manage to correct automatically, it will
> bail out in exactly the same way as it does for ordinary replications. 
> Deferrals will only occur in cases where we can reasonably expect the
> operation to succeed "in a little while, just not right now".
> 
> > 
> > Thanks for your work.
> > g
> 
> Cheers

-- 
  Bron Gondwana
  brong at fastmail.fm