giving sync_client the ability to back off and retry locked mailboxes

Wed Jun 29 02:19:44 EDT 2016

> D) Don't add the new sync_action_list.  If any operation returns
> IMAP_MAILBOX_LOCKED, just sync_log() that operation and continue, and
> let the next run deal with it.
> 
> [...]
> 
> I'm inclining towards D at the moment (less code=>less bugs), even
> though it means throwing away most of my existing implementation and
> starting again.  But it depends on sync_log() being okay, which I'm not
> sure about.

So I gave this option a shot, and it works okay in my testing. 
Implementation is here:
https://github.com/elliefm/cyrus-imapd/compare/v30/t245-sync-client-lock-retry-take2

I like this implementation a lot more than the last, so I'll push it
upstream in a couple of days if no-one else has an opinion either way.

On Wed, Jun 22, 2016, at 12:25 PM, ellie timoney via Cyrus-devel wrote:
> This might be long, I'm going to front-load a bunch of background
> information to provide context to my eventual questions.
> 
> I'm talking primarily about sync_client in rolling aka repeat mode,
> where it processes a channel's sync log ad infinitum.  The one-shot
> invocations of sync_client are largely out of scope here.
> 
> Also, hereafter when I say "sync_server", I mean both sync_server
> itself, and also the sync commands in imapd when the replication
> capability is enabled, since their underlying implementation is
> literally the same.
> 
> In the current state of the world, sync_client has no idea what to do
> with a "NO IMAP_MAILBOX_LOCKED" response.  This is generally not a
> problem: sync_server always blocks when obtaining mailbox locks, and so
> it never returns this status. In the pathological case where it blocks
> longer than sync_client is willing to wait (half an hour I believe),
> sync_client will treat the connection as dropped, abort its current run,
> wait for its sync_repeat_interval to elapse, and then process the entire
> aborted sync log again (and so on until it eventually succeeds, and then
> it gets on with the next log, and eventually catches up).
> 
> As I understand it, replication targets are generally not expected to be
> doing much of anything, so the chances a destination mailbox is locked
> for long enough to be a problem right at the moment sync_client cares
> about it are pretty slim.
> 
> Enter backupd.
> 
> backupd functions pretty similarly to sync_server, except of course that
> its backing store is backup files, rather than mailboxes.  Backup files
> are per-user, not per-mailbox, which means a broader locking
> granularity, which means more chance of contention.  I also expect it's
> just more likely to encounter locked backups:
> 
>   * The compact process will probably take a while for large users,
>   especially if run infrequently
>   * An administrator (or tool) might spend a while browsing a user's
>   backup to locate content that needs restoring
>   * The restore process might take a while for large restorations
>   * Someone debugging a backup index might get distracted while holding
>   the lock... (sorry, self)
> 
> At the moment, backupd has no choice but to block in the same way that
> sync_server does.  But that means that, while it's waiting around for
> one user backup to become available, other user's backups are becoming
> stale.  This might not be the end of the world, but it's hardly ideal. 
> 
> What I'd like to do is have backupd open backups in non-blocking mode,
> and immediately return a "NO IMAP_MAILBOX_LOCKED" response if it's
> unable to (this bit is trivial).
> 
> At the moment, sync_client treats "NO IMAP_MAILBOX_LOCKED" like any
> other error: it aborts the entire run, waits until its
> sync_repeat_interval elapses, and then starts the aborted run again from
> the top.  Any user it didn't get to before hitting the locked one will
> keep getting missed until the locked one succeeds.  (And the logs get
> noisy with the errors.)
> 
> (Actually, it promotes failed small-granularity syncs, like META,
> MAILBOX, etc up to USER, and then eventually aborts as described if the
> USER level also fails.)
> 
> So what I mean is, sync_client is currently -somewhat- able to handle a
> "NO IMAP_MAILBOX_LOCKED" response, but not very gracefully.
> 
> I would like to make it recognise and sensibly handle this response --
> effectively skipping over the locked user(s), carrying on with the rest
> of the job, and trying the locked ones again later, when they're
> hopefully no longer locked.
> 
> I can see a few ways of implementing this:
> 
> A)  Add a new sync_action_list for locked users.  When a USER fails due
> to IMAP_MAILBOX_LOCKED, add them to the locked users list, and keep
> going as usual.  (If any smaller granularity fails, it gets promoted to
> USER.)  Re-process the locked users list (with suitable delay time/retry
> limit config) once everything else has been processed.  Abort the run as
> usual if the retry limit is reached and there are still unprocessed
> users in the locked list.
> 
> B) Same as A, but instead of aborting if the retry limit is hit,
> sync_log() the affected user and succeed.  The locked user will be
> handed down from log to log until it eventually succeeds, and in the
> meantime everyone else gets backed up normally.
> 
> C) Same as B, but only retry once (not in a loop) before they get
> sync_logged for later processing.
> 
> D) Don't add the new sync_action_list.  If any operation returns
> IMAP_MAILBOX_LOCKED, just sync_log() that operation and continue, and
> let the next run deal with it.
> 
> Option A is already implemented on my github, and seems to work ok:
> https://github.com/elliefm/cyrus-imapd/compare/v30/t245-sync-client-lock-retry
> 
> I thought of using sync_log as a solution to the "it still might abort
> on long-held locks" problem, and then thinking through that, saw the
> other options.
> 
> I'm not sure of the safety of calling sync_log() from within sync_client
> itself.  It *looks* like it should be totally fine, but maybe I've
> missed something?
> 
> If calling sync_log() from sync_client is not safe, then A is basically
> the solution, and I just need to polish it a little more and push it up.
> 
> But if it is safe, then the other options start looking better:  B is
> the most code, but allows us to handle locked backups within the same
> run if the lock turns out to not be held for long.  C has a similar
> benefit, and similar amount of code, but won't wait around as long
> before deferring it to the next run.  D is (I think) the smallest
> implementation, but any locked backup won't be processed again at all
> for at least sync_repeat_interval, even if it was only locked for a
> moment.
> 
> I'm inclining towards D at the moment (less code=>less bugs), even
> though it means throwing away most of my existing implementation and
> starting again.  But it depends on sync_log() being okay, which I'm not
> sure about.
> 
> Thoughts?
> 
> (None of this will have any impact on conventional sync_client =>
> sync_server replication, since sync_server never returns a "NO
> IMAP_MAILBOX_LOCKED" response.  It could open up possibilities in the
> future, but I'm not exploring those here.)