giving sync_client the ability to back off and retry locked mailboxes
ellie timoney
ellie at fastmail.com
Wed Jun 29 02:19:44 EDT 2016
> D) Don't add the new sync_action_list. If any operation returns
> IMAP_MAILBOX_LOCKED, just sync_log() that operation and continue, and
> let the next run deal with it.
>
> [...]
>
> I'm inclining towards D at the moment (less code=>less bugs), even
> though it means throwing away most of my existing implementation and
> starting again. But it depends on sync_log() being okay, which I'm not
> sure about.
So I gave this option a shot, and it works okay in my testing.
Implementation is here:
https://github.com/elliefm/cyrus-imapd/compare/v30/t245-sync-client-lock-retry-take2
I like this implementation a lot more than the last, so I'll push it
upstream in a couple of days if no-one else has an opinion either way.
On Wed, Jun 22, 2016, at 12:25 PM, ellie timoney via Cyrus-devel wrote:
> This might be long, I'm going to front-load a bunch of background
> information to provide context to my eventual questions.
>
> I'm talking primarily about sync_client in rolling aka repeat mode,
> where it processes a channel's sync log ad infinitum. The one-shot
> invocations of sync_client are largely out of scope here.
>
> Also, hereafter when I say "sync_server", I mean both sync_server
> itself, and also the sync commands in imapd when the replication
> capability is enabled, since their underlying implementation is
> literally the same.
>
> In the current state of the world, sync_client has no idea what to do
> with a "NO IMAP_MAILBOX_LOCKED" response. This is generally not a
> problem: sync_server always blocks when obtaining mailbox locks, and so
> it never returns this status. In the pathological case where it blocks
> longer than sync_client is willing to wait (half an hour I believe),
> sync_client will treat the connection as dropped, abort its current run,
> wait for its sync_repeat_interval to elapse, and then process the entire
> aborted sync log again (and so on until it eventually succeeds, and then
> it gets on with the next log, and eventually catches up).
>
> As I understand it, replication targets are generally not expected to be
> doing much of anything, so the chances a destination mailbox is locked
> for long enough to be a problem right at the moment sync_client cares
> about it are pretty slim.
>
> Enter backupd.
>
> backupd functions pretty similarly to sync_server, except of course that
> its backing store is backup files, rather than mailboxes. Backup files
> are per-user, not per-mailbox, which means a broader locking
> granularity, which means more chance of contention. I also expect it's
> just more likely to encounter locked backups:
>
> * The compact process will probably take a while for large users,
> especially if run infrequently
> * An administrator (or tool) might spend a while browsing a user's
> backup to locate content that needs restoring
> * The restore process might take a while for large restorations
> * Someone debugging a backup index might get distracted while holding
> the lock... (sorry, self)
>
> At the moment, backupd has no choice but to block in the same way that
> sync_server does. But that means that, while it's waiting around for
> one user backup to become available, other user's backups are becoming
> stale. This might not be the end of the world, but it's hardly ideal.
>
> What I'd like to do is have backupd open backups in non-blocking mode,
> and immediately return a "NO IMAP_MAILBOX_LOCKED" response if it's
> unable to (this bit is trivial).
>
> At the moment, sync_client treats "NO IMAP_MAILBOX_LOCKED" like any
> other error: it aborts the entire run, waits until its
> sync_repeat_interval elapses, and then starts the aborted run again from
> the top. Any user it didn't get to before hitting the locked one will
> keep getting missed until the locked one succeeds. (And the logs get
> noisy with the errors.)
>
> (Actually, it promotes failed small-granularity syncs, like META,
> MAILBOX, etc up to USER, and then eventually aborts as described if the
> USER level also fails.)
>
> So what I mean is, sync_client is currently -somewhat- able to handle a
> "NO IMAP_MAILBOX_LOCKED" response, but not very gracefully.
>
> I would like to make it recognise and sensibly handle this response --
> effectively skipping over the locked user(s), carrying on with the rest
> of the job, and trying the locked ones again later, when they're
> hopefully no longer locked.
>
> I can see a few ways of implementing this:
>
> A) Add a new sync_action_list for locked users. When a USER fails due
> to IMAP_MAILBOX_LOCKED, add them to the locked users list, and keep
> going as usual. (If any smaller granularity fails, it gets promoted to
> USER.) Re-process the locked users list (with suitable delay time/retry
> limit config) once everything else has been processed. Abort the run as
> usual if the retry limit is reached and there are still unprocessed
> users in the locked list.
>
> B) Same as A, but instead of aborting if the retry limit is hit,
> sync_log() the affected user and succeed. The locked user will be
> handed down from log to log until it eventually succeeds, and in the
> meantime everyone else gets backed up normally.
>
> C) Same as B, but only retry once (not in a loop) before they get
> sync_logged for later processing.
>
> D) Don't add the new sync_action_list. If any operation returns
> IMAP_MAILBOX_LOCKED, just sync_log() that operation and continue, and
> let the next run deal with it.
>
> Option A is already implemented on my github, and seems to work ok:
> https://github.com/elliefm/cyrus-imapd/compare/v30/t245-sync-client-lock-retry
>
> I thought of using sync_log as a solution to the "it still might abort
> on long-held locks" problem, and then thinking through that, saw the
> other options.
>
> I'm not sure of the safety of calling sync_log() from within sync_client
> itself. It *looks* like it should be totally fine, but maybe I've
> missed something?
>
> If calling sync_log() from sync_client is not safe, then A is basically
> the solution, and I just need to polish it a little more and push it up.
>
> But if it is safe, then the other options start looking better: B is
> the most code, but allows us to handle locked backups within the same
> run if the lock turns out to not be held for long. C has a similar
> benefit, and similar amount of code, but won't wait around as long
> before deferring it to the next run. D is (I think) the smallest
> implementation, but any locked backup won't be processed again at all
> for at least sync_repeat_interval, even if it was only locked for a
> moment.
>
> I'm inclining towards D at the moment (less code=>less bugs), even
> though it means throwing away most of my existing implementation and
> starting again. But it depends on sync_log() being okay, which I'm not
> sure about.
>
> Thoughts?
>
> (None of this will have any impact on conventional sync_client =>
> sync_server replication, since sync_server never returns a "NO
> IMAP_MAILBOX_LOCKED" response. It could open up possibilities in the
> future, but I'm not exploring those here.)
More information about the Cyrus-devel
mailing list