giving sync_client the ability to back off and retry locked mailboxes

ellie timoney ellie at fastmail.com
Tue Jun 21 22:25:28 EDT 2016


This might be long, I'm going to front-load a bunch of background
information to provide context to my eventual questions.

I'm talking primarily about sync_client in rolling aka repeat mode,
where it processes a channel's sync log ad infinitum.  The one-shot
invocations of sync_client are largely out of scope here.

Also, hereafter when I say "sync_server", I mean both sync_server
itself, and also the sync commands in imapd when the replication
capability is enabled, since their underlying implementation is
literally the same.

In the current state of the world, sync_client has no idea what to do
with a "NO IMAP_MAILBOX_LOCKED" response.  This is generally not a
problem: sync_server always blocks when obtaining mailbox locks, and so
it never returns this status. In the pathological case where it blocks
longer than sync_client is willing to wait (half an hour I believe),
sync_client will treat the connection as dropped, abort its current run,
wait for its sync_repeat_interval to elapse, and then process the entire
aborted sync log again (and so on until it eventually succeeds, and then
it gets on with the next log, and eventually catches up).

As I understand it, replication targets are generally not expected to be
doing much of anything, so the chances a destination mailbox is locked
for long enough to be a problem right at the moment sync_client cares
about it are pretty slim.

Enter backupd.

backupd functions pretty similarly to sync_server, except of course that
its backing store is backup files, rather than mailboxes.  Backup files
are per-user, not per-mailbox, which means a broader locking
granularity, which means more chance of contention.  I also expect it's
just more likely to encounter locked backups:

  * The compact process will probably take a while for large users,
  especially if run infrequently
  * An administrator (or tool) might spend a while browsing a user's
  backup to locate content that needs restoring
  * The restore process might take a while for large restorations
  * Someone debugging a backup index might get distracted while holding
  the lock... (sorry, self)

At the moment, backupd has no choice but to block in the same way that
sync_server does.  But that means that, while it's waiting around for
one user backup to become available, other user's backups are becoming
stale.  This might not be the end of the world, but it's hardly ideal. 

What I'd like to do is have backupd open backups in non-blocking mode,
and immediately return a "NO IMAP_MAILBOX_LOCKED" response if it's
unable to (this bit is trivial).

At the moment, sync_client treats "NO IMAP_MAILBOX_LOCKED" like any
other error: it aborts the entire run, waits until its
sync_repeat_interval elapses, and then starts the aborted run again from
the top.  Any user it didn't get to before hitting the locked one will
keep getting missed until the locked one succeeds.  (And the logs get
noisy with the errors.)

(Actually, it promotes failed small-granularity syncs, like META,
MAILBOX, etc up to USER, and then eventually aborts as described if the
USER level also fails.)

So what I mean is, sync_client is currently -somewhat- able to handle a
"NO IMAP_MAILBOX_LOCKED" response, but not very gracefully.

I would like to make it recognise and sensibly handle this response --
effectively skipping over the locked user(s), carrying on with the rest
of the job, and trying the locked ones again later, when they're
hopefully no longer locked.

I can see a few ways of implementing this:

A)  Add a new sync_action_list for locked users.  When a USER fails due
to IMAP_MAILBOX_LOCKED, add them to the locked users list, and keep
going as usual.  (If any smaller granularity fails, it gets promoted to
USER.)  Re-process the locked users list (with suitable delay time/retry
limit config) once everything else has been processed.  Abort the run as
usual if the retry limit is reached and there are still unprocessed
users in the locked list.

B) Same as A, but instead of aborting if the retry limit is hit,
sync_log() the affected user and succeed.  The locked user will be
handed down from log to log until it eventually succeeds, and in the
meantime everyone else gets backed up normally.

C) Same as B, but only retry once (not in a loop) before they get
sync_logged for later processing.

D) Don't add the new sync_action_list.  If any operation returns
IMAP_MAILBOX_LOCKED, just sync_log() that operation and continue, and
let the next run deal with it.

Option A is already implemented on my github, and seems to work ok:
https://github.com/elliefm/cyrus-imapd/compare/v30/t245-sync-client-lock-retry

I thought of using sync_log as a solution to the "it still might abort
on long-held locks" problem, and then thinking through that, saw the
other options.

I'm not sure of the safety of calling sync_log() from within sync_client
itself.  It *looks* like it should be totally fine, but maybe I've
missed something?

If calling sync_log() from sync_client is not safe, then A is basically
the solution, and I just need to polish it a little more and push it up.

But if it is safe, then the other options start looking better:  B is
the most code, but allows us to handle locked backups within the same
run if the lock turns out to not be held for long.  C has a similar
benefit, and similar amount of code, but won't wait around as long
before deferring it to the next run.  D is (I think) the smallest
implementation, but any locked backup won't be processed again at all
for at least sync_repeat_interval, even if it was only locked for a
moment.

I'm inclining towards D at the moment (less code=>less bugs), even
though it means throwing away most of my existing implementation and
starting again.  But it depends on sync_log() being okay, which I'm not
sure about.

Thoughts?

(None of this will have any impact on conventional sync_client =>
sync_server replication, since sync_server never returns a "NO
IMAP_MAILBOX_LOCKED" response.  It could open up possibilities in the
future, but I'm not exploring those here.)


More information about the Cyrus-devel mailing list