What would it take for FastMail to run murder

Bron Gondwana brong at fastmail.fm
Wed Mar 18 07:24:59 EDT 2015


On Wed, Mar 18, 2015, at 09:49 PM, Jeroen van Meeuwen (Kolab Systems) wrote:
> On 2015-03-18 01:51, Bron Gondwana wrote:
> > On Wed, Mar 18, 2015, at 09:00 AM, Jeroen van Meeuwen (Kolab Systems) 
> > wrote:
> >> We promote a standby frontend not otherwise used, to become the new
> >> mupdate server. The interruption is a matter of seconds this way,
> >> unless of course you're in the typical stalemate.
> >
> > Hmm.... so maybe it's affordable.  It scales up with number-of-
> > servers as well though.  Making sure it's up to date costs at least
> > O(number of backends).
> >
>
> I suppose in your specific case, which I'm not at all too familiar
> with, perhaps enhancing murder/mupdate to allow cascading and/or (geo-
> based) replication and/or multi-master would serve your deployment yet
> even better?

Hmm, yeah - geo updates and mailboxes.db changes.  I'm not super-
concerned that it's a slightly slow path - they are rare.  Might suck if
you're making a ton of changes all at once - but that should be OK too -
just make all the changes locally and then blat the whole lot in a
single transaction to the murder DB.

Or hell, make it eventually consistent.  All you need is a zookeeper-
style way to anoint one server as the owner of each fragment of
namespace.  So you can only create a new user's mailbox in one place at
a time, and then every user can only create mailboxes on their home
server.  Stop clashes from ever forming that way.

There are safe ways to do this that aren't a single mupdate master
(which already sucks when you're geographically distributed, I'm sure -
ask CMU, I'm pretty sure they are running it globally).

> > I'm not 100% happy living with unaddressed race conditions.
> > Addressing this would be an important part of making FastMail happy
> > to run it.
> >
>
> Ghe, neither am I, but c'est la vie.
>
> That said, in ~5 years and dozens of murder deployments, I have yet to
> encounter a situation or even a support case in which one mailbox is
> -- accidentally or otherwise -- created in two locations without the
> second failing / being rejected.

Yeah, it's a rare case because normal users can't do it, and at least in
our setup, the user creation itself is brokered through a singleton
database, and the location to create the user is calculated at that
point.

> > 1) I'm running cmd_create in an IMAPd and I want to see if this
> >    folder already exists.
> >
> > 2) I'm a replica backend getting a copy of an existing folder (or
> >    indeed, a backend which already has a folder) and I'm informing
> >    mupdate of the fact.
> >
> > Those two should be treated differently.  The first is "does this
> > already exist", which is a legitimate question to ask.  The second
> > should always succeed. MUPDATE is a representation of facts, and the
> > backends are the masters of those facts.
> >
>
> With two-way replication safety however (and in your case, channelled
> as well, right?), which end of the replication (just in case things
> end up load-balanced across replicas?) gets to submit the original
> cmd_set() is up in the air, no?

Er, not really.  Worst case they both do and you resolve them, Riak-
style, when they discover each other.

> > So they both have the same server name in mupdate.
> >
>
> Yes, and frontends proxy the connections for mailboxes on the backend
> to the same fake server address.

Yeah, of course.  We used to do this at FastMail - failover IP - but
it doesn't work across datacentres, so instead we have a source of truth
(a DB-backed daemon for now, but consul soon) which says where the
master is right now, and nginx just connects directly to the slot IP for
the master end - so we can proxy to a different datacentre
transparently.
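
As a rough sketch of that lookup, assuming the consul HTTP KV API - the
key layout cyrus/slots/<slot>/master is made up here, not FastMail's
actual schema:

    import requests

    CONSUL = "http://127.0.0.1:8500"

    def master_address(slot):
        """Ask consul which IP:port currently holds the master end of a slot."""
        r = requests.get(CONSUL + "/v1/kv/cyrus/slots/%s/master" % slot,
                         params={"raw": 1})
        if r.status_code == 404:
            return None              # unknown slot, or failover in progress
        r.raise_for_status()
        return r.text.strip()        # e.g. "10.0.2.14:2143", maybe in another DC

    # The proxy layer (nginx's mail auth hook, or a small daemon behind it)
    # resolves the user's slot to this address and connects straight there,
    # so failing over to the other datacentre is just a key update.
    print(master_address("slot-a1"))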

> > It may even have statuscache.  Man, how awesome would distributed
> > statuscache be.
> >
> > So there are multiple records for the same mailbox, with different
> > server names, in the murder DB.
> >
>
> Would this not open back up a route to entertaining a variety of race
> conditions (that would need to be addressed somehow) though?

Not really - because writes are always sourced from the backend you are
connected to.  What it COULD create, in theory, is stale reads - but
only stale in the way it would have been if you'd done the same read a
second ago.  IMAP makes no guarantees about parallel connections.

> Should then one of the duplicate mailboxes be marked as the primary?

Of course.  But not in mailboxes.db itself - separately, with either a
per-server scoping or a per-user/nameroot scoping.  There are arguments
for both: per-nameroot is a lot more data, particularly to update in a
failover case, but it also allows you to do really amazing stuff like
have per-user replicas configured directly with annotations or
something - such that any one user can be moved to a set of machines
within the murder and there's no need to actually define "pairs" of
machines at all.
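
A toy sketch of what that per-user scoping could look like - the
annotation name and the "first entry is primary" convention are
invented purely to illustrate the idea:

    def parse_replica_annotation(value):
        """'nb1.internal nb2.internal tb1.internal' -> (primary, replicas)."""
        servers = value.split()
        return servers[0], servers[1:]

    def route_for(user, annotations):
        """Pick the server a frontend should proxy `user` to, falling back
        to per-server scoping when no per-user annotation is set."""
        value = annotations.get(("user/" + user, "/vendor/example/replicas"))
        if value is None:
            return None                          # fall back to per-server map
        primary, _replicas = parse_replica_annotation(value)
        return primary

    annotations = {("user/brong", "/vendor/example/replicas"):
                   "nb1.internal nb2.internal tb1.internal"}
    print(route_for("brong", annotations))       # -> nb1.internal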

I'd almost certainly have an external process monitor that, though -
monitor disk usage, user sizes, user locations, etc - and rebalance
users by issuing the correct commands to change replication settings.
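
A very hand-wavy sketch of that rebalancer loop - the threshold, the
data sources and the "drain the fullest store into the emptiest" policy
are all assumptions, not how FastMail actually does it:

    def pick_moves(stores, users, high_water=0.85):
        """Yield (user, src, dst) moves draining any store over high_water
        into the emptiest store.  `stores` is a list of dicts with
        name/size/used; `users` maps store name to {name, bytes} dicts."""
        emptiest = min(stores, key=lambda s: s["used"] / s["size"])
        for store in stores:
            if store is emptiest:
                continue
            # move the biggest users first until we're back under the mark
            for user in sorted(users[store["name"]], key=lambda u: -u["bytes"]):
                if store["used"] / store["size"] <= high_water:
                    break
                yield user["name"], store["name"], emptiest["name"]
                store["used"] -= user["bytes"]
                emptiest["used"] += user["bytes"]

    stores = [{"name": "a", "size": 100, "used": 95},
              {"name": "b", "size": 100, "used": 40}]
    users = {"a": [{"name": "brong", "bytes": 30}, {"name": "ellie", "bytes": 10}],
             "b": []}
    for user, src, dst in pick_moves(stores, users):
        print("change replication: move %s from %s to %s" % (user, src, dst))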

> A scenario that comes up often is the geographically close-yet-distant
> secondary site for disaster recovery, where a set of backends on the
> primary site replicate to a set of backends on the secondary site.
> While initially this succeeds perfectly fine, and the backends on the
> secondary site can participate in a local-to-site murder, transferring
> mailboxes from one backend to another on the primary site will fail to
> replicate to the secondary site's backends (because of their
> participation in the murder).

Yeah, so that's something I want fixed for sure.

> This is in part because it is not the XFER being replicated as such,
> but the target backend's CREATE/cmd_set(), which will fail because the
> mailbox already resides on another backend.
>
> I suppose a scenario in which the mupdate master is in fact able to
> hold multiple records for the same mailbox might also allow us to
> overcome this conundrum?

Now here's a thing that Ken didn't address that we do in our systems.  We
have 3 copies, 2 in New York and one in Thor, and we want to move a user
from store a to store b.  Servers are called (short names for clarity):

na1 na2 ta1 nb1 nb2 tb1

Masters are na1 and nb1.

We run an initial set of syncs as follows: na2 => nb1, na2 => nb2,
ta1 => tb1

(notice it's from replicas, and from the same datacentre), meaning that
the bulk of the data transfer is done on local networks.

We then run

na1 => nb1, na1 => nb2, na1 => tb1

which guarantees the targets are up to date from the master.

We then lock the user down entirely, no delivery, no logins, and run:

na1 => nb1, na1 => nb2, na1 => tb1

... one final time to ensure everything is synced.

BTW Ellie - there's going to be a questionnaire on this one at the end
of the week, just to make sure you're reading all my emails!

Ken's stuff does all that for just the one destination server, but not
the locality-based sync in the first case or the multiple destinations.
I'm not even sure how that would look in a murder setup.  You'd have to
know what the replicas were, and tell each one to go sync from its
"closest" source, however you define that.

And THEN we update the DB to say the user's store is b, and release the locks.
Total lock time is usually a handful of seconds.

(I skipped the bit where after the first big sync we run a full index on
the user and archive the index so search is all in place...)
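
Spelled out as a sketch, that whole dance looks roughly like this -
run_sync(), lock_user(), unlock_user() and set_store() are hypothetical
stand-ins for whatever drives sync_client and the provisioning DB:

    def move_user(user, run_sync, lock_user, unlock_user, set_store):
        # 1. Bulk copy from replicas in the same datacentre, so the heavy
        #    data transfer stays on local networks.
        run_sync("na2", "nb1", user)
        run_sync("na2", "nb2", user)
        run_sync("ta1", "tb1", user)

        # 2. Catch-up pass from the current master, so the targets are
        #    guaranteed up to date.
        for dst in ("nb1", "nb2", "tb1"):
            run_sync("na1", dst, user)

        # 3. Lock the user down (no delivery, no logins), run one final
        #    pass from the master, flip the store in the DB, then unlock.
        #    Total lock time is a handful of seconds.
        lock_user(user)
        try:
            for dst in ("nb1", "nb2", "tb1"):
                run_sync("na1", dst, user)
            set_store(user, "b")
        finally:
            unlock_user(user)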

> Such mbox list daemon solely for efficient lookups -- it would not
> need persistent storage at all would it? If so, would memcache be a
> consideration for such volatile key-value storage? The key then being
> storing what one wants as many times as is needed for lookups to be as
> efficient as possible, but it can possibly be shared over the network,
> and replicated memcached could be used for redundancy.

Almost certainly it would be fine to whack in something like memcached.
But not memcached itself, because you need the ability to iterate and a
secondary index mapping username => visible folders.
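
A toy in-memory version of what such a daemon needs beyond a plain
key-value cache (names invented; it ignores replication and persistence
entirely):

    from collections import defaultdict

    class MboxList:
        def __init__(self):
            self.mailboxes = {}                  # mboxname -> record
            self.by_user = defaultdict(set)      # username -> visible folders

        def set(self, mboxname, record, visible_to):
            self.mailboxes[mboxname] = record
            for user in visible_to:
                self.by_user[user].add(mboxname)

        def visible_folders(self, user):
            return sorted(self.by_user[user])

        def iterate(self, prefix=""):
            # the iteration plain memcached can't give you (wildcard LIST, dumps)
            return (m for m in sorted(self.mailboxes) if m.startswith(prefix))

    ml = MboxList()
    ml.set("user.brong", {"server": "na1"}, {"brong"})
    ml.set("user.brong.lists", {"server": "na1"}, {"brong", "ellie"})
    print(ml.visible_folders("ellie"))           # ['user.brong.lists']
    print(list(ml.iterate("user.brong")))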

> However, perhaps memcache is not so much suitable for the multiple-records-per-
> mailbox scenario.

Yeah, that.

> Yes, like I said on the IRC channel -- I'm scared of SQLite in this
> context, but if you're electing to make the change, since you're also
> on the receiving end of the angry pager, who am I to argue? ;-)

:)

I'm not convinced by sqlite either, because I want clustering on
this thing.

> My patch was an ugly workaround and not a proper solution regardless.

Heh.  Ugly workaround that works beats grand vision that doesn't exist
yet, any day of the week.

-- 
  Bron Gondwana
  brong at fastmail.fm

