What would it take for FastMail to run murder

Wed Mar 18 06:49:45 EDT 2015

On 2015-03-18 01:51, Bron Gondwana wrote:
> On Wed, Mar 18, 2015, at 09:00 AM, Jeroen van Meeuwen (Kolab Systems) wrote:
>> We promote a standby frontend not otherwise used, to become the new
>> mupdate server. The interruption is a matter of seconds this way,
>> unless of course you're in the typical stalemate.
> 
> Hmm.... so maybe it's affordable.  It scales up with number-of-servers
> as well though.  Making sure it's up to date costs at least O(number of
> backends).
> 

I suppose that in your specific case, which I'm not all that familiar
with, enhancing murder/mupdate to allow cascading, (geo-based)
replication and/or multi-master might serve your deployment even
better?

I suggest this because I would be concerned about the round-trip times
between datacenters if there were only one mupdate master across all of
them -- and perhaps the replicas are faster at issuing the cmd_set()
than the mupdate master is(?).

>> > Interesting.  Does it also handle the case where the same mailbox
>> > gets accidentally created on two servers which aren't replica pairs
>> > though? Or do you get a mailbox fork?
>> >
>> 
>> The race condition is not addressed with it, like it is not addressed
>> currently.
> 
> I'm not 100% happy living with unaddressed race conditions.  Addressing
> this would be an important part of making FastMail happy to run it.
> 

Ghe, neither am I, but c'est la vie.

That said, in ~5 years and dozens of murder deployments, I have yet to 
encounter a situation or even a support case in which one mailbox is -- 
accidentally or otherwise -- created in two locations without the second 
failing / being rejected.

>> It solely makes the MUPDATE server not reject the reservation
>> request from a server that uses the same "servername" if it already
>> has an entry for the same "servername!partition", so that the
>> replica successfully creates the local copy -- after which
>> replication is happy.
> 
> Yeah, that makes sense.  Of course, the backend should probably not be
> "reserving" so much.  There are two things conflated here:
> 
> 1) I'm running cmd_create in an IMAPd and I want to see if this folder
>    already exists.
> 
> 2) I'm a replica backend getting a copy of an existing folder (or
>    indeed, a backend which already has a folder) and I'm informing
>    mupdate of the fact.
> 
> Those two should be treated differently.  The first is "does this
> already exist", which is a legitimate question to ask.  The second
> should always succeed. MUPDATE is a representation of facts, and the
> backends are the masters of those facts.
> 

With two-way replication safety, however (and in your case channelled
as well, right?), it is up in the air which end of the replication gets
to submit the original cmd_set() (just in case things end up
load-balanced across replicas), no?

>> So this would build a scenario in which:
>> 
>>    "pair-1-replica-1.example.org" and "pair-1-replica-2.example.org"
>>    present themselves as "pair-1.example.org"
>> 
>>    A DNS IN A RR is created for the fail-over address(es) for
>>    "pair-1.example.org" and attached to whichever replica in the
>>    pair is considered the active node.
>> 
>> Both replicas would be configured to replicate to one another, which
>> works in a PoC scenario but may seem to require lmtpd/AF_INET
>> delivery.
> 
> So they both have the same server name in mupdate.
> 

Yes, and frontends proxy the connections for mailboxes on the backend to 
the same fake server address.

> My plan is that they have different server names in mupdate.  There's a
> separate channel somehow to say which is the primary out of those
> servers, which can be switched however (failover tooling) based on
> which servers are up, but the murder has the facts about where the
> mailbox really exists.
> 
> It may even have statuscache.  Man, how awesome would distributed
> statuscache be.
> 
> So there are multiple records for the same mailbox, with different
> server names, in the murder DB.
> 

Would this not reopen the door to a variety of race conditions (that
would need to be addressed somehow) though?

Should one of the duplicate mailboxes then be marked as the primary?
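
To make the question concrete, here is a minimal sketch in Python of a
record that holds several server names for one mailbox plus a primary
marker. It is not Cyrus code: `MailboxRecord`, `add_copy` and `promote`
are hypothetical names, and the rule that the first server to report
becomes the primary is an assumption made only for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class MailboxRecord:
    """Hypothetical record: every server holding a copy, plus one primary."""
    servers: Set[str] = field(default_factory=set)
    primary: Optional[str] = None

    def add_copy(self, server: str) -> None:
        self.servers.add(server)
        # Assumption: the first server to report becomes primary by default.
        if self.primary is None:
            self.primary = server

    def promote(self, server: str) -> None:
        # Only a server that actually holds a copy may become primary.
        if server not in self.servers:
            raise ValueError(f"{server} holds no copy of this mailbox")
        self.primary = server


record = MailboxRecord()
record.add_copy("pair-1-replica-1.example.org")
record.add_copy("pair-1-replica-2.example.org")
record.promote("pair-1-replica-2.example.org")  # e.g. after a failover
```

Note that "first to report wins" is exactly the kind of ordering
dependency that could race if both replicas report at the same time, so
a real design would need some other way to pick the primary.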

A scenario that comes up often is the geographically close-yet-distant 
secondary site for disaster recovery, where a set of backends on the 
primary site replicate to a set of backends on the secondary site. While 
initially this succeeds perfectly fine, and the backends on the 
secondary site can participate in a local-to-site murder, transferring 
mailboxes from one backend to another on the primary site will fail to 
replicate to the secondary site's backends (because of their 
participation in the murder).

This is in part because it is not the XFER being replicated as such, but 
the target backend's CREATE/cmd_set(), which will fail because the 
mailbox already resides on another backend.

I suppose a scenario in which the mupdate master is in fact able to hold 
multiple records for the same mailbox might also allow us to overcome 
this conundrum?
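
The secondary-site scenario can be replayed against a toy registry to
show the difference. This is a rough Python sketch under stated
assumptions, not a model of the real mupdate protocol: the function
`replay`, the backend names and the "accepted"/"rejected" strings are
all made up for illustration.

```python
def replay(events, allow_multiple):
    """Replay (mailbox, backend) CREATE events against a toy registry."""
    registry = {}  # mailbox -> set of backends holding it
    results = []
    for mailbox, backend in events:
        holders = registry.setdefault(mailbox, set())
        # Single-record behaviour: refuse a second, different backend.
        if holders and backend not in holders and not allow_multiple:
            results.append((backend, "rejected"))
            continue
        holders.add(backend)
        results.append((backend, "accepted"))
    return results


# Secondary site: the mailbox first arrives on site2-backend1 through
# replication. After an XFER on the primary site, replication now
# targets site2-backend2 with what looks like a fresh CREATE.
events = [("user.jane", "site2-backend1"), ("user.jane", "site2-backend2")]
single = replay(events, allow_multiple=False)
multi = replay(events, allow_multiple=True)
```

With `allow_multiple=False` the second event is rejected, which mirrors
the failure described above; with `allow_multiple=True` both are
accepted, although the stale copy on the first backend would still need
to be cleaned up by something.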

>> Would using shared memory address the in-memory problem? Admittedly
>> I've never coded any such, so I'm out of my comfort zone (again).
> 
> I'm not really comfortable with it either.  I'd prefer a mailboxes
> daemon with its own query language over a unix socket, because it
> punts a lot of the synchronisation problems.
> 

Such a mailbox list daemon would exist solely for efficient lookups, so
it would not need persistent storage at all, would it? If so, would
memcache be a consideration for such volatile key-value storage? The
trick would then be to store what one wants as many times as is needed
for lookups to be as efficient as possible. It could also be shared
over the network, and replicated memcached could be used for
redundancy.

However, perhaps memcache is not all that suitable for the
multiple-records-per-mailbox scenario.
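
As a rough illustration of "storing what one wants as many times as is
needed", the Python sketch below uses a plain dict as a stand-in for a
memcached-like store with string keys and string values. The key
prefixes `mbox:` and `owner:` and both function names are hypothetical,
not an existing schema.

```python
import json

cache = {}  # stand-in for a memcached-like store (string -> string)


def store_mailbox(name, location, owner):
    # Forward lookup: mailbox name -> location.
    cache["mbox:" + name] = location
    # Redundant reverse entry: owner -> list of mailbox names.
    key = "owner:" + owner
    names = json.loads(cache.get(key, "[]"))
    if name not in names:
        names.append(name)
    cache[key] = json.dumps(sorted(names))


def list_mailboxes(owner):
    # One key fetch answers the whole listing, with no scan needed.
    return json.loads(cache.get("owner:" + owner, "[]"))


store_mailbox("user.jane", "backend1!default", "jane")
store_mailbox("user.jane.Sent", "backend1!default", "jane")
```

The update of the `owner:` key is a read-modify-write that is not atomic
in this sketch. Against a real shared store, concurrent writers would
need something like compare-and-swap, and the same applies to any value
that holds several records for one mailbox, which may be part of why
that scenario fits a key-value cache poorly.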

>> > The minimum viable product for the fast LIST is basically this:
>> >
>> > * convert mupdated to use an sqlite file with the reverse indexes
>> >   built in to it instead of the mailboxes.db
>> > * convert the LIST code and mboxlist_lookup to use the sqlite file
>> > * even if not in a murder, also write mboxlist_* updates to the
>> >   sqlite file
>> > * leave all the existing murder stuff apart from this
>> >
>> > sqlite is already embedded for other things, so we don't add any
>> > dependencies.
>> >
>> 
>> I've had many issues with parallel (write) access by multiple
>> processes to a single sqlite database file, though, and with needing
>> to vacuum the database file after relatively few mutations
>> (thousands) as well, in order to keep things from slowing down.
> 
> Another reason to have a single thread doing the writes :)
> 
>> Is using SQLite for mailboxes.db not going to enter this sort of
>> problem space?
> 
> Perhaps.  We'd have to see how it copes in reality of course.  FastMail
> is big enough to test this pretty well!
> 

Yes, like I said on the IRC channel -- I'm scared of SQLite in this 
context, but if you're electing to make the change, since you're also on 
the receiving end of the angry pager, who am I to argue? ;-)
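
For reference, a minimal sketch of what an SQLite-backed mailbox list
with a reverse index and a single write path might look like, written
in Python with the standard `sqlite3` module. The table layout, the
column names and the choice to index by server are assumptions; the
thread does not spell out what the reverse indexes would actually
cover.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE mailboxes (
        name   TEXT NOT NULL,
        server TEXT NOT NULL,
        part   TEXT NOT NULL,
        PRIMARY KEY (name, server)
    );
    -- Reverse index: find all mailboxes on a given server quickly.
    CREATE INDEX mailboxes_by_server ON mailboxes (server, name);
""")


def record_mailbox(name, server, part):
    # All writes go through this one function, mirroring the
    # "single thread doing the writes" idea from the thread.
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO mailboxes (name, server, part) "
            "VALUES (?, ?, ?)",
            (name, server, part),
        )


def mailboxes_on(server):
    rows = conn.execute(
        "SELECT name FROM mailboxes WHERE server = ? ORDER BY name",
        (server,),
    )
    return [row[0] for row in rows]


record_mailbox("user.jane", "backend1", "default")
record_mailbox("user.jane.Sent", "backend1", "default")
record_mailbox("user.john", "backend2", "default")
```

Funnelling writes through one place sidesteps the multi-writer
contention mentioned above, but it says nothing about how the file
behaves after many mutations; that would still have to be measured.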

>> I can't find the actual patch file, so I must have dropped it, but
>> it's imap/mupdate.c line 1609 comparing the m->location found, if any,
>> to the const char *location passed along to cmd_set(), and if they're
>> (exactly) equal, not bailing.
> 
> Sure.  As I said above, I think the real solution is that sync_server
> creating a mailbox should always be allowed to assert the fact to the
> murder.  It's not a "please may I", it's a "this is how it is".
> 

My patch was an ugly workaround and not a proper solution regardless.
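
Since the patch file itself is gone, here is a rough reconstruction of
the idea in Python rather than C. It is a sketch only: `ToyMupdate`, the
`allow_same_location` switch and the simplified "OK"/"NO" return values
are illustrative and do not reproduce the actual code in
imap/mupdate.c.

```python
class ToyMupdate:
    """Toy registry: one "servername!partition" location per mailbox."""

    def __init__(self, allow_same_location):
        self.allow_same_location = allow_same_location
        self.entries = {}  # mailbox -> location

    def cmd_set(self, mailbox, location):
        existing = self.entries.get(mailbox)
        if existing is not None:
            # The workaround: an exactly equal location does not bail.
            if not (self.allow_same_location and existing == location):
                return "NO"
        self.entries[mailbox] = location
        return "OK"


unpatched = ToyMupdate(allow_same_location=False)
patched = ToyMupdate(allow_same_location=True)
for server in (unpatched, patched):
    server.cmd_set("user.jane", "pair-1.example.org!default")

# The second replica presents itself under the same server name.
retry_unpatched = unpatched.cmd_set("user.jane", "pair-1.example.org!default")
retry_patched = patched.cmd_set("user.jane", "pair-1.example.org!default")
```

In this sketch a different location is still refused in both modes, so
the workaround only helps when both replicas share one server name.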

Kind regards,

Jeroen van Meeuwen

-- 
Systems Architect, Kolab Systems AG

e: vanmeeuwen at kolabsys.com
m: +41 79 951 9003
w: https://kolabsystems.com

pgp: 9342 BF08