What would it take for FastMail to run murder

Sat Mar 14 17:48:56 EDT 2015

On Sun, Mar 15, 2015, at 07:18 AM, Jeroen van Meeuwen (Kolab Systems) wrote:
> On 2015-03-13 23:50, Bron Gondwana wrote:
> > So I've been doing a lot of thinking about Cyrus clustering, with the
> > underlying question being "what would it take to make FastMail run a
> > murder".  We've written a fair bit about our infrastructure - we use
> > nginx as a frontend proxy to direct traffic to backend servers, and have
> > no interdependencies between the backends, so that we can scale
> > indefinitely.  With murder as it exists now, we would be pushing the
> > limits of the system already - particularly with the globally
> > distributed datacentres.
> > 
> > Why would FastMail consider running murder, given our existing
> > nice system?
> > 
> > a) we support folder sharing within businesses, so at the moment we are
> >    limited by the size of a single slot.  Some businesses already push
> >    that limit.
> > 
> 
> How, though, do you "ensure" that a mailbox for a new user in such a
> business is created on the same backend as all the other users of said 
> business?

If the business already exists, the create user code will fetch the server name
from the business database table and make that the creation server.

There's a cron job which runs every hour and looks for users who aren't on
the right server, so if we import a user to the business, they get moved.
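
A minimal sketch of that placement logic, for illustration only. The data
shapes and function names here are hypothetical and are not taken from
FastMail's code: `business_servers` stands in for the business database table,
and `misplaced_users` for the scan the hourly job would do before moving anyone.

```python
# Hypothetical sketch: every name below is an assumption made for illustration.

def creation_server(business_servers, business_id, default_server):
    """Pick the server for a new user.

    If the user belongs to a business that already has a server recorded,
    use that server; otherwise fall back to the default choice.
    """
    if business_id is not None and business_id in business_servers:
        return business_servers[business_id]
    return default_server


def misplaced_users(users, business_servers):
    """Find users who are not on their business's server.

    `users` maps username -> (business_id, current_server).
    Returns a list of (username, current_server, wanted_server) tuples,
    which a periodic job could then feed to whatever moves mailboxes.
    """
    moves = []
    for name, (business_id, current) in sorted(users.items()):
        wanted = business_servers.get(business_id)
        if wanted is not None and wanted != current:
            moves.append((name, current, wanted))
    return moves
```

The scan is idempotent: a user who is already on the right server never shows
up in the list, so running it every hour only produces work after an import.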

> > Here are our deal-breaker requirements:
> > 
> > 1) unified murder - we don't want to run both a frontend AND a backend
> >    imapd process  for every single connection.  We already have nginx,
> >    which is non-blocking, for the initial connection and auth handling.
> > 
> 
> There's one particular "problem" with using NGINX as the IMAP proxy -- 
> it requires that external service that responds with the address to 
> proxy to.

T108

> I say "problem" in quotes to emphasize I use the term "problem" very 
> loosely -- whether it be a functioning backend+mupdate+frontend or a 
> functioning backend+mupdate+frontend+nginx+service is a rather futile 
> distinction, relatively speaking.

Sure, but backend+distributed mailbox service+nginx would be a much
simpler setup.
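
For readers who haven't set this up: the external service that nginx needs is
an HTTP endpoint for its mail proxy's auth_http directive.  nginx sends the
login details as request headers (Auth-User, Auth-Pass, Auth-Protocol) and
expects Auth-Status, Auth-Server and Auth-Port in the response.  A rough
sketch of such a responder follows.  The BACKENDS table is a made-up stand-in
for the real user-to-backend lookup, and password checking is deliberately
left out, so this is not usable as-is.

```python
from http.server import BaseHTTPRequestHandler

# Hypothetical user -> (backend IP, IMAP port) table.  A real service would
# consult whatever database records where each user lives.  nginx expects an
# IP address in Auth-Server, not a hostname.
BACKENDS = {"alice@example.com": ("10.0.0.11", 143)}


def auth_headers(request_headers, backends=BACKENDS):
    """Build the response headers for one nginx auth_http request.

    NOTE: Auth-Pass is ignored here; a real service must verify it.
    """
    user = request_headers.get("Auth-User", "")
    backend = backends.get(user)
    if backend is None:
        # Any Auth-Status other than "OK" is reported to the client as an
        # error; Auth-Wait asks nginx to pause before allowing a retry.
        return {"Auth-Status": "Invalid login or password", "Auth-Wait": "3"}
    ip, port = backend
    return {"Auth-Status": "OK", "Auth-Server": ip, "Auth-Port": str(port)}


class AuthHandler(BaseHTTPRequestHandler):
    """Answers nginx's GET with headers only; the body is empty."""

    def do_GET(self):
        self.send_response(200)
        for name, value in auth_headers(self.headers).items():
            self.send_header(name, value)
        self.end_headers()
```

To try it, something like
`http.server.HTTPServer(("127.0.0.1", 8080), AuthHandler).serve_forever()`
would serve it, with nginx's auth_http pointed at that address.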

> I don't understand how this is an established problem already -- or not 
> as much as I probably should. If 72k users can be happy on a murder 
> topology, surely 4 times as many could also be happy -- inefficiencies 
> notwithstanding, they're "only" a vertical scaling limitation.

"happy" is a relative term. You can get most of the benefit from using
foolstupidclients, but otherwise you're paying O(N) for the number of
users - and taking 4 times as long to do every list command is not ideal.

> That said of course I understand it has its upper limit, but getting 
> updated lookup tables in-memory pushed there when an update happens 
> would seem to resolve the problem, no?

Solving the problem does indeed mean having some kind of index/lookup table.
One way would be to do this all in-memory, with some sort of LIST service which
scans the mailboxes.db at startup time and then gets updates from mupdate.

> This is not necessarily what a failed mupdate server does though -- new 
> folders and folder renames (includes deletions!) and folder transfers 
> won't work, but the cluster remains functional under both the 
> SMTP-to-backend and LMTP-proxy-via-frontend topology -- autocreate for 
> Sieve fileinto notwithstanding, and mailbox hierarchies distributed over 
> multiple backends when also using the SMTP-to-backend topology 
> notwithstanding.

Yeah, until you start up the mupdate server again or configure a new one.
Again, you get user visible failures (folder create, etc) while the server is
down.  The reason I want to shave off all these edge cases is that in a
big enough system over a long enough time, you will hit every one of them.

> > Thankfully, the state of the art in distributed databases has moved a
> > long way since mupdate was written.
> 
> I have also written a one-or-two line patch that enables backends that 
> replicate, to both be a part of the same murder topology, to prevent the 
> replica "slave" from bailing out on the initial creation of a mailbox -- 
> consulting mupdate and finding that it would already exist.

Interesting.  Does it also handle the case where the same mailbox gets
accidentally created on two servers which aren't replica pairs though?
Or do you get a mailbox fork?

(it would be easier with the new mailboxes.db format, if mupdate supported
it, because you'd have the uniqueid which is generated by uuidgen - if THAT
matches, then you're a replica pair, not a newly CREATEd mailbox)
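
To make the uniqueid idea concrete, here is a small sketch of the check a
mailbox registry could make when a backend announces a mailbox.  It assumes
the registry stores a (server, uniqueid) pair per mailbox name, which mupdate
does not do today; the function and return values are hypothetical.

```python
def classify_create(existing, name, server, uniqueid):
    """Decide what an announced mailbox is, relative to the registry.

    `existing` maps mailbox name -> (server, uniqueid).
    Returns one of:
      "new"      - name not seen before, safe to register
      "same"     - same server re-announcing the same mailbox
      "replica"  - different server, same uniqueid: a replica pair
      "conflict" - different uniqueid under the same name: a mailbox fork
    """
    current = existing.get(name)
    if current is None:
        return "new"
    current_server, current_id = current
    if current_id == uniqueid:
        return "same" if current_server == server else "replica"
    return "conflict"
```

Only the "conflict" case needs to be refused; the other three are harmless.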

> > Along with this, we need a reverse lookup for ACLs, so that any one user
> > doesn't ever need to scan the entire mailboxes.db.  This might be hooked
> > into the distributed DB as well, or calculated locally on each node.
> > 
> 
> I reckon this may be the "rebuild more efficient lookup trees in-memory 
> or otherwise" I may have referred to just now just not in so many words.

Sounds compelling. The only problem I can see is if startup is really
expensive.  There's also a problem with "in-memory" with separate
processes.

The minimum viable product for the fast LIST is basically this:

* convert mupdated to use an sqlite file with the reverse indexes built in to it instead of the mailboxes.db
* convert the LIST code and mboxlist_lookup to use the sqlite file
* even if not in a murder, also write mboxlist_* updates to the sqlite file
* leave all the existing murder stuff apart from this

sqlite is already embedded for other things, so we don't add any dependencies.
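
As a rough illustration of what "reverse indexes built in" could look like,
here is a sketch using Python's sqlite3 module.  The schema, table names and
functions are assumptions made for this example and are not an existing Cyrus
format.  The point is that the acl table is keyed by userid first, so finding
one user's visible mailboxes is an index lookup rather than a scan of every
mailbox.

```python
import sqlite3

# Hypothetical schema.  The (userid, mailbox) primary key doubles as the
# reverse index: all rows for one userid are adjacent in it.
SCHEMA = """
CREATE TABLE mailboxes (
    name   TEXT PRIMARY KEY,
    server TEXT NOT NULL
);
CREATE TABLE acl (
    userid  TEXT NOT NULL,
    mailbox TEXT NOT NULL,
    rights  TEXT NOT NULL,
    PRIMARY KEY (userid, mailbox)
);
"""


def open_db(path=":memory:"):
    """Create a fresh database with the sketch schema."""
    db = sqlite3.connect(path)
    db.executescript(SCHEMA)
    return db


def set_mailbox(db, name, server, acl):
    """Create or update a mailbox and replace its ACL in one transaction.

    `acl` maps userid -> rights string.
    """
    with db:  # commits on success, rolls back on exception
        db.execute(
            "INSERT OR REPLACE INTO mailboxes (name, server) VALUES (?, ?)",
            (name, server),
        )
        db.execute("DELETE FROM acl WHERE mailbox = ?", (name,))
        db.executemany(
            "INSERT INTO acl (userid, mailbox, rights) VALUES (?, ?, ?)",
            [(userid, name, rights) for userid, rights in acl.items()],
        )


def list_for_user(db, userid):
    """Return (mailbox, server) for every mailbox the user has an ACL on."""
    rows = db.execute(
        "SELECT m.name, m.server FROM acl a "
        "JOIN mailboxes m ON m.name = a.mailbox "
        "WHERE a.userid = ? ORDER BY m.name",
        (userid,),
    )
    return rows.fetchall()
```

Because the mailbox row and its ACL rows change in the same transaction, a
reader never sees a mailbox whose reverse index is stale.  This sketch only
covers ACL entries for individual userids; group and "anyone" identifiers
would need extra handling.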

> > And that's pretty much it.  There are some interesting factors around
> > replication, and I suspect the answer here is to have either multi-
> > value support or embed the backend name into the mailboxes.db key
> > (postfix) such that you wind up listing the same mailbox multiple
> > times.
> 
> In a scenario where only one backend is considered "active" for the 
> given (set of) mailbox(es), and the other is "passive", this has been 
> more of a one-line patch in mupdate plus the proper infrastructure in 
> DNS/keepalived type of failover service IP addresses than it has been 
> about allowing duplicates and suppressing them.

What is this one line patch?

Bron.

-- 
  Bron Gondwana
  brong at fastmail.fm