What would it take for FastMail to run murder

Jeroen van Meeuwen (Kolab Systems) vanmeeuwen at kolabsys.com
Tue Mar 17 18:00:52 EDT 2015

On 2015-03-14 22:48, Bron Gondwana wrote:
> On Sun, Mar 15, 2015, at 07:18 AM, Jeroen van Meeuwen (Kolab Systems) 
> wrote:
>> How, though, do you "ensure" that a mailbox for a new user in such
>> business is created on the same backend as all the other users of said
>> business?
> If the business already exists, the create user code will fetch the
> server name from the business database table and make that the
> creation server. There's a cron job which runs every hour and looks
> for users who aren't on the right server, so if we import a user to
> the business, they get moved.

Right -- so you seem to "agree" that "one business" is limited to "one 
backend server", which is precisely what the larger businesses that are 
our customers need to work around, when the number of mailboxes is 
typically "tens of thousands", and the mechanism you describe "stops 
short" at that scale.

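For illustration, the routing described above could be sketched roughly 
like this -- a minimal Python sketch in which the dictionaries stand in 
for the actual database tables (all names here are invented, not 
FastMail's schema):

```python
# Sketch of "create user on the business's backend" routing.
BUSINESS_SERVERS = {"acme": "backend3.example.com"}  # business -> backend
USER_BUSINESS = {}   # user -> business
USER_LOCATION = {}   # user -> backend currently holding the mailbox
DEFAULT_SERVER = "backend1.example.com"

def create_user(user, business):
    # New users land on the business's backend, if one is recorded.
    USER_BUSINESS[user] = business
    USER_LOCATION[user] = BUSINESS_SERVERS.get(business, DEFAULT_SERVER)

def hourly_mover():
    # Cron-style pass: relocate users not on their business's backend.
    moved = []
    for user, business in USER_BUSINESS.items():
        target = BUSINESS_SERVERS.get(business)
        if target and USER_LOCATION[user] != target:
            USER_LOCATION[user] = target   # stand-in for a mailbox XFER
            moved.append(user)
    return moved

# An imported user initially placed elsewhere gets picked up later:
USER_BUSINESS["bob"] = "acme"
USER_LOCATION["bob"] = "backend9.example.com"
```
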
>> There's one particular "problem" with using NGINX as the IMAP proxy --
>> it requires an external service that responds with the address to
>> proxy to.
> T108
>> I say "problem" in quotes to emphasize I use the term "problem" very
>> loosely -- whether it be a functioning backend+mupdate+frontend or a
>> functioning backend+mupdate+frontend+nginx+service is a rather futile
>> distinction, relatively speaking.
> Sure, but backend+distributed mailbox service+nginx would be a much
> simpler setup.

Yes, T108 here ;-)
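
For reference, the external service NGINX consults is a small HTTP 
endpoint speaking the mail-proxy authentication protocol: NGINX passes 
the login in an Auth-User header and expects Auth-Status, Auth-Server 
and Auth-Port headers back. A minimal sketch (the user-to-backend table 
is invented):

```python
# Minimal auth_http responder for NGINX's mail proxy module.
# NGINX sends the login in the Auth-User header; we reply with the
# backend to proxy the IMAP session to, via Auth-Server/Auth-Port.
from http.server import BaseHTTPRequestHandler, HTTPServer

USER_BACKEND = {"jdoe": "10.0.0.11"}   # invented user -> backend table

class AuthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        backend = USER_BACKEND.get(self.headers.get("Auth-User", ""))
        self.send_response(200)        # the protocol always answers 200
        if backend:
            self.send_header("Auth-Status", "OK")
            self.send_header("Auth-Server", backend)
            self.send_header("Auth-Port", "143")
        else:
            self.send_header("Auth-Status", "Invalid login or password")
        self.end_headers()

    def log_message(self, *args):      # keep the sketch quiet
        pass

def serve(port=8080):
    HTTPServer(("127.0.0.1", port), AuthHandler).serve_forever()
```

On the NGINX side this would be wired up with something like 
`auth_http http://127.0.0.1:8080/auth;` inside the mail block.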

>> I don't understand how this is an established problem already -- or
>> not as much as I probably should. If 72k users can be happy on a
>> murder topology, surely 4 times as many could also be happy --
>> inefficiencies notwithstanding, they're "only" a vertical scaling
>> limitation.
> "happy" is a relative term. You can get most of the benefit from using
> foolstupidclients, but otherwise you're paying O(N) for the number of
> users - and taking 4 times as long to do every list command is not 
> ideal.

Sure -- the majority of the processing delay seems to lie on the client 
side, however, in taking off the wire what is being dumped onto it.

You're far better positioned to speak to what is in a mailboxes.db 
and/or its in-memory representation by the time you get to scanning the 
complete list for items to which a user might have access; I can only 
say we've not found this particular part to be as problematic for tens 
of thousands of users (yet).

>> That said, of course I understand it has its upper limit, but getting
>> updated lookup tables in-memory pushed there when an update happens
>> would seem to resolve the problem, no?
> Solving the problem is having some kind of index/lookup table indeed.
> Whether this is done all in-memory by some sort of LIST service which
> scans the mailboxes.db at startup time and then gets updates from 
> mupdate.

For frontends specifically ("discrete murder"), we're able to use tmpfs 
for mailboxes.db (and some other data, of course), which solves a bit 
of the I/O constraint -- but it's still a flat list of folders with 
parameters indicating whether the user has access. What I meant was 
that perhaps the list can (in addition) be inverted into a list of 
users with their folders (and rights?).
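
A hedged sketch of that inversion (the record layout here is invented; 
real mailboxes.db entries carry more fields than just an ACL):

```python
# Invert a flat mailbox -> ACL listing into user -> folders,
# so LIST doesn't have to scan every entry in the mailbox list.
# MAILBOXES is an invented stand-in for mailboxes.db.
MAILBOXES = {
    "user/alice":      {"alice": "lrswipkxtecda"},
    "user/alice/Sent": {"alice": "lrswipkxtecda"},
    "shared/announce": {"alice": "lrs", "bob": "lrs"},
}

def invert(mailboxes):
    # Build the reverse index: user -> {folder: rights}.
    by_user = {}
    for folder, acl in mailboxes.items():
        for user, rights in acl.items():
            by_user.setdefault(user, {})[folder] = rights
    return by_user

def fast_list(index, user):
    # O(size of the user's own list) instead of O(all mailboxes).
    return sorted(index.get(user, {}))
```
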

>> This is not necessarily what a failed mupdate server does though --
>> new folders and folder renames (includes deletions!) and folder
>> transfers won't work, but the cluster remains functional under both
>> the SMTP-to-backend and LMTP-proxy-via-frontend topology --
>> autocreate for Sieve fileinto notwithstanding, and mailbox
>> hierarchies distributed over multiple backends when also using the
>> SMTP-to-backend topology notwithstanding.
> Yeah, until you start up the mupdate server again or configure a new
> one. Again, you get user visible failures (folder create, etc) while
> the server is down.  The reason I want to shave off all these edge
> cases is that in a big enough system over a long enough time, you
> will hit every one of them.

We promote a standby frontend, not otherwise used, to become the new 
mupdate server. The interruption is a matter of seconds this way, 
unless of course you're in the typical stalemate.

>> > Thankfully, the state of the art in distributed databases has moved a
>> > long way since mupdate was written.
>> I have also written a one-or-two line patch that enables backends
>> that replicate to both be a part of the same murder topology, to
>> prevent the replica "slave" from bailing out on the initial creation
>> of a mailbox -- consulting mupdate and finding that it would already
>> exist.
> Interesting.  Does it also handle the case where the same mailbox gets
> accidentally created on two servers which aren't replica pairs though?
> Or do you get a mailbox fork?

The race condition is not addressed with it, just as it is not 
addressed without it.

It solely makes the MUPDATE server not reject the reservation request 
from a server that uses the same "servername" if it already has an entry 
for the same "servername!partition", so that the replica successfully 
creates the local copy -- after which replication is happy.

So this would build a scenario in which:

   "pair-1-replica-1.example.org" and "pair-1-replica-2.example.org" 
present themselves as "pair-1.example.org";

   a DNS IN A RR is created for the fail-over address(es) for 
"pair-1.example.org" and attached to whichever replica in the pair is 
considered the active node.

Both replicas would be configured to replicate to one another, which 
works in a PoC scenario but seems to require lmtpd delivery over 
AF_INET.
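
In configuration terms, assuming the murder-related servername option 
is what carries the shared identity, both replicas might look something 
like this (a sketch, not a tested setup):

```
# On both pair-1-replica-1.example.org and pair-1-replica-2.example.org:
servername: pair-1.example.org

# Each node replicates to the other (mirrored on the second node):
sync_host: pair-1-replica-2.example.org
```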

>> > Along with this, we need a reverse lookup for ACLs, so that any one
>> > user
>> > doesn't ever need to scan the entire mailboxes.db.  This might be
>> > hooked
>> > into the distributed DB as well, or calculated locally on each node.
>> >
>> I reckon this may be the "rebuild more efficient lookup trees
>> in-memory or otherwise" I may have referred to just now, just not in
>> so many words.
> Sounds compelling. The only problem I can see is if startup is really
> expensive.  There's also a problem with "in-memory" with separate
> processes.

I suppose another problem is updates to mailboxes.db, although I 
suppose this would mean updating the in-memory lookup tree and then 
syncing it to disk.

Would using shared memory address the in-memory problem? Admittedly I've 
never coded any such, so I'm out of my comfort zone (again).

> The minimum viable product for the fast LIST is basically this:
> * convert mupdated to use an sqlite file with the reverse indexes
>   built in to it instead of the mailboxes.db
> * convert the LIST code and mboxlist_lookup to use the sqlite file
> * even if not in a murder, also write mboxlist_* updates to the
>   sqlite file
> * leave all the existing murder stuff apart from this
> sqlite is already embedded for other things, so we don't add any
> dependencies.

I've had many issues, though, with parallel (write) access by multiple 
processes to a single SQLite database file, and with needing to vacuum 
the database file after not at all too many mutations (thousands) to 
keep things from slowing down.

Is using SQLite for mailboxes.db not going to run into this sort of 
problem?

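For what it's worth, WAL journaling and a busy timeout mitigate -- 
though don't eliminate -- multi-process write contention; a small 
sketch (the schema is invented, not Cyrus's):

```python
# Sketch: open an SQLite file the way a multi-process mailbox-list
# writer might, to reduce (not eliminate) lock contention.
import sqlite3

def open_mboxlist(path):
    conn = sqlite3.connect(path, timeout=5.0)  # wait on locks up to 5s
    # WAL lets readers proceed while one writer appends to the log.
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute("""CREATE TABLE IF NOT EXISTS mailboxes (
                        name TEXT PRIMARY KEY,
                        location TEXT,
                        acl TEXT)""")
    return conn

conn = open_mboxlist(":memory:")   # real use: a shared file path
conn.execute("INSERT INTO mailboxes VALUES (?, ?, ?)",
             ("user/alice", "backend1!default", "alice lrswipkxtecda"))
```
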
>> > And that's pretty much it.  There are some interesting factors around
>> > replication, and I suspect the answer here is to have either multi-
>> > value support or embed the backend name into the mailboxes.db key
>> > (postfix) such that you wind up listing the same mailbox multiple
>> > times.
>> In a scenario where only one backend is considered "active" for the
>> given (set of) mailbox(es), and the other is "passive", this has been
>> more of a one-line patch in mupdate plus the proper infrastructure in
>> DNS/keepalived type of failover service IP addresses than it has been
>> about allowing duplicates and suppressing them.
> What is this one line patch?

I can't find the actual patch file, so I must have dropped it, but it's 
at imap/mupdate.c line 1609: compare the m->location found, if any, to 
the const char *location passed along to cmd_set(), and if they're 
(exactly) equal, don't bail out.

Kind regards,

Jeroen van Meeuwen

Systems Architect, Kolab Systems AG

e: vanmeeuwen at kolabsys.com
m: +41 79 951 9003
w: https://kolabsystems.com

pgp: 9342 BF08

More information about the Cyrus-devel mailing list