What would it take for FastMail to run murder

Tue Mar 17 20:51:55 EDT 2015

On Wed, Mar 18, 2015, at 09:00 AM, Jeroen van Meeuwen (Kolab Systems) wrote:
> On 2015-03-14 22:48, Bron Gondwana wrote:
> > On Sun, Mar 15, 2015, at 07:18 AM, Jeroen van Meeuwen (Kolab Systems) 
> > wrote:
> >> How, though, do you "ensure" that a mailbox for a new user in such
> >> business is created on the same backend as all the other users of
> >> said business?
> >
> > If the business already exists, the create user code will fetch the
> > server name from the business database table and make that the
> > creation server.
> >
> > There's a cron job which runs every hour and looks for users who
> > aren't on the right server, so if we import a user to the business,
> > they get moved.
> >
>
> Right -- so you seem to "agree" that "one business" is limited to "one
> backend server", which is precisely what the larger businesses that
> are our customers need to work around, when the number of mailboxes is
> typically "tens of thousands", and the mechanism you describe "stops
> working".

Exactly. It's a limit that we want to avoid, hence looking for a
murder-that-scales.

> > "happy" is a relative term. You can get most of the benefit from
> > using foolstupidclients, but otherwise you're paying O(N) for the
> > number of users - and taking 4 times as long to do every list
> > command is not ideal.
>
> Sure -- the majority of the processing delays seem to lay on the
> client side taking off the wire what is being dumped on it, however.

With over a million mailboxes in a single mailboxes.db I was seeing
parsing cost go up, particularly with DLIST.  I've written a dlist_sax
interface, which cuts out some of the cost, but it's still not free.

The easiest way to make things more efficient is not do them at all ;)

> You're far better entitled to speak to what is in a mailboxes.db
> and/or its in-memory representation by the time you get to scanning
> the complete list for items to which a user might have access, I just
> have to say we've not found this particular part to be as problematic
> for tens of thousands of users (yet).

It's going to hurt when you get to millions.  That's our issue.  If we
merged all the mailboxes.db across all our servers into one place,
that's a huge database.

> For frontends specifically ("discrete murder"), we're able to use
> tmpfs for mailboxes.db (and some other stuff of course) solving a
> bit of the
> I/O constraints, but it's still a list of folders with parameters
>   containing whether the user has access, and what I meant was perhaps
>   the list can (in addition) be inverted to be a list of users with
>   folders (and rights?).

That's pretty much exactly the idea.  That and avoiding the SPOF that's
a murder master right now.  They're kind of separate goals, we could do
one without the other.

> We promote a standby frontend not otherwise used, to become the new
> mupdate server. The interruption is a matter of seconds this way,
> unless of course you're in the typical stalemate.

Hmm.... so maybe it's affordable.  It scales up with number-of-servers
as well though.  Making sure it's up to date costs at least O(number of
backends).

> > Interesting.  Does it also handle the case where the same mailbox
> > gets accidentally created on two servers which aren't replica pairs
> > though? Or do you get a mailbox fork?
> >
>
> The race condition is not addressed with it, like it is not addressed
> currently.

I'm not 100% happy living with unaddressed race conditions.  Addressing
this would be an important part of making FastMail happy to run it.

> It solely makes the MUPDATE server not reject the reservation
> request from a server that uses the same "servername" if it already
> has an entry for the same "servername!partition", so that the
> replica successfully creates the local copy -- after which
> replication is happy.

Yeah, that makes sense.  Of course, the backend should probably not be
"reserving" so much.  There are two things conflated here:

1) I'm running cmd_create in an IMAPd and I want to see if this folder
   already exists.

2) I'm a replica backend getting a copy of an existing folder (or
   indeed, a backend which already has a folder) and I'm informing
   mupdate of the fact.

Those two should be treated differently.  The first is "does this
already exist", which is a legitimate question to ask.  The second
should always succeed. MUPDATE is a representation of facts, and the
backends are the masters of those facts.

> So this would build a scenario in which:
>
>    "pair-1-replica-1.example.org" and "pair-1-replica-2.example.org"
>    present themselves as "pair-1.example.org"
>
>    A DNS IN A RR is created for the fail-over address(es) for "pair-
>    1.example.org" and attached to whichever replica in the pair is
>    considered the active node.
>
> Both replicas would be configured to replicate to one another, which
> works in a PoC scenario but may seem to require lmtpd/AF_INET
> delivery.

So they both have the same server name in mupdate.

My plan is that they have different server names in mupdate.  There's a
separate channel somehow to say which is the primary out of those
servers, which can be switched however (failover tooling) based on which
servers are up, but the murder has the facts about where the mailbox
really exists.

It may even have statuscache.  Man, how awesome would distributed
statuscache be.

So there are multiple records for the same mailbox, with different server
names, in the murder DB.

> > Sounds compelling. The only problem I can see is if startup is
> > really expensive.  There's also a problem with "in-memory" with
> > separate processes.
> >
>
> I suppose another problem is updates to mailboxes.db, although I
> suppose this would mean updating the in-memory lookup tree then
> syncing it to disk.

Absoultely.

> Would using shared memory address the in-memory problem? Admittedly
> I've never coded any such, so I'm out of my comfort zone (again).

I'm not really comfortable with it either.  I'd prefer a mailboxes daemon
with its own query language over a unix socket, because it punts a lot
of the synchronisation problems.

> > The minimum viable product for the fast LIST is basically this:
> >
> > * convert mupdated to use an sqlite file with the reverse indexes
> >   built in to it instead of the mailboxes.db
> > * convert the LIST code and mboxlist_lookup to use the sqlite file
> > * even if not in a murder, also write mboxlist_* updates to the
> >   sqlite file
> > * leave all the existing murder stuff apart from this
> >
> > sqlite is already embedded for other things, so we don't add any
> > dependencies.
> >
>
> I've had many issues with parallel (write) access by multiple
> processes to a single sqlite database file, though, and needing to
> vacuum the database file after not at all too many mutations
> (thousands) as well, in order to keep things from slowing down.

Another reason to have a single thread doing the writes :)

> Is using SQLite for mailboxes.db not going to enter this sort of
> problem space?

Perhaps.  We'd have to see how it copes in reality of course.  FastMail
is big enough to test this pretty well!

> I can't find the actual patch file, so I must have dropped it, but
> it's imap/mupdate.c line 1609 comparing the m->location found, if any,
> to the const char *location passed along to cmd_set(), and if they're
> (exactly) equal, not bailing.

Sure.  As I said above, I think the real solution is that sync_server
creating a mailbox should always be allowed to assert the fact to the
murder.  It's not a "please may I", it's a "this is how it is".

Bron.

-- 
  Bron Gondwana
  brong at fastmail.fm