painful mupdate syncs between front-ends and database server
Michael Bacon
baconm at email.unc.edu
Fri Oct 30 15:24:25 EDT 2009
I apologize for not responding sooner. I've had my head down in the
code and doing some tests, including playing with Bron's patch.
I haven't had the guts to roll the patched CVS version into production as
our primary mupdate server, but I did put it on a test machine in replica
mode. My test was on a clean server (no pre-existing mailboxes.db), and it
didn't appear noticeably faster. I don't have hard numbers, but it still
took well over 10 minutes to complete the sync and write it out to disk.
The odd thing is that we see major performance differences depending on
which disks the client's mailboxes.db lives on. For instance, if we put the
mailboxes.db (and the whole metapartition) on superfast Hitachi disks over
a 4 Gb SAN connection, the sync will finish in just under three minutes.
Still, even with that big a difference, we don't see any sign of I/O
contention in the iostat output. The k/sec figures are well within what
the drives should be able to handle, and the % blocking stays in the low
single digits most of the time, peaking into the 15-25 range from time to
time but not staying there. It does make me wonder whether what we're
seeing is related to I/O latency rather than throughput.
I haven't delved deep into the skiplist code, but I almost wonder if at
least some of the slowness is the foreach iteration on the mupdate master
in read mode. On all systems in the murder, we'll see instances where the
mupdate process goes into a spin where, in truss, it's an endless repeat of
fcntl, stat, fstat, fcntl, thousands of times over. Each call executes
extremely quickly, but I do wonder if we're treating something that takes
very little time per call as if it takes no time at all, when the total
becomes significant on an 800k-entry mailboxes database.
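A rough back-of-envelope, with a guessed per-call cost rather than a
measured one: if each fcntl/stat/fstat/fcntl cycle is on the order of 50
microseconds, then
    800,000 records x ~50 us = ~40 seconds
for a single pass over the database, and anything that makes several
passes, or re-reads each record along the way, multiplies that directly.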
Finally, as to how we get into this situation in the first place: in our
environment and configuration, the mupdate master can handle up to three
replicas connected to it before it goes into a bad state under high load.
I've never caught it at the point of actually going downhill, but my
impression is that so many processes start demanding responses from the
mupdate server that the persistent connections the slave mupdates hold to
the master time out and disconnect, then reconnect and try to re-sync. (At
least that's what it looks like in the logs.) Incoming IMAP connections
won't do it, but lmtpproxy connections seem to have a knack for it, since
for whatever reason they appear to generate "kicks" at a pretty high rate.
Still looking, but open to suggestions here.
Michael Bacon
UNC Chapel Hill
--On October 20, 2009 12:54:45 PM +1100 Bron Gondwana <brong at fastmail.fm>
wrote:
>
>
> On Mon, 19 Oct 2009 16:38 -0400, "Michael Bacon" <baconm at email.unc.edu>
> wrote:
>> When we spec'ed out our servers, we didn't put much I/O capacity into
>> the front-end servers -- just a pair of mirrored 10k disks doing the
>> OS, the logging, the mailboxes.db, and all the webmail action going on
>> in another solaris zone on the same hardware. We thought this was
>> sufficient given the fact that no real permanent data lives on these
>> servers, but it turns out that while most of the time it's fine, if
>> the mupdate processes ever decide they need to re-sync with the master,
>> we've got 6 minutes of trouble
>> ahead while it downloads and stores the 800k entries in the mailboxes.db.
>
> Have you checked if it's actually IO limited? Reading the code, it
> appears to do the entire sync in a single transaction, which is bad
> because it locks the entire mailboxes.db for the entire time.
>
>> During these sync periods, we see two negative impacts. The first is
>> lockup on the mailboxes.db on the front-end servers, which slows down
>> both
>> accepting new IMAP/POP connections and the reception of incoming
>> messages.
>> (The front-ends also accept LMTP connections from a separate pair of
>> queueing hosts, then proxy those to the back-ends.) The second is that,
>> because the front-ends go into a
>
> Lost you there - I'm assuming it causes a nasty load spike when it
> finishes too. Makes sense.
>
>> I suppose this is why Fastmail and others ripped out the proxyds and
>> replaced
>> them with nginx or perdition. Currently we still support GSSAPI as an
>> auth
>> mechanism, which kept me from going that direction, but given the
>> problems
>> we're seeing, I'd be open to architectural suggestions on either how to
>> tie
>> perdition or nginx to the MUPDATE master (because we don't have the
>> back-ends split along any discernible lines at this point), or
>> suggestions
>> on how to make the master-to-frontend propagation faster or less painful.
>
> We didn't ever go with murder. All our backends are totally independent.
>
>> Sorry for the long message, but it's not a simple problem we're fighting.
>
> No - it's not! I wonder if a better approach would be to batch the
> mailboxes.db updates into groups of no more than (say) 256.
>
> Arrgh - stupid, stupid, stupid. Layers of abstraction mean we have a nice
> fast "foreach" going on, and then throw away the data and dataptr fields,
> only to fetch the data field again right afterwards. It's very inefficient.
> I wonder what percentage of the time is just reading stuff from the
> mailboxes.db?
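>
> Roughly, the shape of the problem (illustrative C only - this isn't the
> exact cyrusdb callback signature):
>
>     #include <stddef.h>
>
>     /* stand-in for whatever actually sends/updates one mailbox entry */
>     int resync_one_mailbox(void *rock, const char *key, size_t keylen);
>
>     /* the foreach already hands the callback each record's data... */
>     int resync_cb(void *rock, const char *key, size_t keylen,
>                   const char *data, size_t datalen)
>     {
>         (void)data; (void)datalen;   /* ...but it gets thrown away here */
>
>         /* and the same record is then fetched again by key - a second
>          * read of every one of the 800k entries */
>         return resync_one_mailbox(rock, key, keylen);
>     }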
>
> Anyway - the bit that's actually going to be blocking you will be the
> mailboxes.db transactions. I've attached a patch. Advance warning - I
> don't use murder, so I haven't done more than compile test it! It SHOULD
> be safe though, it just commits to the mailboxes.db every 256 changes and
> then closes the transaction, which means that things that were queued
> waiting for the lock should get a chance to run before you update the
> next 256 records.
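>
> The shape of the change, in rough C (illustrative names only - this is not
> the actual patch and not the real cyrusdb API):
>
>     #include <stddef.h>
>
>     #define BATCH_SIZE 256
>
>     struct db;     /* opaque database handle (stand-in) */
>     struct txn;    /* open transaction (stand-in) */
>     struct update { const char *key; const char *data; };  /* simplified */
>
>     int db_store(struct db *db, const char *key, const char *data,
>                  struct txn **tid);
>     int db_commit(struct db *db, struct txn *tid);
>     int db_abort(struct db *db, struct txn *tid);
>
>     /* Apply the pending updates in batches of BATCH_SIZE, committing and
>      * releasing the mailboxes.db lock between batches so that anything
>      * queued up waiting for the lock gets a chance to run. */
>     int apply_updates(struct db *db, struct update *u, size_t count)
>     {
>         struct txn *tid = NULL;
>         size_t i, in_batch = 0;
>         int r = 0;
>
>         for (i = 0; i < count; i++) {
>             /* joins the open transaction, or starts a new one */
>             r = db_store(db, u[i].key, u[i].data, &tid);
>             if (r) break;
>
>             if (++in_batch == BATCH_SIZE) {
>                 r = db_commit(db, tid);    /* drops the write lock here */
>                 tid = NULL;
>                 in_batch = 0;
>                 if (r) return r;
>             }
>         }
>
>         if (r) {
>             if (tid) db_abort(db, tid);    /* roll back the partial batch */
>             return r;
>         }
>         return tid ? db_commit(db, tid) : 0;  /* commit the final short batch */
>     }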
>
> The patch is against current CVS (well, against my git clone of current
> CVS anyway).
>
> Bron.
> --
> Bron Gondwana
> brong at fastmail.fm
>