painful mupdate syncs between front-ends and database server

Bron Gondwana brong at fastmail.fm
Sat Oct 31 02:28:40 EDT 2009


On Fri, Oct 30, 2009 at 03:24:25PM -0400, Michael Bacon wrote:
> I haven't had the guts to roll the patched CVS version into
> production as our primary mupdate server, but I did put it in on a
> test machine in replica mode.  My measurement was on a clean server
> (no pre-existing mailboxes.db), and it didn't appear noticeably
> faster.  I haven't measured hard numbers, but it was still well over
> 10 minutes to complete the sync and write it out to disk.
 
Sorry - I probably didn't explain what the patch does very well!  It
doesn't actually make things run any faster - what it does is break
the one big transaction into lots of small transactions so it doesn't
block everything else from happening while it runs.
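
The shape of it is roughly like this - just an illustrative sketch, not
the actual patch; the db_store/db_commit/db_abort names, the mbentry
struct and BATCH_SIZE are stand-ins rather than the real cyrus db layer
calls:

#include <stddef.h>
#include <string.h>

#define BATCH_SIZE 256

struct db;   /* opaque database handle    */
struct txn;  /* opaque transaction handle */

struct mbentry { const char *name; const char *data; size_t datalen; };

/* hypothetical db layer calls */
int db_store(struct db *db, const char *key, size_t keylen,
             const char *data, size_t datalen, struct txn **tid);
int db_commit(struct db *db, struct txn *tid);
int db_abort(struct db *db, struct txn *tid);

int apply_updates(struct db *db, struct mbentry *entries, size_t count)
{
    struct txn *tid = NULL;
    size_t i;
    int r = 0;

    for (i = 0; i < count; i++) {
        r = db_store(db, entries[i].name, strlen(entries[i].name),
                     entries[i].data, entries[i].datalen, &tid);
        if (r) break;

        /* commit every BATCH_SIZE records instead of holding one big
         * transaction open for the whole sync, so other readers and
         * writers get a look-in between batches */
        if ((i + 1) % BATCH_SIZE == 0) {
            r = db_commit(db, tid);
            tid = NULL;
            if (r) break;
        }
    }

    if (tid) {
        if (r) db_abort(db, tid);
        else r = db_commit(db, tid);
    }
    return r;
}

It's no faster overall, but every commit is a point where something else
can grab the lock.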

> The odd thing is that we see major performance differences depending
> on what disk the client is living on.  For instance, if we put the
> mailboxes.db (and the whole metapartition) on superfast Hitachi
> disks over a 4 GB SAN connection, the sync will finish in just under
> three minutes. Still, even though we see that big difference, we
> don't see any kind of I/O contention in the iostat output.  The
> k/sec figures are well within what the drives should be able to
> handle, and the % blocking stays in low single digits most of the
> time, while peaking into the 15-25 range from time to time, but not
> staying there.  It does make me wonder if what we're seeing is
> related to I/O latency.

Hmm, yeah.

> I haven't delved deep into the skiplist code, but I almost wonder if
> at least some of the slowness is the foreach iteration on the
> mupdate master in read mode.  On all systems in the murder, we'll
> see instances where the mupdate process goes into a spin where, in
> truss, it's an endless repeat of fcntl, stat, fstat, fcntl,
> thousands of times over.  These execute extremely quickly, but I do
> wonder if we're assuming that something that takes very little time
> per call adds up to an insignificant amount of time overall, when it
> becomes significant on an 800k-mailbox database.

Almost definitely.  I didn't even look at that end of the operation, but
I suspect this could be made a lot more efficient with transactional batching
as well.  Either read all 800k records from the database into a linked list
in memory, or do something even trickier.  The even trickier bit will be
pretty nasty though.  Here's what I really want to add to the cyrus db layer:

/* pseudocode */
db->next_record(char *key, int keylen, db_txn *txn);

Which gets the next record AFTER the (possibly non-existent) record pointed
to by key.

This is what foreach uses internally - but by having it directly accessible
you could implement a partial, restartable foreach.
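
For instance (again just a sketch - db_next_record is the hypothetical
call above with the found key handed back to the caller, not an
existing cyrusdb function):

#include <stddef.h>
#include <string.h>

struct db;   /* opaque database handle    */
struct txn;  /* opaque transaction handle */

/* hypothetical: fetch the record AFTER the (possibly non-existent)
 * record pointed to by key; non-zero on error or end of database */
int db_next_record(struct db *db, const char *key, size_t keylen,
                   const char **foundkey, size_t *foundkeylen,
                   struct txn **tid);

typedef int foreach_cb(const char *key, size_t keylen, void *rock);

/* Process at most `limit` records starting after `resumekey`, and
 * leave the last key seen in `resumekey` so the next call can carry
 * on from there rather than walking the whole 800k in one hit. */
int partial_foreach(struct db *db,
                    char *resumekey, size_t *resumekeylen, size_t maxkeylen,
                    size_t limit, foreach_cb *proc, void *rock)
{
    size_t done = 0;
    int r = 0;

    while (done < limit) {
        const char *key;
        size_t keylen;

        /* NULL txn: each lookup stands alone, so nothing stays locked
         * between records */
        r = db_next_record(db, resumekey, *resumekeylen, &key, &keylen, NULL);
        if (r) break;   /* error, or no more records */

        r = proc(key, keylen, rock);
        if (r) break;

        /* remember where we got to for the next call; keys are assumed
         * to fit the caller's buffer, clamp defensively for the sketch */
        if (keylen > maxkeylen) keylen = maxkeylen;
        memcpy(resumekey, key, keylen);
        *resumekeylen = keylen;
        done++;
    }

    return r;
}

A sync could then call that in chunks of a few thousand records and
yield between chunks, instead of holding one foreach open across the
whole database.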

> Finally, as to how we get into this situation in the first place: it
> appears that the mupdate master, in our environment and
> configuration, can handle up to three replicas connected to it
> before it goes into a bad state during high load.  I've never
> caught it at the point of actually going downhill, but my impression
> is that so many processes start demanding responses from the mupdate
> server that the persistent connections the slave mupdates have
> to the master time out and disconnect, then reconnect and try to
> re-sync.  (At least that's what it looks like in the logs.)
> Incoming IMAP connections won't do it, but lmtpproxy connections
> seem to have a knack for it, since for whatever reason they appear
> to generate "kicks" at a pretty high rate.
> 
> Still looking, but open to suggestions here.

I'll have a look at speeding up the mupdate reads.

Bron.
