painful mupdate syncs between front-ends and database server

Sat Oct 31 06:02:13 EDT 2009

On Fri, 30 Oct 2009, Michael Bacon wrote:

> On all systems in the murder, we'll see instances where the mupdate 
> process goes into a spin where, in truss, it's an endless repeat of 
> fcntl, stat, fstat, fcntl, thousands of times over.  These execute 
> extremely quickly, but I do wonder if we're assuming that something that 
> takes very little time takes an insignificant amount of time, when the 
> time involved becomes significant on an 800k mailboxes database.

I agree that latency is probably your problem here.

I'm wondering if fsync() latency on the frontends might be a factor given 
that you report little disk I/O on the mupdate master (IOPS are much more 
important than Kps, but I'm sure that you already know that). The update 
process will only be as fast as its weakest link, and you stated earlier:

> When we spec'ed out our servers, we didn't put much I/O capacity into 
> the front-end servers -- just a pair of mirrored 10k disks doing the OS, 
> the logging, the mailboxes.db, and all the webmail action going on in 
> another solaris zone on the same hardware.

No mention of a battery-backed write cache there, which tends to be fairly 
critical for anything involving fsync(). There is an easy way to find out:

   skiplist_unsafe: 0
   If enabled, this option forces the skiplist cyrusdb backend to not
   sync writes to the disk.  Enabling this option is NOT RECOMMENDED.

For test purposes, at least, you can set this to 1 on the murder frontends 
and ignore the scary warning, given that each frontend's mailboxes.db is 
just a read-only replica of the mupdate master's.
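
If you would rather measure fsync() latency directly before touching the 
Cyrus configuration, something along these lines might do. It is only a 
rough sketch, not anything taken from Cyrus: the function names, the 
512-byte record size and the count of 200 are arbitrary assumptions, and 
small appends to a scratch file only approximate what the skiplist backend 
actually does.

```python
import os
import tempfile
import time


def time_fsyncs(directory, count=200, payload=b"x" * 512):
    """Append `count` small records to a scratch file in `directory`,
    calling fsync() after each one, and return the fsync() latencies
    in seconds. The record size and count are arbitrary assumptions."""
    latencies = []
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        for _ in range(count):
            os.write(fd, payload)
            start = time.perf_counter()
            os.fsync(fd)  # block until the data is reported as on disk
            latencies.append(time.perf_counter() - start)
    finally:
        os.close(fd)
        os.unlink(path)  # remove the scratch file again
    return latencies


def summarize(latencies):
    """Reduce a list of latencies (seconds) to a few figures in ms."""
    ordered = sorted(latencies)
    return {
        "count": len(ordered),
        "mean_ms": 1000 * sum(ordered) / len(ordered),
        "median_ms": 1000 * ordered[len(ordered) // 2],
        "max_ms": 1000 * ordered[-1],
    }
```

Calling summarize(time_fsyncs(path)) with path set to a directory on the 
same filesystem as mailboxes.db, on a frontend and then on the mupdate 
master, should show whether the two differ much. As a rough guide, a 
median of several milliseconds would be consistent with bare spinning 
disks, and well under a millisecond with a write cache in front of them.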

I hope that this isn't a complete red herring. It just struck me that it 
would be a really easy test to make.

-- 
David Carter                             Email: David.Carter at ucs.cam.ac.uk
University Computing Service,            Phone: (01223) 334502
New Museums Site, Pembroke Street,       Fax:   (01223) 334679
Cambridge UK. CB2 3QH.