Implement Cyrus IMAPD in High Load Environment

Tue Sep 29 18:41:29 EDT 2009

On Tue, Sep 29, 2009 at 09:19:13AM -0400, Brian Awood wrote:
> 
> 
> On Tuesday 29 September 2009 @ 06:59, Bernd Petrovitsch wrote:
> > On Mon, 2009-09-28 at 15:33 -0700, Vincent Fox wrote:
> > [...]
> >
> > > Really I've looked at fsck too many times in my life and
> > > don't ever want to again.  Anyone who tells me "oh yes but
> >
> > Especially not in the >100GB area.
> 
> We haven't looked at ZFS, though as Bron suggested, I doubt it will 
> solve all filesystem issues.  We use ext3 on large partitions, 
> ranging from 2-5TB.  While it takes 14-18hrs to fsck, that doesn't 
> really matter if you have replication, we can promote a replica to a 
> primary in about 15 minutes.

15 _MINUTES_?  My god.  Does it need a massage and having its nails done?

It takes us roughly 15 _seconds_ to do a failover.  And most of that is
monitoring that makes sure everything has started up properly (including
database checkpoints completing).

Possibly the secret is that we use IPaddr2 from linux-ha to force ARP
flushes, and we transfer the primary IP address between machines, so
nothing else needs to know - we just shut down one end and bring up the
other with the IP and it's all good.

Our process is:

a) check there is less than 10 kB of files in $conf/sync/ - else abort
b) shut down the master (host A)
c) run sync_client -f $file on each file in $conf/sync/ (if any)
c2) (if any sync fails, restart the master (host A))
d) shut down the replica (host B)
e) update the database with the new master location (host B)
f) start up the new replica (host A)
g) start up the new master (host B)

This means replication starts immediately, because the replica is 
already there when the master starts.
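
For anyone who wants to script something similar, here is a rough sketch
of the ordering above in Python.  It is not our actual tooling.  The
`ops` object and its methods (`stop`, `start`, `sync`, `set_master`) are
hypothetical stand-ins for whatever your site uses to stop and start
Cyrus, run sync_client -f on a log file, and update the database.  The
sketch also assumes that a failed sync in step (c2) ends the failover
attempt after restarting the old master.

```python
import os

# Step (a) threshold: "less than 10 kB of files" in the sync directory.
MAX_BACKLOG_BYTES = 10 * 1024


def sync_backlog(sync_dir):
    """Return the pending sync log files in sync_dir, sorted by name."""
    paths = (os.path.join(sync_dir, name) for name in os.listdir(sync_dir))
    return sorted(p for p in paths if os.path.isfile(p))


def failover(sync_dir, old_master, old_replica, ops):
    """Swap roles between old_master and old_replica.

    `ops` is a hypothetical object providing stop(host), start(host),
    sync(path) -> bool and set_master(host).  Returns True on success.
    """
    files = sync_backlog(sync_dir)

    # (a) abort before touching anything if the backlog is too large
    if sum(os.path.getsize(f) for f in files) >= MAX_BACKLOG_BYTES:
        return False

    # (b) shut down the master
    ops.stop(old_master)

    # (c) replay each outstanding sync log
    for path in files:
        if not ops.sync(path):
            # (c2) a sync failed: bring the old master back
            # (assumption: the failover attempt ends here)
            ops.start(old_master)
            return False

    # (d) shut down the replica
    ops.stop(old_replica)

    # (e) record the new master location
    ops.set_master(old_replica)

    # (f) the old master comes back first, now as the replica ...
    ops.start(old_master)
    # (g) ... so it is already listening when the new master starts
    ops.start(old_replica)
    return True
```

Called as failover(sync_dir, "A", "B", ops), the steps run in the same
order as the list.  Starting the new replica before the new master is
the part that matters for replication picking up immediately.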

Bron.