Recommendations for a 15000 Cyrus Mailboxes

Thu May 10 10:16:07 EDT 2007

On Thu, May 10, 2007 at 08:30:11AM -0400, Nik Conwell wrote:
> 
> On Apr 11, 2007, at 8:37 PM, Bron Gondwana wrote:
> 
> >As for complexity?  It's on the cusp.  We've certainly had many more
> >users on a single instance before, but we prefer to keep under 10k
> >users per Cyrus instance these days for quicker recoverability.  It really
> 
> Hi - just a clarification question - when you say 10k users per
> Cyrus instance and you mentioned in an earlier message each machine
> hosts "multiple (in the teens) of this size stores," does this
> include the replicas?  So for example, one of your xSeries boxes
> might host 16 instances, 8 master, 8 replica, so the box would
> master about 80k users and provide replica backups for another 80k
> users?

Yes, your assumption is correct.  We have both masters and replicas,
though nothing like that organised!  Each machine has replicas spread
over as many different machines as possible (though for historical
reasons there are a couple of pairings that are a bit busy - I'm working
on splitting those up as we get new machines).

That way we can fail all the masters off one machine without
causing too much load on any one other machine, though it does mean
we can only have one or two machines down at any one time, rather than
up to half of them.
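
To make the load argument concrete, here is a minimal sketch (not taken
from any real tooling) that counts how many masters each surviving
machine has to take over when one machine fails.  The machine names and
the `failover_load` helper are hypothetical; the 8-masters-per-box figure
is borrowed from the example in the question.

```python
from collections import Counter


def failover_load(placement, failed):
    """Count the masters each replica machine inherits when `failed` dies.

    `placement` is a list of (master_machine, replica_machine) pairs,
    one pair per Cyrus instance (slot).  Hypothetical data model.
    """
    extra = Counter()
    for master, replica in placement:
        if master == failed:
            extra[replica] += 1
    return dict(extra)


# Strict pairing: all 8 masters on m1 replicate to m2.
paired = [("m1", "m2")] * 8

# Spread layout: m1's 8 masters replicate to 8 different machines.
spread = [("m1", f"m{i}") for i in range(2, 10)]

print(failover_load(paired, "m1"))  # {'m2': 8}
print(failover_load(spread, "m1"))  # one extra master on each of m2..m9
```

With strict pairing one machine absorbs all 8 masters; with the spread
layout no surviving machine takes more than one.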

We actually lost a controller chip in a RAID unit recently and our "hot
spare" turned out to be broken as well, so we had a choice: leave
replication down or expand into the spare slots we had sitting around.
We wound up expanding.  I have a script called sync_all_users which runs
in tandem with monitorsync.  Monitorsync runs from cron every 10 minutes
and checks that sync_client processes are running correctly for each
master slot on a machine.  It will also run sync_client for any leftover
files after a failure, email us about what's happening, restart the
rolling replication, etc.  It's very nice.  It has locking which
integrates with our failover script (which runs replication for any
remaining log files after taking Cyrus down), and so on.
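
The checks described above can be sketched roughly as follows.  This is
not the actual monitorsync script, just an illustration of its decision
logic under stated assumptions: the slot names, the action tuples and the
`plan_monitor_actions` helper are all hypothetical, and a real version
would discover running sync_client processes and leftover log files
itself rather than take them as arguments.

```python
def plan_monitor_actions(master_slots, running_slots, leftover_logs,
                         locked_slots=frozenset()):
    """Decide what a periodic (e.g. cron-driven) check should do per slot.

    master_slots  -- slots that should be master on this machine
    running_slots -- slots that currently have a rolling sync_client
    leftover_logs -- dict: slot -> list of unprocessed log file names
    locked_slots  -- slots locked by another tool (e.g. a failover script)
    """
    actions = []
    for slot in master_slots:
        if slot in locked_slots:
            continue  # someone else owns this slot right now; stay out
        slot_actions = []
        # Replay anything left behind by an earlier failure first.
        for log in leftover_logs.get(slot, []):
            slot_actions.append(("replay_log", slot, log))
        # Then make sure rolling replication is running again.
        if slot not in running_slots:
            slot_actions.append(("restart_rolling", slot))
        # Tell a human whenever we had to intervene.
        if slot_actions:
            slot_actions.append(("notify", slot))
        actions.extend(slot_actions)
    return actions


plan = plan_monitor_actions(
    master_slots=["slot1", "slot2", "slot3"],
    running_slots={"slot1"},
    leftover_logs={"slot2": ["log-1"]},
    locked_slots={"slot3"},
)
print(plan)
```

Keeping the decision step separate from the code that actually starts
processes makes the locking rule easy to check: a locked slot never
produces any action.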

So sync_all_users runs a sync_client -u on every user who is in our
database as "should be active on this machine", cleans out any logs
which were written before it started and then starts rolling replication
on all logs that were written since it started (you could do more clever
stuff with alphabetical time stamping, but it's a bit of a pointless
optimisation; it tends to catch up quickly when not much has changed
anyway).
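
The ordering described above can be sketched like this.  It is an
illustration only, not the real sync_all_users: the function signature,
the `list_logs` callback and the log names are hypothetical, and
`sync_user` stands in for whatever runs sync_client -u for one user.

```python
import time


def sync_all_users(users, list_logs, sync_user, now=time.time):
    """Full resync, then split the logs into 'stale' and 'pending'.

    users     -- users who should be active on this machine
    list_logs -- callable returning dict: log name -> creation time
    sync_user -- callable that fully syncs one user to the replica
    now       -- clock function (injectable so the sketch is testable)

    Returns (stale, pending): logs safe to delete, logs still to replay.
    """
    started = now()

    # 1. Sync every user in full.  Anything logged before `started`
    #    is covered by these full syncs.
    for user in users:
        sync_user(user)

    # 2. Logs older than the start time are redundant; newer ones
    #    record changes made while the full sync was running.
    logs = list_logs()
    stale = sorted(n for n, t in logs.items() if t < started)
    pending = sorted(n for n, t in logs.items() if t >= started)
    return stale, pending


synced = []
stale, pending = sync_all_users(
    users=["user1", "user2"],
    list_logs=lambda: {"old.log": 50.0, "new.log": 150.0},
    sync_user=synced.append,
    now=lambda: 100.0,
)
print(stale, pending)  # ['old.log'] ['new.log']
```

The caller would then delete the stale logs and hand the pending ones to
rolling replication.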

So it took maybe a day to be fully back up to date, which still isn't
ideal, but it was a day of no downtime, just replica unsafety.

Bron.