Clustering and replication

David Lang david.lang at digitalinsight.com
Tue Jan 30 16:22:44 EST 2007


On Mon, 29 Jan 2007, Tom Samplonius wrote:

> ----- "Bron Gondwana" <brong at fastmail.fm> wrote:
>> On Fri, Jan 26, 2007 at 12:20:15PM -0800, Tom Samplonius wrote:
>
>> * the system monitoring scripts do a 'du -s' on the sync directory every
>>   2 minutes and store the value in a database so our status commands can
>>   see if any store is behind (the trigger for noticing is 10kb, that's a
>>   couple of minutes worth of log during the U.S. day).  This also emails
>>   us if it gets above 100kb (approx 20 mins behind)
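
A minimal sketch of the backlog check described above, for anyone wanting to reproduce it. This is not Bron's actual script: the function names are hypothetical, "kb" is assumed to mean 1024 bytes, and file sizes are summed directly rather than shelling out to 'du -s' (which counts disk blocks, so the numbers will differ slightly).

```python
import os

# Thresholds quoted in the post; treating "kb" as 1024 bytes is an assumption.
NOTICE_BYTES = 10 * 1024    # store is considered "behind" above this
ALERT_BYTES = 100 * 1024    # send an email above this

def sync_backlog_bytes(sync_dir):
    """Sum the sizes of all regular files under sync_dir."""
    total = 0
    for root, _dirs, files in os.walk(sync_dir):
        for name in files:
            try:
                total += os.path.getsize(os.path.join(root, name))
            except OSError:
                # The file was rotated away between listing and stat; skip it.
                pass
    return total

def classify_backlog(size_bytes):
    """Map a backlog size to 'ok', 'behind', or 'alert'."""
    if size_bytes > ALERT_BYTES:
        return "alert"
    if size_bytes > NOTICE_BYTES:
        return "behind"
    return "ok"
```

Run from cron every 2 minutes, the result of classify_backlog(sync_backlog_bytes(path)) could be stored in a database or used to trigger the email. The path to the sync directory depends on the local Cyrus configuration and is left to the caller.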
>
>  And what do you do if it gets behind?  I have three Cyrus groups right now, 
> that are never going to catch up.  They log about 20KB in 20 minutes, so the 
> update rate is not that high.  The machines are dedicated, and the replicas 
> aren't doing anything.  tcpdump confirms that there is traffic to the replica, 
> but the entire sync_client is so opaque it is hard to see what it is doing. 
> So sync_client can't keep up at all, and since it also quits from time to 
> time, it gets even worse.
>
>  I'm planning to hack the log, and add some logging to sync_client, 
> particularly to find the number of records per second it is able to process. 
> And then maybe some way to find why it quits all the time.
>
>  Either that, or my only alternative is to switch to using DRBD to sync the 
> filesystem to a standby server.
>
>> * a "monitorsync" process that runs from cron every 10 minutes and reads
>>   the contents of the sync directory, comparing any log-(\d+) file's PID
>>   with running processes to ensure it's actually being run and tries to
>>   sync_client -r -f the file if there isn't one.  It also checks that
>>   there is a running sync_client -r process (no -f) for the store.
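
The PID check in that "monitorsync" description can be sketched roughly as follows. This is an illustration under assumptions, not the real script: the helper names are hypothetical, it assumes a POSIX system where sending signal 0 tests for process existence, and it only finds orphaned log-(\d+) files without re-running anything.

```python
import os
import re

# File names of the form log-<pid>, as described in the post.
LOG_NAME = re.compile(r"^log-(\d+)$")

def pid_is_running(pid):
    """Return True if a process with this PID currently exists (POSIX only)."""
    if pid <= 0:
        return False
    try:
        os.kill(pid, 0)  # signal 0 performs the existence check without signalling
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True

def orphaned_logs(sync_dir):
    """List log-<pid> files in sync_dir whose PID is no longer running."""
    orphans = []
    for name in sorted(os.listdir(sync_dir)):
        match = LOG_NAME.match(name)
        if match and not pid_is_running(int(match.group(1))):
            orphans.append(os.path.join(sync_dir, name))
    return orphans
```

Each path returned would then be handed to 'sync_client -r -f' as described above. Note that a bare PID check can be fooled if the PID has been reused by an unrelated process, so a real script may also want to check the process name.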
>
>  Wow, a lot of protection to protect against sync_client just exiting. 
> sync_client isn't very big, so it shouldn't be that hard to find the different 
> places that it exits, and fix them?
>
>> * a weekly "checkreplication" script which logs in as each user to both
>>   the master and replica via IMAP and does a bunch of lists, examines,
>>   even fetches and compares the output to ensure they are identical.
>>
>> Between all that, I'm pretty comfortable that replication is correct and
>> we'll be told if it isn't.  It's certainly helped us find our share
>> of issues with the replication system!
>
>  Well, I know our replicas are out of sync, so we just don't use them.  I just 
> hope the masters don't fail.  Each pair has about 30,000 accounts, and about 
> 300GB of online mail.

Tom,
   in your situation you may want to seriously look at disabling fsync. Doing so
could let your replicas keep up.

It's definitely not ideal, but if I were forced to choose between

1. single box with fsync and no replica

or

2. master without fsync and replicas without fsync, but up to date

I would choose 2, as it won't lose any data due to a master failing, no matter
what happens on the master, and I'm only vulnerable to something that would take
down both the primary and its replica at the same time (don't have them both on
the same UPS!)

David Lang


More information about the Info-cyrus mailing list