Clustering and replication

Mon Jan 29 19:04:36 EST 2007

----- "Bron Gondwana" <brong at fastmail.fm> wrote:
> On Fri, Jan 26, 2007 at 12:20:15PM -0800, Tom Samplonius wrote:
> > ----- Wesley Craig <wes at umich.edu> wrote:
> > > Close.  imapd, pop3d, lmtpd, and other processes write to the log.
> > > The log is read by sync_client.  This merely tells sync_client what  
> > > (probably) has changed.  sync_client rolls up certain log items, e.g.,
> > > it may decide to compare a whole user's state rather than just  
> > > looking at multiple mailboxes.  Once it decides what to compare, it  
> > > retrieves IMAP-like state information from sync_server (running on
> > > the replica) and pushes those changes that are necessary.
> > 
> >   And this exposes the big weakness with Cyrus syncing:  there is
> > only a single sync_client, and it is very easy for it to get behind.
> 
> Which is why we have the following:
> 
> * a logwatcher on imapd.log which emails us on bailing out or other
>   "unrecognised" log lines

  sync_client prints errors from time to time, but most seem harmless.  It certainly does not print anything like "Exiting..." when it decides to quit.  I don't really know which log lines are bad and which are not.  What do you consider a "recognized" log line?

  In my case, sync_client quits three to four times a day.

> * the system monitoring scripts do a 'du -s' on the sync directory every
>   2 minutes and store the value in a database so our status commands can
>   see if any store is behind (the trigger for noticing is 10kb, that's a
>   couple of minutes worth of log during the U.S. day).  This also emails
>   us if it gets above 100kb (approx 20 mins behind)
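A minimal sketch of the size check described above, in Python.  The 10 KB and 100 KB thresholds come from the quoted message; the function names are made up for illustration, and the sketch sums apparent file sizes rather than the disk blocks that 'du -s' reports, so the numbers will differ slightly.

```python
import os

# Thresholds from the quoted message: 10 KB marks a store as behind,
# 100 KB (roughly 20 minutes of log) triggers an email alert.
WARN_BYTES = 10 * 1024
ALERT_BYTES = 100 * 1024


def sync_dir_bytes(sync_dir):
    """Total apparent size in bytes of the files under sync_dir."""
    total = 0
    for root, _dirs, files in os.walk(sync_dir):
        for name in files:
            try:
                total += os.path.getsize(os.path.join(root, name))
            except OSError:
                # A log file can be renamed or removed while we walk.
                pass
    return total


def classify_backlog(size_bytes):
    """Map a backlog size to 'ok', 'behind' or 'alert'."""
    if size_bytes > ALERT_BYTES:
        return "alert"
    if size_bytes > WARN_BYTES:
        return "behind"
    return "ok"
```

Run every couple of minutes, the result of classify_backlog(sync_dir_bytes(path)) could be stored or mailed; where the sync directory lives depends on the installation and is not assumed here.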

  And what do you do if it gets behind?  I have three Cyrus groups right now that are never going to catch up.  They log about 20KB in 20 minutes, so the update rate is not that high.  The machines are dedicated, and the replicas aren't doing anything.  tcpdump confirms that there is traffic to the replica, but sync_client as a whole is so opaque that it is hard to see what it is doing.  So sync_client can't keep up at all, and since it also quits from time to time, it falls even further behind.

  I'm planning to hack the log, and add some logging to sync_client, particularly to find the number of records per second it is able to process.  And then maybe some way to find out why it quits all the time.

  Either that, or my only alternative is to switch to using DRBD to sync the filesystem to a standby server.

> * a "monitorsync" process that runs from cron every 10 minutes and reads
>   the contents of the sync directory, comparing any log-(\d+) file's PID
>   with running processes to ensure it's actually being run and tries to
>   sync_client -r -f the file if there isn't one.  It also checks that
>   there is a running sync_client -r process (no -f) for the store.
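The orphan check in that "monitorsync" description could look roughly like the Python sketch below.  It is not the actual script: the function names are hypothetical, it only handles the log-<pid> naming mentioned above, and it cannot tell whether a live PID has been reused by an unrelated process.

```python
import os
import re

# File names of the form log-<pid>, as described in the quoted message.
LOG_PID_RE = re.compile(r"log-(\d+)")


def pid_is_running(pid):
    """Return True if a process with this PID exists (POSIX only)."""
    try:
        os.kill(pid, 0)  # signal 0 only checks for existence
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def orphaned_logs(sync_dir, is_running=pid_is_running):
    """Return sorted log-<pid> names whose PID has no live process."""
    orphans = []
    for name in sorted(os.listdir(sync_dir)):
        match = LOG_PID_RE.fullmatch(name)
        if match and not is_running(int(match.group(1))):
            orphans.append(name)
    return orphans
```

Each name returned would then be a candidate for replaying with sync_client -r -f, as the quoted message describes.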

  Wow, that is a lot of machinery to guard against sync_client just exiting.  sync_client isn't very big, so shouldn't it be possible to find the different places where it exits, and fix them?

> * a weekly "checkreplication" script which logs in as each user to both
>   the master and replica via IMAP and does a bunch of lists, examines,
>   even fetches and compares the output to ensure they are identical.
> 
> Between all that, I'm pretty comfortable that replication is correct and
> we'll be told if it isn't.  It's certainly helped us find our share
> of issues with the replication system!
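The comparison step of a "checkreplication"-style check can be sketched in Python as below.  This is an illustration only: it assumes the per-mailbox state (for example UIDVALIDITY plus the list of UIDs) has already been collected over IMAP from each server into a dict, and the function name and message wording are invented here.

```python
def diff_snapshots(master, replica):
    """Compare two {mailbox_name: state} dicts and describe mismatches.

    'state' is any comparable value, e.g. a (uidvalidity, uids) tuple.
    """
    problems = []
    for name in sorted(set(master) | set(replica)):
        if name not in replica:
            problems.append(f"{name}: missing on replica")
        elif name not in master:
            problems.append(f"{name}: missing on master")
        elif master[name] != replica[name]:
            problems.append(f"{name}: state differs")
    return problems
```

An empty list means the two snapshots agree for that user; anything else is worth a closer look.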

  Well, I know our replicas are out of sync, so we just don't use them.  I just hope the masters don't fail.  Each pair has about 30,000 accounts, and about 300GB of online mail.

  And the multiple exit points in sync_client suggest that there are still significant bugs in it.  And since restarting sync_client on the same sync log appears to work, there is something rather wrong there.

Tom