Recovering from a broken master...

Nic Bernstein nic at onlight.com
Wed Aug 6 16:03:19 EDT 2014


Friends,
We've got a simple Murder deployed: 2 front-ends, 1 mupdate-master, 1
backend and 1 replica.  Recently, due to an array malfunction, the
backend master took a powder, and we switched to the replica.  Now
we're trying to recover the original master, and are running into lots
of problems getting data to sync back.

This is all with version 2.4.17-caldav-beta9, from Debian packages, on
Ubuntu 14.04 servers.  For the record, the servers are KVM/QEMU VMs,
though I doubt that matters at all.

We've got the roles reversed just fine with changes to the various
cyrus.conf and imapd.conf files, and are not worried about that being a
problem.  Everything is working fine as far as
authentication/authorization, etc.  It's just the replication that's fubar.

We're seeing this sort of error in the logs on the (new) master side:
    ...
    Aug  6 18:21:28 mailbox.ia cyrus/sync_client[27000]:   Promoting: MAILBOX user.connie.yadda -> USER connie
    Aug  6 18:21:28 mailbox.ia cyrus/sync_client[27000]:   Promoting: MAILBOX user.elly.Junk -> USER elly
    Aug  6 18:21:28 mailbox.ia cyrus/sync_client[27000]: Error in do_sync(): bailing out! Bad protocol
    Aug  6 18:21:28 mailbox.ia cyrus/sync_client[27000]: Processing sync log file /var/lib/imap/sync/log-27000 failed: Bad protocol
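
To see which users were in a batch that bailed out, the "Promoting"
entries can be pulled out of the log.  What follows is only a rough
sketch in Python: the function name promoted_users is made up for
illustration, and the pattern assumes the exact "Promoting: MAILBOX ...
-> USER ..." wording shown above.  It tolerates entries that the mailer
has wrapped across two lines.

```python
import re

# Assumption: entries look exactly like the sync_client lines quoted above.
# \s+ also matches a newline, so wrapped entries are still recognized.
PROMOTE_RE = re.compile(r"Promoting:\s+MAILBOX\s+(\S+)\s+->\s+USER\s+(\S+)")


def promoted_users(log_text):
    """Return the user names from 'Promoting' entries, in first-seen order."""
    seen = []
    for _mailbox, user in PROMOTE_RE.findall(log_text):
        if user not in seen:
            seen.append(user)
    return seen


sample = """\
Aug  6 18:21:28 mailbox.ia cyrus/sync_client[27000]:   Promoting:
MAILBOX user.connie.yadda -> USER connie
Aug  6 18:21:28 mailbox.ia cyrus/sync_client[27000]:   Promoting:
MAILBOX user.elly.Junk -> USER elly
"""

print(promoted_users(sample))
```

Being listed here does not mean a user's mailbox is the one at fault,
only that it was part of the batch being processed when the run failed.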

And this on the (new) replica side:
    Aug  6 18:20:37 mailbox.wi cyrus/syncserver[13158]: executed
    Aug  6 18:20:37 mailbox.wi cyrus/syncserver[13158]: accepted connection
    Aug  6 18:20:37 mailbox.wi cyrus/syncserver[13158]: cmdloop(): startup
    Aug  6 18:20:37 mailbox.wi cyrus/syncserver[13158]: login: mailbox.ia.occinc.com [192.168.220.24] mailproxy PLAIN User logged in
    Aug  6 18:20:37 mailbox.wi cyrus/syncserver[13158]: created decompress buffer of 4102 bytes
    Aug  6 18:20:37 mailbox.wi cyrus/syncserver[13158]: created compress buffer of 4102 bytes
    Aug  6 18:20:59 mailbox.wi cyrus/syncserver[13158]: Repacking mailbox user.ndlocate
    Aug  6 18:21:05 mailbox.wi master[11811]: service syncserver pid 13158 in BUSY state: terminated abnormally

In some cases we've seen problems we believe are due to issues with a
particular user's mailbox, and have fixed those by blowing away the
user's mailbox hierarchy on the replica, rsync-ing it back over from the
master, and then doing a user-sync.  But there are hundreds of users, so
that's not a practical general solution. 

The mailstore is currently about 130 GB in size, and the master and
replica are in different data centers, with only about 3 or 4 Mbps
available between them (depending upon time of day).  This is fine in
the normal course of rolling replication, but makes simply
re-replicating the entire thing a major pain, if that's the only option.
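
For a rough sense of scale, here is a back-of-the-envelope estimate in
Python.  It assumes 130 GB means 130 x 10^9 bytes, that the link is
fully available for the whole transfer, and that there is no protocol
overhead or compression; the function name transfer_days is just for
illustration.

```python
def transfer_days(size_gb, link_mbps):
    """Days needed to move size_gb gigabytes over a link_mbps megabit/s link."""
    bits = size_gb * 1e9 * 8               # decimal gigabytes to bits
    seconds = bits / (link_mbps * 1e6)     # megabits per second to bits per second
    return seconds / 86400                 # seconds per day


for mbps in (3, 4):
    print(f"{mbps} Mbps: {transfer_days(130, mbps):.1f} days")
```

Under those assumptions a full copy works out to roughly three to four
days of saturating the link, and longer in practice.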

So, what's causing this problem, and what's the best course of action to
recover from this sort of situation?

Thanks in advance for your consideration,
    -nic

-- 
Nic Bernstein                             nic at onlight.com
Onlight, Inc.                             www.onlight.com
219 N. Milwaukee St., Suite 2a            v. 414.272.4477
Milwaukee, Wisconsin  53202


