Recovering from a broken master...
Nic Bernstein
nic at onlight.com
Wed Aug 6 16:03:19 EDT 2014
Friends,
We've got a simple Murder deployed: 2 front-ends, 1 mupdate-master, 1
back-end and 1 replica. Recently, due to an array malfunction, the
back-end master took a powder, and we switched to the replica. Now
we're trying to recover the original master, and running into lots of
problems getting data to sync back.
This is all with version 2.4.17-caldav-beta9, from Debian packages, on
Ubuntu 14.04 servers. For the record, the servers are KVM/QEMU VMs, though
I doubt that matters at all.
We've got the roles reversed just fine with changes to the various
cyrus.conf and imapd.conf files, and are not worried about that being a
problem. Everything is working fine as far as
authentication/authorization, etc. It's just the replication that's fubar.
We're seeing this sort of error in the logs on the (new) master side:
...
Aug 6 18:21:28 mailbox.ia cyrus/sync_client[27000]: Promoting: MAILBOX user.connie.yadda -> USER connie
Aug 6 18:21:28 mailbox.ia cyrus/sync_client[27000]: Promoting: MAILBOX user.elly.Junk -> USER elly
Aug 6 18:21:28 mailbox.ia cyrus/sync_client[27000]: Error in do_sync(): bailing out! Bad protocol
Aug 6 18:21:28 mailbox.ia cyrus/sync_client[27000]: Processing sync log file /var/lib/imap/sync/log-27000 failed: Bad protocol
And this on the (new) replica side:
Aug 6 18:20:37 mailbox.wi cyrus/syncserver[13158]: executed
Aug 6 18:20:37 mailbox.wi cyrus/syncserver[13158]: accepted connection
Aug 6 18:20:37 mailbox.wi cyrus/syncserver[13158]: cmdloop(): startup
Aug 6 18:20:37 mailbox.wi cyrus/syncserver[13158]: login: mailbox.ia.occinc.com [192.168.220.24] mailproxy PLAIN User logged in
Aug 6 18:20:37 mailbox.wi cyrus/syncserver[13158]: created decompress buffer of 4102 bytes
Aug 6 18:20:37 mailbox.wi cyrus/syncserver[13158]: created compress buffer of 4102 bytes
Aug 6 18:20:59 mailbox.wi cyrus/syncserver[13158]: Repacking mailbox user.ndlocate
Aug 6 18:21:05 mailbox.wi master[11811]: service syncserver pid 13158 in BUSY state: terminated abnormally
In some cases we've seen problems we believe are due to issues with a
particular user's mailbox, and have fixed those by blowing away the
user's mailbox hierarchy on the replica, rsync-ing it back over from the
master, and then doing a user-sync. But there are hundreds of users, so
that's not a practical general solution.
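One way to at least narrow down which users might need that treatment: a small Python sketch that scans the master's syslog for sync_client runs that bailed out, and reports the users that run was promoting to USER-level syncs beforehand. It assumes the syslog has one entry per line (not wrapped as in the excerpt above) and that the message text matches the lines quoted here. The function name is made up for illustration. Treating "promoted shortly before a bail-out" as "suspect" is a heuristic guess, not a diagnosis.

```python
import re
from collections import defaultdict

# Matches the sync_client "Promoting" lines quoted above, e.g.
# "cyrus/sync_client[27000]: Promoting: MAILBOX user.elly.Junk -> USER elly"
PROMOTE_RE = re.compile(
    r"cyrus/sync_client\[(\d+)\]: Promoting: MAILBOX (\S+) -> USER (\S+)"
)
# Matches the bail-out line logged when do_sync() gives up.
BAILOUT_RE = re.compile(
    r"cyrus/sync_client\[(\d+)\]: Error in do_sync\(\): bailing out!"
)


def users_in_failed_runs(lines):
    """Return sorted users promoted by any sync_client pid that later bailed out."""
    promoted = defaultdict(set)  # pid -> users that run promoted
    failed = set()               # pids that logged a bail-out
    for line in lines:
        m = PROMOTE_RE.search(line)
        if m:
            promoted[m.group(1)].add(m.group(3))
            continue
        m = BAILOUT_RE.search(line)
        if m:
            failed.add(m.group(1))
    return sorted({user for pid in failed for user in promoted[pid]})


if __name__ == "__main__":
    import sys
    # Usage (path is an example): python suspects.py < /var/log/mail.log
    for user in users_in_failed_runs(sys.stdin):
        print(user)
```

Fed the master-side excerpt above, this would report connie and elly. That list could then drive the per-user cleanup instead of touching all several hundred accounts.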
The mailstore is currently about 130 GB in size, and the master and
replica are in different data centers, with only about 3 to 4 Mbps
available between them (depending upon time of day). This is fine for
normal rolling replication, but makes simply re-replicating the entire
thing a major pain, if that's the only option.
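For a sense of scale, here is a back-of-the-envelope estimate of a full re-replication over that link, as a short Python sketch. It assumes decimal units (1 GB = 1000 MB), a link saturated the whole time, and no protocol overhead or compression. The sync protocol does compress (see the compress buffers in the replica log), so real numbers could differ in either direction.

```python
def transfer_hours(size_gb: float, link_mbps: float) -> float:
    """Hours to move size_gb decimal gigabytes over a link_mbps megabit/s link,
    assuming full link utilisation and ignoring overhead and compression."""
    megabits = size_gb * 1000 * 8  # GB -> megabits
    return megabits / link_mbps / 3600


if __name__ == "__main__":
    for mbps in (3, 4):
        print(f"{mbps} Mbps: {transfer_hours(130, mbps):.1f} h")
```

That works out to roughly 72 hours at 4 Mbps and 96 hours at 3 Mbps, i.e. three to four days of a saturated link, hence the reluctance.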
So, what's causing this problem, and what's the best course of action to
recover from this sort of situation?
Thanks in advance for your consideration,
-nic
--
Nic Bernstein nic at onlight.com
Onlight, Inc. www.onlight.com
219 N. Milwaukee St., Suite 2a v. 414.272.4477
Milwaukee, Wisconsin 53202
More information about the Info-cyrus
mailing list