Recovering from a broken master...

Nic Bernstein nic at onlight.com
Mon Aug 11 10:11:52 EDT 2014


Wesley,
Thanks for your response.  This is precisely what we ended up doing.
We've got a Perl script which walks LDAP for a user list and runs
"sync_client -u <user>" for each account, trapping errors.  That gave us
a list of accounts to reconstruct.  In a couple of cases even that
didn't remedy the situation, and for those we resorted to rsync followed
by sync_client cleanups.
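
For the archives, here's a rough sketch of that approach.  This is not
our exact script, and the LDAP host, base DN, filter and attribute
below are placeholders you'd adjust for your own directory:

    #!/usr/bin/perl
    # Walk LDAP for a user list, run sync_client for each account,
    # and trap failures; the failed users are reconstruct candidates.
    use strict;
    use warnings;
    use Net::LDAP;

    my $ldap = Net::LDAP->new('ldap.example.com') or die "$@";
    $ldap->bind;    # anonymous bind; adjust for your setup

    my $result = $ldap->search(
        base   => 'ou=people,dc=example,dc=com',
        filter => '(objectClass=inetOrgPerson)',
        attrs  => ['uid'],
    );
    $result->code and die $result->error;

    my @failed;
    for my $entry ($result->entries) {
        my $user = $entry->get_value('uid');
        system('sync_client', '-u', $user);
        push @failed, $user if $? != 0;    # non-zero exit == trouble
    }

    # Each user on this list gets a "reconstruct -r -f user.<name>"
    # on the master, then another "sync_client -u" pass.
    print "$_\n" for @failed;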

Thanks again.  Hopefully this message, with its subject line, will help
future unfortunate users grepping the mailing list archives for ideas.

Cheers,
    -nic

On 08/11/2014 08:46 AM, Wesley Craig wrote:
> So, sync server is crashing on the backend you're attempting to replicate back to.  Probably the cyrus meta files were corrupted for mailboxes which were actively being written to when you had the array malfunction.  To recover, I'd probably run sync client on each individual user to find which users are corrupted.  Armed with the list, I'd reconstruct those users and try again.
>
> Ideally, you'd get crash reports that you could forward along, since cyrus really ought to be armored against this kind of corruption.  After all, why else would you have failed over?
>
> :wes
>
> On 06 Aug 2014, at 16:03, Nic Bernstein <nic at onlight.com> wrote:
>
>> Friends,
>> We've got a simple Murder deployed, 2 front-ends, 1 mupdate-master, 1
>> backend and 1 replica.  Recently, due to an array malfunction, the
>> back-end master took a powder, and we switched to the replica.  Now
>> we're trying to recover the original master, and running into lots of
>> problems getting data to sync back.
>>
>> This is all with version 2.4.17-caldav-beta9, from Debian packages, on
>> Ubuntu 14.04 servers.  For the record, the servers are KVM/QEMU VMs,
>> though I doubt that matters at all.
>>
>> We've got the roles reversed just fine with changes to the various
>> cyrus.conf and imapd.conf files, and are not worried about that being a
>> problem.  Everything is working fine as far as
>> authentication/authorization, etc.  It's just the replication that's fubar.
>>
>> We're seeing this sort of error in the logs on the (new) master side:
>>    ...
>>    Aug  6 18:21:28 mailbox.ia cyrus/sync_client[27000]:   Promoting: MAILBOX user.connie.yadda -> USER connie
>>    Aug  6 18:21:28 mailbox.ia cyrus/sync_client[27000]:   Promoting: MAILBOX user.elly.Junk -> USER elly
>>    Aug  6 18:21:28 mailbox.ia cyrus/sync_client[27000]: Error in do_sync(): bailing out! Bad protocol
>>    Aug  6 18:21:28 mailbox.ia cyrus/sync_client[27000]: Processing sync log file /var/lib/imap/sync/log-27000 failed: Bad protocol
>>
>> And this on the (new) replica side:
>>    Aug  6 18:20:37 mailbox.wi cyrus/syncserver[13158]: executed
>>    Aug  6 18:20:37 mailbox.wi cyrus/syncserver[13158]: accepted connection
>>    Aug  6 18:20:37 mailbox.wi cyrus/syncserver[13158]: cmdloop(): startup
>>    Aug  6 18:20:37 mailbox.wi cyrus/syncserver[13158]: login: mailbox.ia.occinc.com [192.168.220.24] mailproxy PLAIN User logged in
>>    Aug  6 18:20:37 mailbox.wi cyrus/syncserver[13158]: created decompress buffer of 4102 bytes
>>    Aug  6 18:20:37 mailbox.wi cyrus/syncserver[13158]: created compress buffer of 4102 bytes
>>    Aug  6 18:20:59 mailbox.wi cyrus/syncserver[13158]: Repacking mailbox user.ndlocate
>>    Aug  6 18:21:05 mailbox.wi master[11811]: service syncserver pid 13158 in BUSY state: terminated abnormally
>>
>> In some cases we've seen problems we believe are due to issues with a
>> particular user's mailbox, and have fixed those by blowing away the
>> user's mailbox hierarchy on the replica, rsync-ing it back over from the
>> master, and then doing a user-sync.  But there are hundreds of users, so
>> that's not a practical general solution. 
>>
>> The mailstore is currently about 130GB in size, and the master and
>> replica are in different data centers, with only about 3-4 Mbps
>> available between them (depending upon time of day).  This is fine in
>> the normal course of rolling replication, but it makes simply
>> re-replicating the entire thing a major pain, if that's the only option.
>>
>> So, what's causing this problem, and what's the best course of action to
>> recover from this sort of situation?
>>
>> Thanks in advance for your consideration,
>>    -nic
>>

-- 
Nic Bernstein                             nic at onlight.com
Onlight, Inc.                             www.onlight.com
219 N. Milwaukee St., Suite 2a            v. 414.272.4477
Milwaukee, Wisconsin  53202


