Basic two host replication scenario, SSL failure

Bron Gondwana brong at fastmail.fm
Sun Jul 10 04:07:48 EDT 2011



On Sun, 10 Jul 2011 10:41 +0300, "Ivan Lezhnjov Jr." <ivan.lezhnjov.jr at gmail.com> wrote:
> First issue, and jmeeuwen in particular believes it to be a bug, but let
> project developers be the judge of this, arises when R host goes down
> while M is still available trying to push changes to R which has become
> unavailable. What happens, essentially, is that when R becomes available
> again M fails to push changes unless it [the M] is restarted.

Two things here - one is that without keepalives, the sync_client can
be stuck waiting "forever" for the replica.

The second is that sync log files can be left around and never actually
processed.  It's this second one that is more of a problem.  We have a
program called "monitorsync" at FastMail that watches replication.  Other
big sites do similar tricks.

I would love to replace monitorsync with better logic in sync_client
itself, but have not yet got to it!
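To give a rough idea of the kind of check involved, here is a minimal sketch in Python. It is not monitorsync itself, just an illustration under assumptions: the function name and the ten minute threshold are made up, and it assumes leftover files follow the log-<pid> naming in the sync directory (as in /var/lib/imap/sync/log-12616 from the log output quoted further down).

```python
import os
import time


def find_stale_sync_logs(sync_dir, max_age_seconds=600, now=None):
    """Return (path, age_seconds) for each log-* file older than the threshold.

    Hypothetical helper: the "log-" prefix and the 600 second default are
    assumptions made for illustration, not anything Cyrus defines.
    """
    now = time.time() if now is None else now
    stale = []
    for name in sorted(os.listdir(sync_dir)):
        if not name.startswith("log-"):
            continue
        path = os.path.join(sync_dir, name)
        try:
            age = now - os.stat(path).st_mtime
        except FileNotFoundError:
            # The file was processed and removed between listdir() and stat().
            continue
        if age > max_age_seconds:
            stale.append((path, age))
    return stale
```

Something like find_stale_sync_logs("/var/lib/imap/sync") run from cron would then flag log files that nobody is working on; what to do with them (alert, or feed them back to sync_client) is a separate decision.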
 
> This is not entirely critical for me but it's quite annoying and
> somewhat unexpected.

Agreed.

> Another thing I do is emulate a situation where M becomes permanently
> unavailable while R remains available. I switch R's role to M's by
> restarting the service with new set of configuration files then. It
> continues to serve and accept incoming mails just fine. Then, when M has
> been fixed after an imaginary failure and becomes available again, I
> switch M's role to role of R, and try to push changes from R turned M to
> the M turned R. 
> 
> This doesn't work. Is this even supposed to work? No one really answered
> this question directly on IRC. So, please, tell me whether what I'm
> doing here makes any sense at all.

Let's call these machines "host A" and "host B".  In the initial config,
A was master, B was replica.  A becomes unavailable, and you switch B
over to being the master.

So how exactly is it not working?  Are you configuring B with:

1) sync_log enabled so change events are logged?
2) sync_host configured to point at A?
3) a sync_client invocation in cyrus.conf?

If you have, then it should replicate just fine once A comes back up.
You may need to start the rolling sync_client by hand, or restart
Cyrus on B, but we do this _all_the_time_ at FastMail, and it works
perfectly well.
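The three checks above can be sketched as a small Python sanity check over the text of the two config files. This is only an illustration: the function names are hypothetical, the list of accepted boolean spellings is an assumption, and the cyrus.conf check merely looks for a cmd="..." entry that mentions sync_client on a non-comment line.

```python
import re

# Assumed spellings for an enabled boolean option in imapd.conf.
TRUTHY = {"1", "yes", "on", "true", "t"}


def parse_imapd_conf(text):
    """Parse 'option: value' lines, skipping blank lines and '#' comments."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or ":" not in line:
            continue
        key, value = line.split(":", 1)
        options[key.strip()] = value.strip()
    return options


def check_failover_config(imapd_conf, cyrus_conf, old_master):
    """Return a list of problems; an empty list means all three checks pass."""
    problems = []
    opts = parse_imapd_conf(imapd_conf)
    if opts.get("sync_log", "").lower() not in TRUTHY:
        problems.append("sync_log is not enabled")
    if opts.get("sync_host") != old_master:
        problems.append(f"sync_host does not point at {old_master}")
    active = [l for l in cyrus_conf.splitlines()
              if not l.lstrip().startswith("#")]
    if not any(re.search(r'cmd="[^"]*\bsync_client\b', l) for l in active):
        problems.append("no sync_client entry in cyrus.conf")
    return problems


# Example: B's config after taking over, with A (here "hostA") as the target.
imapd = "sync_log: 1\nsync_host: hostA\n"
cyrus = 'START {\n  syncclient cmd="sync_client -r"\n}\n'
print(check_failover_config(imapd, cyrus, "hostA"))
```

The host name "hostA" and the syncclient entry in the example are placeholders; substitute whatever your own imapd.conf and cyrus.conf use.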

> So, I tried then to revert hosts to their original roles. The problem
> then was that R had new messages that M never received while it had been
> down. M would accept and store new incoming mail, but it would fail to
> sync with R henceforth.

Yeah, of course.  You're doing it wrong[tm].  In theory the sync system
can recover from an accidental split brain like this, but it's not
ideal.

> This is what M output to log files (debug output turned on):
> Jul  8 19:51:29 imapsite-master syncserver[12392]: EOF in SSL_accept()
> -> fail
> Jul  8 19:51:29 imapsite-master syncserver[12392]: STARTTLS failed:
> imapsite-replica [10.10.0.188]
> Jul  8 19:51:29 imapsite-master syncserver[12392]: telling master 1
> Jul  8 19:51:29 imapsite-master syncserver[12392]: SSL_accept()
> incomplete -> wait
> Jul  8 19:52:27 imapsite-master syncserver[12392]: EOF in SSL_accept()
> -> fail
> Jul  8 19:52:27 imapsite-master syncserver[12392]: STARTTLS failed:
> imapsite-replica [10.10.0.188]
> Jul  8 19:52:27 imapsite-master syncserver[12392]: telling master 1
> Jul  8 19:52:27 imapsite-master syncserver[12392]: telling master 2
> Jul  8 19:52:27 imapsite-master syncserver[12392]: accepted connection
> Jul  8 19:52:27 imapsite-master syncserver[12392]: telling master 3
> Jul  8 19:52:27 imapsite-master syncserver[12392]: cmdloop(): startup
> Jul  8 19:52:28 imapsite-master syncserver[12392]: SSL_accept()
> incomplete -> wait
> 
> Jul  8 19:53:56 imapsite-master sync_client[12616]: MAILBOX received NO
> response: IMAP_MAILBOX_CRC Checksum Failure
> Jul  8 19:53:56 imapsite-master sync_client[12616]: CRC failure on sync
> for user.zxy, trying full update
> Jul  8 19:53:56 imapsite-master sync_client[12616]: SYNCERROR: guid
> mismatch user.zxy 2 (9e19b5a1e93d3b3e7936044543b444ea00164649 b
> b03edc18eaf8d2115510052dff24c65d894b107)
> Jul  8 19:53:56 imapsite-master sync_client[12616]: SYNCERROR: guid
> mismatch user.zxy 2 (9e19b5a1e93d3b3e7936044543b444ea00164649 b
> b03edc18eaf8d2115510052dff24c65d894b107)
> Jul  8 19:53:56 imapsite-master sync_client[12616]: user.zxy: same
> message appears twice 2 3
> Jul  8 19:53:56 imapsite-master sync_client[12616]: Unlinking files in
> mailbox user.zxy
> Jul  8 19:53:56 imapsite-master sync_client[12616]: do_folders(): update
> failed: user.zxy 'Bad protocol'
> Jul  8 19:53:56 imapsite-master sync_client[12616]: Error in do_sync():
> bailing out! Bad protocol
> Jul  8 19:53:56 imapsite-master sync_client[12616]: Processing sync log
> file /var/lib/imap/sync/log-12616 failed: Bad protocol

Ouch, that's pretty horrible.  You've managed to corrupt your indexes.

You didn't happen to use rsync on the spools at any time, did you?
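To see how many mailboxes are affected, the errors quoted above can be tallied per mailbox. A minimal sketch in Python, assuming each syslog entry is on a single line (the quoted output is wrapped by the mail client, so it would need rejoining first) and matching only the two message shapes visible in that output; the function name and category labels are made up for illustration.

```python
import re
from collections import Counter

# Patterns taken from the two sync_client messages quoted in this thread.
PATTERNS = {
    "crc_failure": re.compile(r"CRC failure on sync for (\S+?),"),
    "guid_mismatch": re.compile(r"SYNCERROR: guid mismatch (\S+)"),
}


def summarize_sync_errors(lines):
    """Count (kind, mailbox) pairs reported by sync_client in syslog lines."""
    counts = Counter()
    for line in lines:
        if "sync_client[" not in line:
            continue  # ignore syncserver and unrelated daemons
        for kind, pattern in PATTERNS.items():
            match = pattern.search(line)
            if match:
                counts[(kind, match.group(1))] += 1
    return counts
```

Feeding it the lines of your mail log gives a quick list of mailboxes that are candidates for a reconstruct.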

> Effectively, the whole system becomes broken and only R turned M can
> continue to accept new incoming mail and serve previously stored mail.
> The replication becomes broken, though.
> 
> My end goal is to have two hosts that will be able to replace one
> another in case of a failure of one of them. That means I expect R to
> become M, and when real M becomes available again push changes from real
> R to M and continue running the service as if nothing happened. If then
> real R becomes unavailable due to, say, disk controller failure, I would
> revert M turned R to its original role and continue running the service.
> So, both hosts would be mutually complementing in their ability to
> switch between the roles of M and R, and sync up the other host.

Yep, we do that.  It works fine so long as you always fail over with sync
up to date, or at least don't rsync files underneath it!

> I also tried to duplicate /var/lib/imap and /var/spool/imap with rsync
> in the scenario described above. For the sake of clarity a brief
> reminder follows. 
> 
> - M pushes changes to R.
> - M goes down.
> - I turn R to M. 
> - It receives new incoming messages. 
> - M becomes available again. 
> - I run rsync to copy /var/lib/imap and /var/spool/imap off R to replace
> the corresponding directories on M.
> - I stop R turned M.
> - Change its configuration files so that it becomes R again.
> - Start the real R
> - Start the real M with rsync'ed /var/{lib/imap,spool/imap}
> 
> Replication doesn't work now. The question is can it work after doing
> this?

You've got your rsynced spool and meta out of sync.  You will need to run
a full reconstruct -G to fix this, which will replace the incorrect metadata
with what's now in the spool.

> So, that's all I have to say perhaps. I would really appreciate any help
> with this. This seems like a basic, trivial scenario to me but I just
> can't seem to get cyrus-imap working right.

It's not as trivial as it should be yet - and you can mess yourself up
particularly if you go rsyncing stuff between machines!  If you have one
host which is "correct" (host B in this case) I recommend that you do a
full reconstruct -r -G on it, and then discard the replica and restart
replication from scratch.

Bron.
-- 
  Bron Gondwana
  brong at fastmail.fm



More information about the Info-cyrus mailing list