Basic two host replication scenario, SSL failure

Mon Jul 11 04:40:25 EDT 2011

On Sun, Jul 10, 2011 at 11:07 AM, Bron Gondwana <brong at fastmail.fm> wrote:
>
>
> On Sun, 10 Jul 2011 10:41 +0300, "Ivan Lezhnjov Jr." <ivan.lezhnjov.jr at gmail.com> wrote:
> > First issue, and jmeeuwen in particular believes it to be a bug, but let
> > project developers be the judge of this, arises when R host goes down
> > while M is still available trying to push changes to R which has become
> > unavailable. What happens, essentially, is that when R becomes available
> > again M fails to push changes unless it [the M] is restarted.
>
> Two things here - one is that without keepalives, the sync_client can
> be stuck waiting "forever" for the replica.

I didn't have tcp_keepalive: 1 set. I'll note whether it changes anything.

> The second is that sync log files can be left around and never actually
> processed.  It's this second one that is more of a problem.  We have a
> program called "monitorsync" at FastMail that watches replication.  Other
> big sites do similar tricks.
>
> I would love to replace monitorsync with better logic in sync_client
> itself, but have not yet got to it!

So, what does "monitorsync" essentially do with those log files?
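
I don't know what monitorsync really does, so the following is only a
guess at the simplest possible check, not FastMail's tool: list the
leftover log-<pid> files in the sync directory that nothing has touched
for a while. The directory /var/lib/imap/sync and the log-<pid> naming
come from the sync_client error quoted further down in this message; the
function name stale_sync_logs and the 10 minute threshold are made up
for illustration.

```python
import os
import time

# Path taken from the sync_client log line in this thread; adjust for your setup.
SYNC_DIR = "/var/lib/imap/sync"


def stale_sync_logs(sync_dir, max_age_seconds=600, now=None):
    """Return sorted paths of log-<pid> files not modified within max_age_seconds.

    Hypothetical helper: it only reports files, it never processes or deletes them.
    """
    now = time.time() if now is None else now
    stale = []
    for name in os.listdir(sync_dir):
        # Only consider leftover per-process files such as log-12616.
        if not name.startswith("log-"):
            continue
        path = os.path.join(sync_dir, name)
        if os.path.isfile(path) and now - os.path.getmtime(path) > max_age_seconds:
            stale.append(path)
    return sorted(stale)


if __name__ == "__main__" and os.path.isdir(SYNC_DIR):
    for path in stale_sync_logs(SYNC_DIR):
        print("stale sync log:", path)
```

What to do with a stale file once it is found (re-run sync_client on it,
alert someone, and so on) is exactly the part I'm asking about.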

> > Another thing I do is emulate a situation where M becomes permanently
> > unavailable while R remains available. I switch R's role to M's by
> > restarting the service with new set of configuration files then. It
> > continues to serve and accept incoming mails just fine. Then, when M has
> > been fixed after an imaginary failure and becomes available again, I
> > switch M's role to role of R, and try to push changes from R turned M to
> > the M turned R.
> >
> > This doesn't work. Is this even supposed to work? No one really answered
> > this question directly on IRC. So, please, tell me if that what I'm
> > doing here has any sense at all.
>
> Let's call these machines "host A" and "host B".  In the initial config,
> A was master, B was replica.  A becomes unavailable, and you switch B
> over to being the master.
>
> So tell me how it's not working?  Are you configuring B with:
>
> 1) sync_log enabled so change events are logged?
> 2) sync_host configured to point at A?
> 3) a sync_client invocation in cyrus.conf?
>
> If you have, then it should replicate just fine once A comes back up.
> You may need to start the rolling sync_client by hand, or restart
> Cyrus on B, but we do this _all_the_time_ at FastMail, and it works
> perfectly well.
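
The three checks above can be sketched as a small script, to rule out a
typo in the configuration. This is only an illustration under stated
assumptions: the function names are hypothetical, the set of values
treated as "enabled" for sync_log is my assumption, and the check for
sync_client is a plain substring search over the non-comment parts of
cyrus.conf, not a real parser.

```python
def parse_imapd_conf(text):
    """Parse 'option: value' lines, skipping blank lines and # comments."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition(":")
        if sep:
            options[key.strip()] = value.strip()
    return options


def missing_replication_settings(imapd_conf_text, cyrus_conf_text):
    """Return a list of problems; an empty list means all three checks passed."""
    problems = []
    options = parse_imapd_conf(imapd_conf_text)

    # 1) sync_log enabled (accepted truthy spellings are an assumption)
    if options.get("sync_log", "0").lower() not in ("1", "yes", "on", "true", "t"):
        problems.append("sync_log is not enabled")

    # 2) sync_host pointing at the other machine
    if not options.get("sync_host"):
        problems.append("sync_host is not set")

    # 3) a sync_client invocation somewhere outside comments in cyrus.conf
    active = "\n".join(l.split("#", 1)[0] for l in cyrus_conf_text.splitlines())
    if "sync_client" not in active:
        problems.append("no sync_client invocation in cyrus.conf")

    return problems


if __name__ == "__main__":
    # host-a.example.com is a placeholder, not a host from this thread.
    imapd = "sync_log: 1\nsync_host: host-a.example.com\n"
    cyrus = 'START {\n  syncclient cmd="sync_client -r"\n}\n'
    print(missing_replication_settings(imapd, cyrus))
```

An empty list only says the options are present; it says nothing about
whether authentication or TLS between the two hosts actually works.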

Well, I have all those options in imapd.conf and cyrus.conf and, believe
it or not, it doesn't work the way you describe. I started with a clean
setup this morning and repeated the described scenario, and B switched to
master won't replicate the changes to A switched to replica.
I ran /var/lib/cyrus/sync_client -r manually, and this is what I saw in
the log files every time I tried to run it:

============================== A switched to replica

Jul 11 11:00:20 imapsite-master master[13277]: about to exec /usr/lib/cyrus-imapd/sync_server
Jul 11 11:00:20 imapsite-master syncserver[13277]: executed
Jul 11 11:00:20 imapsite-master syncserver[13277]: accepted connection
Jul 11 11:00:20 imapsite-master syncserver[13277]: cmdloop(): startup
Jul 11 11:00:20 imapsite-master syncserver[13277]: imapd:Loading hard-coded DH parameters
Jul 11 11:00:21 imapsite-master syncserver[13277]: SSL_accept() incomplete -> wait

would eventually result in

Jul 11 11:21:14 imapsite-master syncserver[14019]: SSL_accept() timed out -> fail
Jul 11 11:21:14 imapsite-master syncserver[14019]: STARTTLS failed: imapsite-replica [10.10.0.188]

============================== B switched to master

Jul 11 11:33:45 imapsite-replica sync_client[29199]: couldn't authenticate to backend server: no mechanism available
Jul 11 11:33:45 imapsite-replica sync_client[29479]: couldn't authenticate to backend server: no mechanism available

>
> > So, I tried then to revert hosts to their original roles. The problem
> > then was that R had new messages that M never received while it had been
> > down. M would accept and store new incoming mail, but it would fail to
> > sync with R henceforth.
>
> Yeah, of course.  You're doing it wrong[tm].  In theory the sync system
> can recover from an accidental split brain like this, but it's not
> ideal.

I'd be happy to learn what exactly I'm doing wrong :)

>
> > This is what M output to log files (debug output turned on):
> > Jul  8 19:51:29 imapsite-master syncserver[12392]: EOF in SSL_accept()
> > -> fail
> > Jul  8 19:51:29 imapsite-master syncserver[12392]: STARTTLS failed:
> > imapsite-replica [10.10.0.188]
> > Jul  8 19:51:29 imapsite-master syncserver[12392]: telling master 1
> > Jul  8 19:51:29 imapsite-master syncserver[12392]: SSL_accept()
> > incomplete -> wait
> > Jul  8 19:52:27 imapsite-master syncserver[12392]: EOF in SSL_accept()
> > -> fail
> > Jul  8 19:52:27 imapsite-master syncserver[12392]: STARTTLS failed:
> > imapsite-replica [10.10.0.188]
> > Jul  8 19:52:27 imapsite-master syncserver[12392]: telling master 1
> > Jul  8 19:52:27 imapsite-master syncserver[12392]: telling master 2
> > Jul  8 19:52:27 imapsite-master syncserver[12392]: accepted connection
> > Jul  8 19:52:27 imapsite-master syncserver[12392]: telling master 3
> > Jul  8 19:52:27 imapsite-master syncserver[12392]: cmdloop(): startup
> > Jul  8 19:52:28 imapsite-master syncserver[12392]: SSL_accept()
> > incomplete -> wait
> >
> > Jul  8 19:53:56 imapsite-master sync_client[12616]: MAILBOX received NO
> > response: IMAP_MAILBOX_CRC Checksum Failure
> > Jul  8 19:53:56 imapsite-master sync_client[12616]: CRC failure on sync
> > for user.zxy, trying full update
> > Jul  8 19:53:56 imapsite-master sync_client[12616]: SYNCERROR: guid
> > mismatch user.zxy 2 (9e19b5a1e93d3b3e7936044543b444ea00164649 b
> > b03edc18eaf8d2115510052dff24c65d894b107)
> > Jul  8 19:53:56 imapsite-master sync_client[12616]: SYNCERROR: guid
> > mismatch user.zxy 2 (9e19b5a1e93d3b3e7936044543b444ea00164649 b
> > b03edc18eaf8d2115510052dff24c65d894b107)
> > Jul  8 19:53:56 imapsite-master sync_client[12616]: user.zxy: same
> > message appears twice 2 3
> > Jul  8 19:53:56 imapsite-master sync_client[12616]: Unlinking files in
> > mailbox user.zxy
> > Jul  8 19:53:56 imapsite-master sync_client[12616]: do_folders(): update
> > failed: user.zxy 'Bad protocol'
> > Jul  8 19:53:56 imapsite-master sync_client[12616]: Error in do_sync():
> > bailing out! Bad protocol
> > Jul  8 19:53:56 imapsite-master sync_client[12616]: Processing sync log
> > file /var/lib/imap/sync/log-12616 failed: Bad protocol
>
> Ouch, that's pretty horrible.  You've managed to corrupt your indexes.
>
> You didn't happen to use rsync on the spools at any time did you?

I'm not sure what you mean here, but what I did was, as I said before,
rsync /var/{lib/imap,spool/imap} from B switched to master to A
switched to replica.

>
> > Effectively, the whole system becomes broken and only R turned M can
> > continue to accept new incoming mail and serve previously stored mail.
> > The replication becomes broken, though.
> >
> > My end goal is to have two hosts that will be able to replace one
> > another in case of a failure of one of them. That means I expect R to
> > become M, and when real M becomes available again push changes from real
> > R to M and continue running the service as if nothing happened. If then
> > real R becomes unavailable due to, say, disk controller failure, I would
> > revert M turned R to its original role and continue running the service.
> > So, both hosts would be mutually complementing in their ability to
> > switch between the roles of M and R, and sync up the other host.
>
> Yep, we do that.  It works fine so long as you always fail over with sync
> up to date, or at least don't rsync files underneath it!

OK, rsync was just to try something different and experiment. But for
whatever reason, replication in the described scenario doesn't work in
my setup :|

>
> > I also tried to duplicate /var/lib/imap and /var/spool/imap with rsync
> > in the scenario described up above. For the sake of clarity a brief
> > reminder follows.
> >
> > - M pushes changes to R.
> > - M goes down.
> > - I turn R to M.
> > - It receives new incoming messages.
> > - M becomes available again.
> > - I run rsync to copy /var/lib/imap and /var/spool/imap off R to replace
> > the corresponding directories on M.
> > - I stop R turned M.
> > - Change its configuration files so that it becomes R again.
> > - Start the real R
> > - Start the real M with rsync'ed /var/{lib/imap,spool/imap}
> >
> > Replication doesn't work now. The question is can it work after doing
> > this?
>
> You've got your rsynced spool and meta out of sync.  You will need to run
> a full reconstruct -G to fix this, which will replace the incorrect metadata
> with what's now in the spool.

Thanks for the tip. Good to know ;)

>
> > So, that's all I have to say perhaps. I would really appreciate any help
> > with this. This seems like a basic, trivial scenario to me but I just
> > can't seem to get cyrus-imap working right.
>
> It's not as trivial as it should be yet - and you can mess yourself up
> particularly if you go rsyncing stuff between machines!  If you have one
> host which is "correct" (host B in this case) I recommend that you do a
> full reconstruct -r -G on it, and then discard the replica and restart
> replication from scratch.

I've also just tried to apply these tips. Namely, when B switched to
master failed to push changes to A switched to replica, I did the
following:
- stopped the service and then "discarded the replica" by removing
/var/{lib/imap,spool/imap}
- restarted the service with the replica role configuration (which is
correct, by the way)

Anything else I could try or check?

PS: sorry for the direct message to your inbox, Bron :)