replication stalled (in 2.3.8)

Paul Dekkers Paul.Dekkers at surfnet.nl
Thu Feb 15 10:46:38 EST 2007


Hi,

Yesterday morning I upgraded from 2.3.7(-11) to 2.3.8(-3) using
the RPM packages made by Simon. Everything seemed fine, but at the end
of the day the replication stalled, in a way I had not seen before
with 2.3.7 (and earlier), so I thought it might be worth reporting.
(Just to be sure: maybe someone recognizes it, maybe it rings a bell.)

I noticed it because our sync_client bailed out. We monitor this, and
the same script restarts the sync_client process right away. Afterwards
I always check whether everything is indeed back to normal.

Normally the synchronization continues as usual after that restart, but
this time I also needed to restart the sync_server on the replica (I just
restarted the cyrus-master on the replica, actually). Before the
restart, it seemed that the syncserver never reached an "unlocked" state.
Just an example from the logs of the replica that afternoon:

Feb 14 17:17:07 rogge master[7033]: about to exec /usr/lib/cyrus-imapd/sync_server
Feb 14 17:17:07 rogge syncserver[7033]: executed
Feb 14 17:17:07 rogge syncserver[7033]: accepted connection
Feb 14 17:17:07 rogge syncserver[7033]: cmdloop(): startup
Feb 14 17:17:07 rogge syncserver[7033]: login: tarwe-ng.surfnet.nl [192.87.109.23] cyrus DIGEST-MD5 User logged in

... and nothing further (until another sync_client tried to connect, at
which point the same sequence was repeated). I was not even able to do a
manual synchronization: there was no (debug/verbose) output from the
(client) process, and nothing happened.

Normally the unlock does happen (this one is after the restart of the
processes on rogge, our replica):

Feb 14 18:19:18 rogge syncserver[7344]: accepted connection
Feb 14 18:19:18 rogge syncserver[7385]: executed
Feb 14 18:19:18 rogge syncserver[7344]: cmdloop(): startup
Feb 14 18:19:18 rogge syncserver[7344]: login: tarwe-ng.surfnet.nl [192.87.109.23] cyrus DIGEST-MD5 User logged in
Feb 14 18:19:20 rogge syncserver[7344]: Unlocked

And after that I do indeed get updates, like:

Feb 14 18:19:23 rogge syncserver[7344]: seen_db: user paul opened /data/config/imap/user/p/paul.seen
Feb 14 18:19:23 rogge syncserver[7344]: Unlocked
Feb 14 18:19:25 rogge syncserver[7344]: Unlocked
Feb 14 18:19:27 rogge syncserver[7344]: seen_db: user luuk opened /data/config/imap/user/l/luuk.seen
Feb 14 18:19:27 rogge syncserver[7344]: Unlocked

The system has been happily replicating for another day now, but the
fact that it stalled in this way was new to me. Maybe I overlooked it in
the other cases (mwah), and I was too impatient this time ;-) but I
guess waiting for half an hour to see if the replication continues
(while seeing no significant traffic between the hosts) should be enough.

I'm not very worried about this right now; I'll just carefully check the
replica every time, and maybe monitor this a bit better.
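
For the monitoring part, something along these lines might work. It is only
a rough sketch, assuming the syslog format shown in the excerpts above: it
flags any syncserver process that logged a "login:" line but no "Unlocked"
line within a given time. The function name find_stalled_sessions, the
five-minute default, and the year argument (syslog timestamps carry no year)
are hypothetical choices of mine, not anything provided by Cyrus.

```python
import re
from datetime import datetime, timedelta

# Matches syslog lines such as:
#   Feb 14 18:19:20 rogge syncserver[7344]: Unlocked
LINE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+\S+\s+"
    r"syncserver\[(?P<pid>\d+)\]:\s+(?P<msg>.*)$"
)


def find_stalled_sessions(lines, now, max_wait=timedelta(minutes=5), year=2007):
    """Return sorted PIDs that logged in but have not logged 'Unlocked'
    more than max_wait after their login (hypothetical helper)."""
    waiting = {}  # pid -> login time, still waiting for the first "Unlocked"
    for line in lines:
        m = LINE_RE.match(line.rstrip("\n"))
        if not m:
            continue  # not a syncserver line (or a wrapped continuation)
        # Syslog omits the year, so it is supplied by the caller (assumption).
        ts = datetime.strptime(f"{year} {m['ts']}", "%Y %b %d %H:%M:%S")
        pid = int(m["pid"])
        msg = m["msg"].strip()
        if msg.startswith("login:"):
            waiting[pid] = ts
        elif msg == "Unlocked":
            waiting.pop(pid, None)
    return sorted(pid for pid, ts in waiting.items() if now - ts > max_wait)


if __name__ == "__main__":
    sample = [
        "Feb 14 17:17:07 rogge syncserver[7033]: accepted connection",
        "Feb 14 17:17:07 rogge syncserver[7033]: login: tarwe-ng.surfnet.nl",
        "Feb 14 18:19:18 rogge syncserver[7344]: login: tarwe-ng.surfnet.nl",
        "Feb 14 18:19:20 rogge syncserver[7344]: Unlocked",
    ]
    print(find_stalled_sessions(sample, now=datetime(2007, 2, 14, 18, 30, 0)))
```

Run from cron against the replica's log, a non-empty result would be the
signal to go and look at (or restart) the sync_server. The threshold would
need tuning; I have no idea yet what a sensible value is.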

Regards,
Paul


