failover scenarios for replication

Mon Aug 28 20:40:17 EDT 2006

On Mon, Aug 28, 2006 at 02:23:22PM -0400, Wesley Craig wrote:
> On 26 Aug 2006, at 16:09, Paul Dekkers wrote:
> >Right now, it looks tricky to me to enable replication after failover,
> >or the replicated machine itself if you're not sure that the replica is
> >identical and the sync-processes finished completely: if a message-file
> >is in place on machine A (say "7.") but it was not replicated to machine
> >B while that one becomes the master, the machine B will create a new
> >file 7. and both machines consider this file synchronised after that:
> >also if roles switch back, you have two different (with one isolated)
> >copies of 7.
> 
> As I understand it, this is what replication uuids are for.  Not that  
> I've experimented with this particular case.

All that replication UUIDs do is make sure that the copy of '7' on
the master overwrites the copy of '7' on the replica.  It doesn't make
any attempt to retain '7' from the replica.
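
As a rough illustration of that behaviour (a toy model only, not Cyrus
code: the Message class and sync_mailbox function are made up for this
sketch), a master-wins sync over two diverged copies of '7' looks like:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    uid: int    # message file number, e.g. 7 for the file "7."
    uuid: str   # replication UUID given to the message when it was created


def sync_mailbox(master: dict, replica: dict) -> list:
    """Push master state to the replica and return the overwritten UIDs.

    Both arguments map uid -> Message; the replica dict is updated in place.
    """
    overwritten = []
    for uid, msg in master.items():
        existing = replica.get(uid)
        if existing is not None and existing.uuid != msg.uuid:
            # Same UID but a different UUID: the master copy wins and the
            # replica's copy is dropped, with no attempt to keep it.
            overwritten.append(uid)
        replica[uid] = msg
    return overwritten
```

In this model the replica's version of '7' is simply gone after the
sync; keeping it would need an extra step that moves it to a fresh UID.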

> >Or is it only preferred to use a replica if there is a really serious
> >crash on the (previous) master?
> 
> That's certainly how I view the current system.  Until replication is  
> more reliable, I'd be quite leery of any sort of automatic failover.

Ditto.  Our 'init scripts' actually check a database table to see which
role a particular instance on a machine has and then start up in that
mode.  Changing over the database table entry is a manual step.
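
A minimal sketch of that kind of role lookup, under stated assumptions:
the instance_roles table, its columns, and the two config paths are
invented for the example, and sqlite3 stands in for whatever database
is really used.

```python
import sqlite3

VALID_ROLES = {"master", "replica"}


def lookup_role(conn: sqlite3.Connection, instance: str) -> str:
    """Return the role recorded for an instance, or raise LookupError."""
    row = conn.execute(
        "SELECT role FROM instance_roles WHERE instance = ?",  # hypothetical schema
        (instance,),
    ).fetchone()
    if row is None or row[0] not in VALID_ROLES:
        # Refuse to guess: starting in the wrong mode is worse than not starting.
        raise LookupError(f"no valid role recorded for instance {instance!r}")
    return row[0]


def config_for_role(role: str) -> str:
    """Map a role to the cyrus.conf variant to start with (hypothetical paths)."""
    return {
        "master": "/etc/cyrus-master.conf",
        "replica": "/etc/cyrus-replica.conf",
    }[role]
```

The init script would call lookup_role() and start with the matching
config; failing hard when no role is recorded keeps the switch manual.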

The master init script also attempts to run the remaining log files with
sync_client if there are any.  Sadly, sync_client doesn't interact well
with real-time requirements and the replica being away.  Bah.  I'll
get back to my "-o" => "only try to connect ONCE" patch one day.
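
The idea behind "connect once" can be sketched like this. It is plain
Python sockets, not the sync_client patch itself; the connect function
and its once, retry_delay and max_attempts parameters are assumptions
made for the illustration.

```python
import socket
import time


def connect(host, port, once=False, retry_delay=1.0, max_attempts=None, timeout=5.0):
    """Try to open a TCP connection; return the socket, or None on giving up.

    With once=True a single failed attempt returns immediately, so a
    startup script is not left blocked while the replica is away.
    """
    attempts = 0
    while True:
        attempts += 1
        try:
            return socket.create_connection((host, port), timeout=timeout)
        except OSError:
            if once or (max_attempts is not None and attempts >= max_attempts):
                return None
            # Otherwise keep retrying, which is the blocking behaviour
            # that hurts when the other side is down.
            time.sleep(retry_delay)
```

A caller that gets None back can leave the log files in place and carry
on starting up, and replay them later once the replica is reachable.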

> >It sounds nice to me if I could use heartbeat or (u)carp (/ifstated)
> >like systems to start and stop a sync_client or sync_server copy of
> >cyrus (both different cyrus.conf) as soon as the state of the virtual
> >interface changes, but then it is even more likely that some replication
> >process is not finished without an admin even noticing it.
> 
> I agree, this is a great goal.  I'd be interested in seeing a roadmap  
> for how to achieve it, including how failback would occur.  There's a  
> lot of opportunity to share operational experience with Cyrus.  If  
> only there was a forum to publish such information...

Yeah, I've had a play with using heartbeat.  The downside is that while
its colocation works, ordering operations without having a dependency
failure take the other side down as well doesn't work properly.  You
can't say both "always start the master in preference" and "start the
replica first if you can" (which makes master startup actually work at
the moment!).

I might look at it again in a bit though; so far 2.0.7 looks nicer than
2.0.5 was in terms of tools working sanely.

Bron.