replication

Shuvam Misra shuvam.misra at merceworld.com
Fri Nov 12 21:59:39 EST 2010


> Quoting Bron Gondwana <brong at fastmail.fm>:
> >
> > It's getting better, but it's still not 100% reliable to have
> > master/master replication between two servers with interactions
> > going to both sides.
> >
> > It SHOULD be safe now to have a single master/master setup with
> > individual users on one side or the other - but note that nobody
> > is known to be running that setup successfully yet.
> >
> > As for what the point is?  I don't know about you, but I run a
> > 24hr/day shop, and I like to be able to take a server down for
> > maintenance in about 2 minutes, with users seeing a brief
> > disconnection and then being able to keep using the service
> > with minimal disruption.
> >
> > Bron.
> 
> As Bron already mentioned the problems of master/master mode,
> you can easily live without it.
> 
> We run multiple servers, these are paired: each server runs one
> cyrus instance as master and one as slave, so that the pairs
> replicate each other. In case of a crash, one server would run two
> master instances.
> 
> You only need a way of splitting the users between the servers.
> That could be DNS, a proxy, or a Murder setup.
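
If I've understood the pairing right, the configuration on each box
would look roughly like the sketch below. This assumes Cyrus 2.3-style
rolling replication; the hostnames and credentials are invented:

    # imapd.conf on the master instance (its replica lives on serverB):
    sync_log: 1
    sync_host: serverB.example.com
    sync_authname: repluser
    sync_password: secret

    # cyrus.conf on the master instance, START section:
    syncclient    cmd="sync_client -r"

    # cyrus.conf on the slave instance on serverB, SERVICES section:
    syncserver    cmd="sync_server" listen="csync"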

Are you using local storage on each server for spool and metadata?

How good or bad an idea is it to use shared storage (an external SAN
chassis) and let multiple servers keep their spool areas there? Can
one set up, say, half a dozen servers in a pool, each using a separate
LUN for spool+metadata on a common back-end SAN chassis? Out of the six
servers, one would be a hot spare, standing by. If any of the five active
servers failed, the standby would be told to mount the failed server's
LUN, borrow the failed server's IP address, and start offering services.
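
Concretely, the takeover on the hot spare might be nothing more than a
few steps like these (the device name, mount point, address and init
script name are of course invented):

    # take over for failed server mail3
    fsck -y /dev/mapper/mail3-lun        # the step whose duration worries me (see below)
    mount /dev/mapper/mail3-lun /var/spool/imap
    ip addr add 192.0.2.13/24 dev eth0   # borrow mail3's service IP
    /etc/init.d/cyrus-imapd start        # start serving mail3's users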

In this proposed model, each user's account is on one "physical" server
(i.e. bound to a specific IP address). No load balancing or connection
spreading is needed when clients connect. If the site chooses to use
Murder, the proposed model can apply to the back-ends while the
multiplexer takes care of the front-end.
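
In Murder terms, I imagine the binding would just be the usual
front-end/back-end split; something like this on a front-end, with
hostnames and credentials again invented:

    # imapd.conf on a Murder front-end (multiplexer):
    mupdate_server: mupdate.example.com
    serverlist: mail1.example.com mail2.example.com mail3.example.com
    proxy_authname: proxyuser
    proxy_password: secret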

The only thing I'm not sure about is file-system corruption when a
node goes down, and the time taken for an fsck before the standby node
can assume the failed node's role. I wonder whether something like
ext4 will help reduce fsck times to acceptable levels.
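
My understanding, which I'd be happy to have corrected: with a
journalling file system such as ext3 or ext4, crash recovery at mount
time is normally just a journal replay rather than a full fsck, so the
takeover delay should be seconds rather than hours. For instance
(device name again invented):

    mkfs.ext4 /dev/mapper/mail3-lun           # journalling fs on the LUN
    tune2fs -c 0 -i 0 /dev/mapper/mail3-lun   # no periodic forced fscks

That would leave only the journal replay on the takeover path.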

Is this a good idea for a scalable fault-tolerant Cyrus setup? I've been
toying with this approach for some time, for a proposed large-system design.

Shuvam

