Funding Cyrus High Availability

Fri Sep 17 04:05:45 EDT 2004

David Lang wrote:

>>>> Question: Are people looking at this as both redundancy and
>>>> performance, or just redundancy?
>>>
>>> Cyrus performs pretty well already. Background redundancy would be 
>>> awesome. Especially if we had control over when the syncing process 
>>> occurred either via time interval or date/time.
>>
>> I would say not at an interval but as soon as there is an action 
>> performed on one mailbox, the other one would be pushed to do 
>> something. I believe that is called rolling replication.
>>
>> I would not be really happy with an interval synchronisation. It would 
>> make it harder to use both platforms at the same time, and that is 
>> what I want as well. So there is a little bit of load-balancing 
>> involved, but more and more _availability_.
>>
>> Being able to use both platforms at the same time maybe implies that 
>> there is either no master/slave role or that this is auto-elected 
>> between the two and that this role is floating...
>
> right, but there are already tools freely available on most platforms 
> to do the election and changing of the role (by switching between 
> config files and restarting the master) what is currently lacking is 
> any ability to do the master/slave role. once we have that it's just a 
> little scripting to tie just about any failover software in to make it 
> automatic.

There are indeed tools available for that, but they don't always 
work as they're supposed to and are often very OS-limited. With 
FreeBSD I had no luck with heartbeat (wouldn't compile under FreeBSD-5), 
(U)CARP was not available and FreeVRRP was buggy (at least in my case, 
sometimes I had two masters).

Also, I wouldn't like it if restarting the cyrus process with a 
different config file were necessary (a separate synchronisation process 
that needs restarting instead would be better). A restart would still 
kill connections to that cyrus process; I'd rather see the role 
switched in software.

Isn't it possible to have equal roles? If all changes are put in some 
backlog, and a synchroniser process runs on both machines and pushes the 
backlog (as soon as there is any) to the other machine... then you can 
have the same process on both (equal) servers... Of course there needs 
to be some more intelligence, but that's basically what I would expect.
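
To make the idea a bit more concrete, here is a minimal sketch in Python of the symmetric backlog scheme described above. It is not Cyrus code: the names `Change`, `Server`, `local_change`, `receive` and `push_backlog` are all made up for illustration, the "network" is just a direct method call, and conflict handling (the "more intelligence" part) is left out entirely.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Change:
    origin: str    # name of the server where the change was made
    mailbox: str
    action: str    # e.g. "append", "expunge", "setflag"


@dataclass
class Server:
    name: str
    applied: list = field(default_factory=list)    # changes applied locally
    backlog: deque = field(default_factory=deque)  # changes not yet pushed

    def local_change(self, mailbox, action):
        """A client acted on this server: apply it and queue it for the peer."""
        change = Change(self.name, mailbox, action)
        self.applied.append(change)
        self.backlog.append(change)

    def receive(self, change):
        """Apply a change pushed by the peer; it is not queued again."""
        self.applied.append(change)

    def push_backlog(self, peer, reachable=True):
        """Synchroniser step: drain the backlog to the peer, oldest first."""
        if not reachable:
            return 0  # keep the backlog until the peer is back
        pushed = 0
        while self.backlog:
            peer.receive(self.backlog.popleft())
            pushed += 1
        return pushed


# Both servers run the same code; neither one is "the master".
a, b = Server("a"), Server("b")
a.local_change("user.paul", "append")
b.local_change("user.david", "setflag")
a.push_backlog(b)
b.push_backlog(a)
```

A server that cannot reach its peer simply keeps its backlog and pushes it later. What happens when both sides changed the same mailbox in the meantime is exactly the part this sketch does not answer.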

> one thing we need to watch out for here is that we don't set an 
> impossible/unreasonable goal.

I agree that we'll have to define properly what we expect and what is 
reasonable, but I think that at this moment Ken (as developer) has the 
best overview in this. We offer our wishlist, and I suppose he 
translates that to code in his head ;-)
I suppose that's why he came up with the question about performance 
versus redundancy/availability.

> don't try to solve every problem and add every availability feature you 
> can imagine all at once. instead let's look at the building blocks 
> that are needed and identify what's currently not available.

I don't agree there completely: I don't want to depend on yet another 
tool that defines what the master or slave is. Sometimes they don't work 
at all, or work only on the same LAN, ... I'm not sure you can count on 
that.
(Hmm, you're the first that mentions the clustering software for 
defining roles, and I didn't read about this on your website either. 
This is new to me.)

> currently we have murder which will spread the load across multiple 
> machines.

Yes, that's indeed something we don't need to look at :-)
(Although there is a possibility now to spread load as well of course, 
with two machines available at the same time...)

> currently we have many tools available to detect a server failure and 
> run local scripts to reconfigure machines (HACMP on AIX, heartbeat for 
> Linux, *BSD, Solaris, etc)
>
> what we currently do not have is any ability to have one mailstore 
> updated to match changes in another one.

I would combine these two, and I think that can be done just by 
designing the last thing you mention well.

> I also would not be really satisfied with interval synchronisation as 
> the only choice.

In my sketch above (really not sure if it works, of course), where both 
servers have something like a backlog, you can "tail" that backlog and 
push each update to the second machine as soon as possible. That solves 
the problem you mention of delays when pushing updates to two servers at 
the same time.
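
Tailing the backlog could look roughly like the sketch below, assuming the backlog is a plain append-only file with one change record per line. That file format and the function name `read_new_records` are assumptions of mine, not anything Cyrus provides.

```python
def read_new_records(f):
    """Return the complete records appended to f since the last call.

    f must be a file opened in binary mode ("rb"). A record that has
    been written only partially (no trailing newline yet) is left in
    place and picked up on a later call.
    """
    records = []
    while True:
        pos = f.tell()
        line = f.readline()
        if not line:
            break                # nothing new in the backlog
        if not line.endswith(b"\n"):
            f.seek(pos)          # partial record: wait for the writer
            break
        records.append(line[:-1].decode("utf-8"))
    return records
```

The synchroniser would call this in a loop (sleeping briefly when it returns an empty list) and push every returned record to the other machine, so the delay is bounded by the polling interval rather than by a fixed sync schedule.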

> I think we need something where the primary mailstore pushes a record 
> of its changes to the secondary mailstore

Why not also vice versa?!
We want the two servers to be accessible at the same time, right?

>> If one server is down it should mean that all tasks can be performed at the 
>> other one. I'm curious how this would look if both servers are still running 
>> but cannot reach each other. If there is indeed a UUID: what if there are 
>> duplicates... but I guess that has been taken into account.
>
>In cluster terminology this situation is known as being 'split-brained' 
>and is generally viewed as a 'VERY BAD THING' that each cluster software 
>solves in a slightly different way, from having an odd number of machines 
>in the cluster (so that only one half of the cluster can actually have 
>enough machines to function) to physically disconnecting power from a 
>machine deemed to have failed (if both boxes attempt to power each other 
>down one will generally win and avoid being shut off itself, but even if 
>they do manage to power each other down at least you avoided the 
>split-brain situation)
>
>leave this up to the cluster software. don't try to put this in cyrus 
>initially.
>  
>
I still don't see why we need clustering software here?! I only see 
application replication, no clustering software at all - am I wrong?

If we indeed need a mechanism for UUIDs for the messages, maybe one can 
define that on one server the IDs are odd and on the other even, or 
that each server uses a different range. (Not 
sure if this is really necessary, but in fact I really don't want to 
depend on clustering software.) I don't know, I suppose you already 
handled that with your patches?
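
The odd/even idea is easy to sketch. The following is only an illustration in Python with a made-up helper `id_allocator`; I have no idea how (or whether) the existing patches assign UUIDs, so treat the numbering scheme itself as an assumption.

```python
import itertools


def id_allocator(server_index, server_count, start=1):
    """Return an endless iterator of IDs reserved for one server.

    Server k of n gets start+k, start+k+n, start+k+2n, ... so the
    sequences of different servers never overlap.
    """
    if not 0 <= server_index < server_count:
        raise ValueError("server_index must be in range(server_count)")
    return itertools.count(start + server_index, server_count)


odd_ids = id_allocator(0, 2)    # first server: 1, 3, 5, ...
even_ids = id_allocator(1, 2)   # second server: 2, 4, 6, ...
```

Because each server only ever hands out IDs from its own sequence, the two can keep assigning IDs while they cannot reach each other, without any election of who is master.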

Paul

---
Cyrus Home Page: http://asg.web.cmu.edu/cyrus
Cyrus Wiki/FAQ: http://cyruswiki.andrew.cmu.edu
List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html