Cyrus crashed on redundant platform - need better availability?

Wed Sep 15 07:38:43 EDT 2004

Hi,

Sebastian Hagedorn wrote:

>> You are not using a clustered filesystem,
>> right?
>
> No.

I can imagine that would be one of the advantages of RH's clustering,
since in that case you don't have to mount a filesystem for a machine
that just crashed - it would save time...
But I suppose RH's cluster manager takes care of mounting the partitions
and checking them for errors.

>>> It's good but not perfect. We recently installed a huge SAN and are
>>> now in the process of moving over the mail data to reside there.
>>> Fibrechannel seems to be much more error tolerant than SCSI.
>>
Were you working with a "multi-initiator environment" (as RH calls it)
or "single initiator" (e.g. with 2 machines on exactly the same SCSI
bus, or two separate interfaces on your array's SCSI controller)?
I think with a multi-initiator environment (as we have it) there is a
very limited chance of failures.

>> Hmm, I don't expect the problems to be SCSI-related. Maybe it has to 
>> do...
>
> That's not what I was talking about. We have a similar setup, yet 
> still there were instances when Red Hat's cluster software failed to 
> write to the shared storage. I guess this was caused by the slow-downs 
> connected to the memory management, but Red Hat support indicated that 
> shared storage connected via FibreChannel would not have been as 
> susceptible to these problems.

Do you think RH's cluster software is worth considering for
this kind of clustering setup? On FreeBSD there are not that many
clustering solutions for now, and if it's advisable to at least consider
using RH here (although I have no experience with RH) we can certainly
look at it. (Any idea how fast RH would "recover services"?)

On the other hand, if application-level redundancy is on its
way, it doesn't really matter what platform the machine runs on, so that
would make me happier still, even with FreeBSD. And I would rather
put my money there. Even if it means we'll have to wait some months,
we would do that and take the risk of running in a "less
automatic" failover situation with a worst-case downtime of 30 mins (or
2 mins regularly with sync-mounted filesystems now).

>> The kernel that shipped with RedHat AS 2.1 was useless for most of the
>> tasks I tried it with. About three revisions later it became somewhat
>> more useful for non-Oracle types of use, but I've rolled my own and am
>> not following the state of it now.
>
> That's fine if you don't have to rely on commercial support. Our 
> management decided to go the supported path all the way. That doesn't 
> leave you many options. I have to say that when it works, the cluster 
> software works extremely well. It's just that it hasn't always worked 
> in the past ... ;-)

That's a plus for RH (ES|AS) 3

Paul

---
Cyrus Home Page: http://asg.web.cmu.edu/cyrus
Cyrus Wiki/FAQ: http://cyruswiki.andrew.cmu.edu
List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html