Cyrus crashed on redundant platform - need better availability?

Fri Sep 10 10:32:33 EDT 2004

Jure Pečar wrote:

>>Although many on the list claim that this (having 2 boxes with 1 
>>disk-array) is a nice way to get redundancy, I now doubt that this is 
>>true. It still takes 30 mins before everything is back again! It seems 
>>to me that if there were a "live" version of cyrus available with a 
>>synchronised mail-spool, there would be no outage noticeable to users 
>>(except maybe a lost connection). Am I right?
>>    
>>
>Having 2 boxes with one disk array leaves you with a single point of failure
>that you wouldn't think of immediately: the filesystem. I learned that the
>hard way.
>  
>
Yes, I agree.

>I'm planning to 'redesign' our storage: instead of one big volume that fscks
>for hours, I'm going to split it into many mirrors and use them as cyrus
>partitions. This way they could all fsck in parallel. I'm going to lose the
>'single instance store' capability, but that's a tradeoff that I'm willing to
>make.
>  
>
Hmm, then your fscks will run faster and with fewer problems, but there 
is still an outage, one that you could prevent with a different kind of 
failover plus availability/replication on the application level.
If the spools are replicated it doesn't matter whether the fsck takes 
long or not... although there will be a backlog of course.
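
To put rough numbers on the parallel-fsck idea, here is a minimal sketch. 
The durations are made up for illustration (not measurements), the helper 
name fsck_downtime is hypothetical, and it assumes the mirrors sit on 
independent disks so the fscks do not compete for I/O:

```python
def fsck_downtime(minutes_per_volume, parallel):
    """Return the minutes until every volume has finished its fsck.

    minutes_per_volume: hypothetical fsck duration of each volume.
    parallel: True if all volumes are checked at the same time.
    """
    if parallel:
        # All checks run side by side, so the slowest volume decides.
        return max(minutes_per_volume, default=0)
    # One check after another, so the durations add up.
    return sum(minutes_per_volume)


# Four mirrors with an assumed 30 minutes of fsck each.
volumes = [30, 30, 30, 30]
print("serial:  ", fsck_downtime(volumes, parallel=False), "min")
print("parallel:", fsck_downtime(volumes, parallel=True), "min")
```

Either way the result is greater than zero, which is the point above: 
splitting the volume shortens the outage but does not remove it.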

Is it possible to have an fsck running on one partition and have cyrus 
started already, so that part of the mail-store (e.g. archives) is not 
available yet?

>It happened to me at least once that the machine that crashed corrupted the
>filesystem in a way that the machine that took over also crashed within
>hours... 
>  
>
>>Maybe it's time to continue on the "High availability ... 
>>again"-discussion we had a while ago. If the cyrus developers are able 
>>to implement this with some funding there are still some questions left 
>>for me: how much time would it take before a "stable" solution is ready? 
>>How much funding is expected? I still have to talk to management about 
>>this, but I would really support this development and I'm certainly 
>>willing to convince some managers.
>>    
>>
>The only high availability I see here is the Google way. Cyrus is offering
>you that with the 'murder' component.
>  
>
That's not really availability, but distributed risk.

>BTW, you're mentioning FreeBSD ... doesn't it have some sort of background
>fsck while the filesystem is mounted rw? 
>  
>
It can, but I'm not sure if that's what I prefer. I'm not sure how 
mature it is on FreeBSD, and I prefer mail integrity over a 
"quick restore".

Paul

---
Cyrus Home Page: http://asg.web.cmu.edu/cyrus
Cyrus Wiki/FAQ: http://cyruswiki.andrew.cmu.edu
List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html