Cyrus crashed on redundant platform - need better availability?

Paul Dekkers Paul.Dekkers at surfnet.nl
Fri Sep 10 07:24:42 EDT 2004


Hi,

We're implementing a new mail platform running on two Dell 2650 servers
(each with two 3 GHz Xeon CPUs with HTT, and 3 GB of memory) and with a
disk array of 4 TB connected through an Adaptec 39160 SCSI controller for
storage. We installed FreeBSD 5.2.1 on it, and - of course - Cyrus 2.2.8
(from the ports) as IMAP server. Our MTA is Postfix.
There are two machines for redundancy. If one fails, the other one 
should take over: mount the disks from the array, and move on.

Unfortunately, the primary server has crashed twice already. The first
time was while synchronising two IMAP spools from the old server to the
new one. There was not much data on it back then. The second time was
worse: around 10 GB of mail was stored on the disks. We discovered that
the fsck took about 30 minutes, so although we have two machines for
redundancy it still takes quite some time before the mail is available
again. (And we still have about 90 GB of mail to migrate, so once all
users are migrated it will take much longer.)
I have now mounted the filesystems synchronously: although this slows
down the system, I hope it will speed up the fsck a bit if there is
another crash.
The second crash happened while a lot of mailboxes were being removed
(dm) and some of them were being removed at the same time through a
webmail app (SquirrelMail).
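
To put a rough number on "much longer", here is a back-of-the-envelope
sketch. It assumes (and this is only an assumption, not something we
measured) that fsck time grows linearly with the amount of stored mail.
In practice it depends more on the number of files and on the filesystem
layout than on raw size. The function name is made up for illustration.

```python
def estimate_fsck_minutes(observed_minutes, observed_gb, projected_gb):
    """Extrapolate fsck duration, ASSUMING it scales linearly with data size."""
    return observed_minutes * (projected_gb / observed_gb)


# Figures from this mail: about 30 minutes of fsck at about 10 GB,
# with about 90 GB still to migrate.
observed_gb = 10
remaining_gb = 90
total_gb = observed_gb + remaining_gb

estimate = estimate_fsck_minutes(30, observed_gb, total_gb)
print(f"Estimated fsck at {total_gb} GB: "
      f"{estimate:.0f} minutes ({estimate / 60:.1f} hours)")
```

Under that linear assumption a crash after the full migration would mean
an outage of hours rather than minutes, which is what worries me.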

I'm not sure why the box crashed; there was nothing in the logs and
nothing on the screen when we got there, it had simply rebooted.
Of course I'd be interested to hear if anyone has any thoughts on this.

Although many on the list claim that this (having 2 boxes with 1
disk array) is a good way to get redundancy, I now doubt whether this is
true. It still takes 30 minutes before everything is back again! It
seems to me that if there were a "live" version of Cyrus available with
a synchronised mail spool, there would be no noticeable outage for users
(except perhaps a lost connection). Am I right?

Maybe it's time to continue the "High availability ...
again" discussion we had a while ago. If the Cyrus developers are able
to implement this with some funding, there are still some questions left
for me: how much time would it take before a "stable" solution is ready?
How much funding is expected? I still have to talk to management about
this, but I would really support this development and I'm certainly
willing to convince some managers.

Regards,
Paul


---
Cyrus Home Page: http://asg.web.cmu.edu/cyrus
Cyrus Wiki/FAQ: http://cyruswiki.andrew.cmu.edu
List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html



