LARGE single-system Cyrus installs?

Michael Bacon baconm at
Tue Nov 13 16:41:14 EST 2007

At the risk of being yet one more techie who thinks he has a workaround...

I'm back (in the past two months) doing Cyrus administration after a three 
year break.  I ran the Cyrus instance at Duke University before, and am now
getting up to speed to run the one at UNC.  At Duke we started as a 
multi-host install, and moved to a single instance just as I was leaving. 
Here at UNC, we've been on a single instance for years.  Both places have 
been Solaris all along, and both places had over 50k users and received 
several million messages a day.

Part of the way we handle it here is with massive hardware -- an 
8-processor Sun 6800 with the processor boards swapped out to UltraSPARC 
IVs.  Even that hardware is a couple of years old at this point.  That said, our CPU 
load is really pretty minimal.

While we're on a very old version of Cyrus right now (1.6), reading this 
thread I think I've got a good feel for what you're looking at.  There's 
been a lot of talk about the linked list in the kernel and the fact that it 
freezes all processes with that file mmap'ed when the file gets written. 
If traversing that linked list were really the problem, I think we would 
have seen a total system meltdown here a long time ago.

I'm much more inclined to think that what you're running into is all of the 
processes freezing during the latency period for the rewrite of the 
mailboxes file.  This won't show up as I/O blocking on your disk, as there 
won't be any real contention for that file or even for the channel.  But 
the latency of the write, while only a few milliseconds, is going to kill 
you if your mailboxes file gets big.
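To put rough numbers on that (all hypothetical -- the "several million 
messages a day" is from above, but the 10 ms rewrite latency and the 
assumption that updates serialize on the lock are illustrative, not 
measurements from either site):

```python
# Back-of-envelope sketch with assumed numbers, not real telemetry.
messages_per_day = 3_000_000            # "several million messages a day"
avg_per_sec = messages_per_day / 86_400 # average delivery rate

rewrite_latency_s = 0.010               # assume 10 ms per mailboxes rewrite

# If every update to the mailboxes file serializes on its lock, the
# fraction of wall-clock time the lock is held is roughly:
utilization = avg_per_sec * rewrite_latency_s
print(f"{avg_per_sec:.0f} msgs/sec average, lock held {utilization:.0%} of the time")
```

Even at the daily average, a few milliseconds per rewrite eats a large 
fraction of every second; at peak load the writers queue up and everything 
mmap'ing the file stalls behind them.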

I haven't had any role yet in the design and configuration of UNC's system, 
but there's one thing we have that I think saves us an enormous amount of 
pain.  Since we're still on 1.6, and hence using the "plain text" mailboxes 
format, bear in mind that all changes to the mailboxes database involve a 
lock on the file, a complete rewrite of the file next to it on the file 
system, and a rename() system call.  This is SLOOOWWW.  How are we not dead?

Solid state disk for the partition with the mailboxes database.
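For anyone who hasn't stared at that code path, the whole-file update 
pattern the previous paragraph describes looks roughly like this sketch 
(illustrative Python, not Cyrus source -- the function and argument names 
are made up):

```python
import fcntl
import os
import tempfile

def rewrite_mailboxes(path, entries):
    """Sketch of the 1.x-era "plain text" update pattern: take an exclusive
    lock, write a complete new copy next to the file, then rename() it into
    place.  Hypothetical helper, not the actual Cyrus implementation."""
    with open(path, "r+") as f:
        fcntl.flock(f, fcntl.LOCK_EX)        # writers exclude each other
        # Temp file in the same directory, so rename() stays on one filesystem.
        fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
        with os.fdopen(fd, "w") as out:
            for line in entries:             # full rewrite: cost grows with file size
                out.write(line + "\n")
            out.flush()
            os.fsync(out.fileno())           # this is the latency that hurts
        os.rename(tmp, path)                 # atomic replace
        # lock released when f closes
```

The rename() is atomic, but the rewrite before it is O(file size) and 
happens under the lock, which is why the latency of that one fsync matters 
so much.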

This thing is amazing.  We've got one of the gizmos with a battery backup 
and a RAID array of Winchester disks that it writes off to if it loses 
power, but the latency levels on this thing are non-existent.  Writes to 
the mailboxes database return almost instantaneously when compared to 
regular spinning disks.  Based on my experience, that's bound to be a much 
bigger chunk of time than traversing a linked list in kernel memory.
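If you want to see where your own mailboxes partition falls, a crude 
write-plus-fsync latency probe (an illustrative script, not a Cyrus tool) 
is enough to show the difference between spinning disk and the solid state 
gizmo:

```python
import os
import tempfile
import time

def fsync_latency(dir_path, size=1 << 20, trials=20):
    """Time write()+fsync() of a `size`-byte file in `dir_path`.
    Returns (best, average) latency in seconds."""
    data = b"x" * size
    samples = []
    for _ in range(trials):
        fd, tmp = tempfile.mkstemp(dir=dir_path)
        try:
            t0 = time.perf_counter()
            os.write(fd, data)
            os.fsync(fd)                  # wait for stable storage
            samples.append(time.perf_counter() - t0)
        finally:
            os.close(fd)
            os.unlink(tmp)
    return min(samples), sum(samples) / len(samples)

best, avg = fsync_latency("/tmp")
print(f"best {best * 1000:.2f} ms, average {avg * 1000:.2f} ms")
```

Run it once against the spinning array and once against the solid state 
partition; the best-case numbers make the comparison obvious.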

For anyone doing a big Cyrus install, I would strongly recommend this.

Michael Bacon
ITS - UNC Chapel Hill

--On Friday, November 09, 2007 10:35 AM -0800 Vincent Fox 
<vbfox at> wrote:

> Jure Pečar wrote:
>> In my experience the "brick wall" you describe is what happens when disks
>> reach a certain point of random IO that they cannot keep up with.
> The problem with a technical audience is that everyone thinks they have
> a workaround or probable fix you haven't already thought of.  No
> offense, I am guilty of it myself, but it's very hard to sometimes say
> "I DON'T KNOW" and dig through telemetry and instrument the software
> until you know all the answers.
> With something as complex as Cyrus, this is harder than you think.
> Unfortunately when it comes to something like a production mail service
> these days it's nearly impossible to get the funding and manhours and
> approvals to run experiments on live guinea pigs to really get to the
> bottom of problems.  We throw systems at the problem and move on.
> But in answer to your point, our iostat numbers for busy and service
> time didn't indicate any I/O issue.  That was the first thing we
> looked at, of course.  Even by eyeball, our array drives are more idle
> than busy.
