LARGE single-system Cyrus installs?
Rob Mueller
robm at fastmail.fm
Thu Oct 4 19:32:52 EDT 2007
> Anyhow, just wondering if we're the lone rangers on this particular
> edge of the envelope. We alleviated the problem short-term by
> recycling some V240 class systems with arrays into Cyrus boxes
> with about 3,500 users each, and brought our 2 big Cyrus units
> down to 13K-14K users each which seems to work okay.
FastMail has many hundreds of thousands of users in a fully replicated setup
spread across 10 backend servers (plus separate MX/spam/web/frontend servers).
We use IBM servers with some off-the-shelf SATA-to-SCSI RAID DAS (e.g.
http://www.areasys.com/area.aspx?m=PSS-6120). Hardware will die at some
stage; that's what replication is for.
Over the years we've tuned a number of things to get the best possible
performance. The biggest things we found:
1. Using the status cache was a big win for us
I did some analysis at one stage and found that most IMAP clients issue
STATUS calls to every mailbox a user has on a regular basis (every 5 minutes
or so on average, though users can usually change the interval) so they can
update the unread count on every mailbox. The default STATUS implementation
has to iterate over the entire cyrus.index file to get the unread count.
Although cyrus.index is the smallest file, with tens of thousands of users
connected and their clients doing this regularly for every folder, you either
have to have enough memory to keep every cyrus.index hot in memory, or every
5-15 minutes you'll be forcing a re-read of gigabytes of data from disk, or
you need a better way.
The better way was to have a status cache.
http://cyrus.brong.fastmail.fm/#cyrus-statuscache-2.3.8.diff
This helped reduce meta data IO a lot for us.
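The idea behind the patch can be sketched in a few lines (this is a
hypothetical illustration in Python, not the actual Cyrus code, which stores
the cache in a cyrusdb file): remember each mailbox's STATUS result alongside
a validity token such as the index file's mtime, and only fall back to the
full cyrus.index scan when the mailbox has actually changed.

```python
# Hypothetical sketch of a STATUS cache. The real Cyrus patch keeps
# this in a database keyed by mailbox; the shape of the logic is the same.

class StatusCache:
    def __init__(self):
        self._cache = {}  # mailbox -> (index_mtime, status_dict)

    def get(self, mailbox, index_mtime, compute_status):
        """Return the cached STATUS if the index file is unchanged,
        otherwise recompute it (the expensive cyrus.index scan) and cache it."""
        entry = self._cache.get(mailbox)
        if entry is not None and entry[0] == index_mtime:
            return entry[1]  # cache hit: no index scan needed
        status = compute_status(mailbox)
        self._cache[mailbox] = (index_mtime, status)
        return status

    def invalidate(self, mailbox):
        # Called on anything that changes the mailbox (APPEND, EXPUNGE, ...)
        self._cache.pop(mailbox, None)
```

With clients polling every 5 minutes but most mailboxes changing far less
often, almost every STATUS becomes a cache hit instead of an index read.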
2. Split your email data + metadata IO
With the 12-drive SATA-to-SCSI arrays, we use 4 x 150G 10k RPM WD Raptor
drives + 8 x (largest you can get) drives. We then build 2 x 2-drive RAID1 +
2 x 4-drive RAID5 arrays. We use the RAID1 arrays for the meta data (cyrus.*
except squatter) and the RAID5 arrays for the email data. We find the
email-to-meta ratio is about 20-to-1 (higher if you have squatter files), so
150G of meta will support up to 3T of email data fine.
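The capacity arithmetic works out like this (the Raptor size is from the text;
the large-drive size is an illustrative assumption, since "largest you can
get" varies over time):

```python
# Capacity sketch for the 12-drive layout described above.
raptor_gb = 150   # 4 x 150G WD Raptors (from the text)
big_gb = 750      # 8 x "largest you can get" -- 750G is an assumed figure

meta_usable = 2 * raptor_gb           # 2 x 2-drive RAID1 = 2 mirrors of 150G
spool_usable = 2 * (4 - 1) * big_gb   # 2 x 4-drive RAID5 = 3 data drives each

ratio = 20                            # observed email-to-meta ratio
max_email_gb = meta_usable * ratio    # email the meta capacity can serve

print(meta_usable, spool_usable, max_email_gb)  # 300 4500 6000
```

So each 150G RAID1 mirror covers the metadata for roughly 3T of spool, and the
pair covers ~6T, comfortably ahead of what the RAID5 arrays hold.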
From our iostat data, this seems to be a nice balance. A rough estimate from
iostat shows the meta data getting 2x the rkB/s and 3x the wkB/s of the
email spool, even though it's 1/20th the data size and we have the status
cache patch! Basically the meta data is "very hot", so optimising access to
it is important.
3. Not really related to Cyrus, but we switched from perdition to nginx as a
frontend POP/IMAP proxy a while back. If you've got lots of IMAP
connections, it's a really worthwhile improvement.
http://blog.fastmail.fm/2007/01/04/webimappop-frontend-proxies-changed-to-nginx/
4. Lots of other little things
a) putting the proc dir on tmpfs is a good idea
b) make sure you have the right filesystem (on Linux, reiserfs is much
better than ext3, even with ext3's dir hashing) and journaling modes
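For (a), a mount along these lines does the trick (the path is illustrative;
use the proc directory under your own imapd.conf configdirectory):

```shell
# Put Cyrus's proc directory on tmpfs so the per-connection proc files
# never hit disk. Path depends on your configdirectory setting.
mount -t tmpfs -o size=64m,mode=0750 tmpfs /var/imap/proc

# Or persistently, via a line in /etc/fstab:
# tmpfs  /var/imap/proc  tmpfs  size=64m,mode=0750  0  0
```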
> That is our hypothesis right now, that the application has certain limits
> and if you go beyond a certain number of very active users on a
> single backend bad things happen.
Every application has that problem at some point. Consider something that
uses CPU only, where every new unit of work takes the CPU 0.1 seconds: you
can handle up to 10 units of work arriving per second, no problem. If 11
units per second arrive, then after 1 second you'll have done 10 and have 1
unit still to do, but another 11 units arrive in the next second, and so on.
In theory, your outstanding work queue grows forever.
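The queue growth is easy to see in a toy simulation (a model of the argument
above, not a Cyrus benchmark):

```python
# Toy queueing model: a server that completes 10 units of work per second,
# with work arriving at a fixed rate. Above capacity, the backlog grows
# without bound; at or below capacity, it stays flat.

def backlog_after(arrival_rate, service_rate=10, seconds=60):
    """Outstanding units of work after `seconds` of steady arrivals."""
    queue = 0.0
    for _ in range(seconds):
        queue += arrival_rate                    # work arriving this second
        queue = max(0.0, queue - service_rate)   # work completed this second
    return queue

print(backlog_after(10))  # 0.0  -- at capacity, the queue stays empty
print(backlog_after(11))  # 60.0 -- 1 extra unit/sec piles up, forever
```

One unit per second over capacity is only a 10% overload, but after a minute
there's already a full minute's worth of work queued.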
Cyrus isn't CPU limited by a long shot, but it can easily become IO limited.
The same effect happens with IO; it's just more noticeable because disks are
slow. If you start issuing IO requests faster than the disk system can
service them, the IO queue grows quickly and the system starts crawling.
The only ways to improve it are to reduce your IOPs (e.g. fewer users, or
optimise the application to issue fewer IOPs in some way) or to increase the
IOPs your disk system can handle (e.g. more spindles, faster spindles,
NVRAM, etc).
That's what 1 (reduce the IOPs the application generates) and 2 (put hot
data on faster spindles) above are both about, rather than the other option
(fewer users per server).
Rob
More information about the Info-cyrus mailing list