load balancing at fastmail.fm

Fri Jan 12 22:43:29 EST 2007

> as fastmail.fm seems to be a very big setup of cyrus nodes, I would be 
> interested to know how you organized load balancing and managing disk 
> space.
>
> Did you setup servers for a maximum of lets say 1000 mailboxes and then 
> you use a new server? Or do you use a murder installation so you can move 
> mailboxes to another server once a certain gets too much load? Or do you 
> have a big SAN storage with good mmap support behind an arbitrary amount 
> of cyrus nodes?

We don't use a murder setup. Two main reasons.
1) Murder wasn't very mature when we started
2) The main advantage murder gives you is a set of proxies (imap/pop/lmtp) 
to connect users to the appropriate backends, which we ended up using other 
software for, and a unified mailbox namespace if you want to do mailbox 
sharing, something we didn't really need either. Also the unifed mailbox 
needs a global mailboxes.db somewhere. As it was, because the skiplist 
backend mmaps the entire mailboxes.db file into memory, and we had multiple 
machines with 100M+ mailboxes.db files, I didn't really like the idea of 
dealing with a 500M+ mailboxes.db file.

We don't use a shared SAN storage. When we started out we didn't have that 
much money, so purchasing an expensive SAN unit wasn't an option.

What we have has evolved over time to our current point. Basically we now 
have a hardware set that is quite nicely balanced with regard to spool IO vs 
metadata IO vs CPU, and a storage configuration that gives us replication 
with good failure capability, but without having to waste lots of hardware 
on just having replica machines.

IMAP/POP frontend - We used to use perdition, but have now changed to nginx 
(http://blog.fastmail.fm/?p=592). As you can read from the linked blog post, 
nginx is great.

LMTP delivery - We use a custom written perl daemon that forwards lmtp 
deliveries from postfix to the appropriate backend server. It also does the 
spam scanning, virus checking and a bunch of other in house stuff.

Servers - We use servers with attached SATA-to-SCSI RAID units with battery 
backed up caches. We have a mix of large drives for the email spool, and 
smaller faster drives for meta-data. That's the reason we sponsored the 
metapartition config options 
(http://cyrusimap.web.cmu.edu/imapd/changes.html).

Replication - We initial started with pairs of machines, half of each being 
a replica and half a master replicating between each other, but that meant 
on a failure, one machine became fully loaded with masters. masters take a 
much bigger IO hit than replicas. Instead we went with a system we calls 
"slots" and "stores". Each machine is divided into a set of "slots". "slots" 
from different machines are then paired as a replicated "store" with a 
master and replica. So say you have 20 slots per machine (half master, half 
replica), and 10 machines, then if one machine fails, on average you only 
have to distribute one more master slot to each of the other machines. Much 
better on IO. Some more details in this blog post on our replication 
trials... http://blog.fastmail.fm/?p=576

Yep, this means we need quite a bit more software to manage the setup, but 
now that it's done, it's quite nice and works well. For maintenance, we can 
safely fail all masters off a server in a few minutes, about 10-30 seconds a 
store. Then we can take the machine down, do whatever we want, bring it back 
up, wait for replication to catch up again, then fail any masters we want 
back on to the server.

Unfortunately most of this software is in house and quite specific to our 
setup, it's not very "generic" (e.g. it assumes particular disk layouts and 
sizes, machines, database tables, hostnames, etc) to manage and track it 
all, so it's not something we're going to release.

Rob