UC Davis Cyrus Incident September 2007

Vincent Fox vbfox at ucdavis.edu
Tue Oct 16 18:39:52 EDT 2007


So here's the story of the UC Davis (no, not Berkeley) Cyrus conversion.....

We have about 60K active accounts, and another 10K that are forwards, etc.
Ten UWash servers were struggling to keep up with a load that in 2006 was
running around 2 million incoming emails a day, before spam dumpage, etc.

We were moving from this UWash setup, which was pretty grody: each chunk of
4K-5K users had to know which "color" server was theirs.   If you had an
account jsmith1, you could look up on our website that you connected to
yellow.ucdavis.edu or whatever.  Moving accounts was a giant PITA
because anyone we moved had to be notified they were moving to a new
"color".

We knew we had to do SOMETHING, and during an academic year you
politically CANNOT just plonk in a Cyrus Murder and say go, so we started
putting up new infrastructure to ease our eventual move.

1st STEP:  Perdition mail-proxies
=========================
We set up four Sun X2100s with RHEL4 running Perdition to answer
to mail.ucdavis.edu and redirect users to the right backend.  These are
in a load-balanced pool, and two can handle the load most days.  Initially we
pulled the redirects from a NIS map, and later from LDAP.
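
Conceptually the proxy just needs a user-to-backend map.  A flat-file
popmap shows the idea; we pulled the same data from NIS and later LDAP
rather than a file, and the entries below are made up, not our real ones:

    # /etc/perdition/popmap (illustrative sketch only)
    jsmith1: yellow.ucdavis.edu:143
    jdoe2:   green.ucdavis.edu:143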

2nd STEP:  LDAP infrastructure
========================
So we want all delivery lookups and Perdition redirect to come from
LDAP.  We created another load-balanced pool with 4 little Sun V210
running Sun Directory Server 5.2 with a pair as hubs and a pair as
consumers.  So far they have held up admirably to our needs.  The Sun
course for DS is great by the way for fine-tuning performance tips.
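
To give a feel for it, the per-user routing data looks something like the
following.  The attribute names (mailHost, mailRoutingAddress) are
illustrative, not necessarily our actual schema:

    ldapsearch -h ldap.ucdavis.edu -b "ou=people,dc=ucdavis,dc=edu" \
        "(uid=jsmith1)" mailHost mailRoutingAddress

    dn: uid=jsmith1,ou=people,dc=ucdavis,dc=edu
    mailHost: ms1.ucdavis.edu
    mailRoutingAddress: jsmith1@ucdavis.edu

Perdition reads the backend host out of an attribute like mailHost, and
the MX machines use the same record to decide where to deliver.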

3rd STEP:  MX pulling from LDAP
===========================
We modified our MX pool systems to pull from LDAP instead of NIS.
This went without incident, although we did see occasional lookup
errors until we tuned the LDAP servers to increase their thread counts.
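
The thread tuning was along these lines; from memory the knob is the
nsslapd-threadnumber attribute on cn=config, the value here is arbitrary,
the flags shown are OpenLDAP-client style, and I believe DS needs a
restart to pick it up:

    ldapmodify -h ldap1.ucdavis.edu -D "cn=Directory Manager" -W <<EOF
    dn: cn=config
    changetype: modify
    replace: nsslapd-threadnumber
    nsslapd-threadnumber: 64
    EOF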

4th STEP: Cyrus infrastructure creation
============================
After much consultation with other universities we decided on
Sun systems for the backend mail-stores.  We went with Sun T2000s,
each with two HBAs wired to separate SAN switches and dual 3510FC
arrays.  The intent was to have a Sun Cluster 3.2 setup in failover
mode so that if any single major component failed there would be no
service interruption.  We already had background in Sun HA and
similar, so it seemed like fewer man-hours than starting over with a
Linux-based solution and trying to cobble something similar together.

We went with ZFS as our filesystem for Cyrus storage,
and this has worked out well.  The snapshots, and the ability to survive
minor disk write errors in a mirrored setup like ours, let us all
sleep a lot better.  I recall six times in the prior year when UFS errors
gave us grief.
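
For anyone who has not used it, the mirrored-pool-plus-snapshot pattern is
just a few commands.  Device names, pool name, and schedule below are
placeholders, not our actual layout:

    # mirrored pool across the two 3510FC arrays (devices are examples)
    zpool create mailspool mirror c2t0d0 c3t0d0 mirror c2t1d0 c3t1d0
    zfs create mailspool/imap

    # cheap point-in-time copies of the mail spool
    zfs snapshot mailspool/imap@nightly-20071016
    zfs list -t snapshot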

We ran a truckload of benchmarks against our hardware using
SLAMD, and the results seemed to indicate we were in great shape for
30K-40K users per cluster.

5th STEP: Cyrus migration
====================

The politics of an educational environment are that you MUST do massive
changeouts like this during summer quarter.  So for the last couple of months
of summer we were busily migrating all the UWash users to Cyrus:
about 29K users to ms1, and 23K users to ms2.  Everything worked great,
with typically about 500 Cyrus processes running.

6th STEP:  The excrement hits the rotating blades
===================================

About a week before classes actually start is when all the kids start moving
back into town and mailing all their buds.  We saw process numbers go
from 500-ish to as high as 5,000.  Load would climb radically after passing
2,000 processes, and the systems became slow to respond.  This persisted for
4 days, with us on the phone with Ken & Jeff and anyone else who would
talk to us, trying to find the right tweaks for the Cyrus software.  We tried
moving to quota-legacy, using BDB for the delivery database, and a few other
suggested tweaks, but none brought us substantial relief.
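
Those tweaks map to imapd.conf settings along these lines; exact option
names and available backends depend on the Cyrus 2.3.x version, so treat
this as a sketch:

    # /etc/imapd.conf (excerpt, illustrative)
    quota_db: quotalegacy        # quota-legacy backend
    duplicate_db: berkeley       # BDB for the duplicate-delivery database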

We are now running as high as 4 million emails arriving a day, which
is about double last year's volume, with say 1-1.5 million being dumped by
the virus/spam filtering.  So about 2-2.5 million arrive at the backend
mailstores.

Meanwhile I was scavenging the bones of the UWash infrastructure and
rebuilding those boxes as Cyrus systems.  So we migrated some users BACK off
our big new boxes onto smaller ones.  The magic point seemed to be about 15K
users: below that, a T2000 was fine.  We are all Cyrus now, so the migrations
are considerably less difficult.

7th STEP:  Post-Mortem
===================
I don't know precisely what goes wrong, although we have lots of speculation.

I do know that the processes were piling up.  We have various candidates
for this in both the OS and the application.  There is some bottleneck on
a resource such that, once it reaches a certain level of busyness, everything
starts backing up.  No amount of dtrace or truss fiddling pinned it down
further than possible locking issues on the many databases.  Frankly, after
4 days, user pressure meant we ran out of time to dig completely to the
bottom of it.
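
For context, the fiddling was along the lines of generic starting points
like these (the pid is whichever imapd or lmtpd looks busiest); as noted,
they did not pin down the bottleneck for us either:

    # count system calls by name, box-wide
    dtrace -n 'syscall:::entry { @[probefunc] = count(); }'

    # syscall time summary for one busy Cyrus process
    truss -c -p <pid>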

I would caution large sites in the future that more than 10K users per
backend with a high email volume is heading into unknown territory.  We
have talked to a few sites carrying 30K and more users per system, but so
far all with much lower email activity levels.

We are actually having a Sun engineer on-site in a few days and will try
to see if we can pinpoint some issues, or at least find usable workarounds
on our hardware, such as Zones.  The theory is that 2 or 3 Zones on a T2000,
with say 10K users each, would still let us accommodate the same number of
users on the new hardware that we had originally targeted.
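
Carving a T2000 up that way would look roughly like this; the zone name,
zonepath, interface, and address are placeholders, not a finished plan:

    zonecfg -z ms1a
    zonecfg:ms1a> create
    zonecfg:ms1a> set zonepath=/zones/ms1a
    zonecfg:ms1a> add net
    zonecfg:ms1a:net> set physical=e1000g0
    zonecfg:ms1a:net> set address=192.0.2.10
    zonecfg:ms1a:net> end
    zonecfg:ms1a> commit
    zonecfg:ms1a> exit
    zoneadm -z ms1a install
    zoneadm -z ms1a boot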

I append the comments of one of our people with a local theory:

------------ Omen Wild (University of California Davis)
The root problem seems to be an interaction between Solaris' concept of
global memory consistency and the fact that Cyrus spawns many processes
that all memory map (mmap) the same file.  Whenever any process updates
any part of a memory mapped file, Solaris freezes all of the processes
that have that file mmapped, updates their memory tables, and then
re-schedules the processes to run.  When we have problems we see the load
average go extremely high and no useful work gets done by Cyrus.  Logins
get processed by saslauthd, but listing an inbox either takes a long
time or completely times out.

Apparently AIX also runs into this issue.  I talked to one email
administrator who had this exact issue under AIX.  That admin talked to
the kernel engineers at IBM, who explained that this is a feature, not a
bug.  They eventually switched to Linux, which solved their issues,
although they did move to more Linux boxes with fewer users per box.

We have supporting evidence in the fact that sar shows that the %sys CPU
time is 2-3x the %usr CPU time:
00:02:01    %usr    %sys    %wio   %idle
[ snip ]
11:17:01       3       7       0      90
11:32:02       3       7       0      90
11:47:01       3       7       0      90
12:02:01       3       7       0      91
12:17:01       3       7       0      90
12:32:01       3       6       0      91
12:47:01       3       6       0      91
13:02:01       3       6       0      92
13:17:01       3       6       0      91
13:32:01       3       6       0      92
13:47:02       3       6       0      92
14:02:01       3       6       0      91
14:17:01       3       7       0      90
14:33:54       2       4       0      94
14:47:01       4      10       0      86




