Miserable performance of cyrus-imapd 2.3.9 -- seems to be locking issues

Jeff Fookson jfookson at as.arizona.edu
Thu Feb 28 16:38:37 EST 2008


Folks-

I am hoping to get some help and guidance as to why our installation of
cyrus-imapd 2.3.9 is unusably slow. Here are the specifics:

The software is running on a 1.6GHz Opteron with 2GB of memory,
supporting a user base of about 400 users. The average rate of arriving
mail is on the order of 1-2 messages/sec. The active mailstore is about
200GB. There are typically about 200 'imapd' processes at a given time
and a hugely varying number of 'lmtpd' processes (from about 6 to many
hundreds during times of greatest pathology). System load is
correspondingly in the 2-15 range, but can spike to 50-70!

Our users complain that the system is extremely sluggish during the day,
when it is busiest.

The most obvious thing we observe is that both the lmtpds and the imapds
are spending HUGE amounts of time waiting on locks. Even when the system
load is only 1-2, an 'strace' attached to an instance of lmtpd or imapd
shows waits of upwards of 1-2 minutes to get a write lock, as shown by
the example below (this is from a trace of an 'lmtpd'):

[strace -f -p 9817 -T]
9817  fcntl(10, F_SETLKW, {type=F_WRLCK, whence=SEEK_SET, start=0, len=0}) = 0 <84.998159>
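
To quantify waits like this one across a longer capture, the trace can
be post-processed. The following is a minimal Python sketch under stated
assumptions: the trace has been saved to a file with one call per line
in the same format as the example (pid, call, result, then the -T
elapsed time in angle brackets). The names lock_waits and summarize and
the 1-second threshold are hypothetical, chosen only for illustration.

```python
import re

# Matches strace lines of the form shown above, e.g.
# 9817  fcntl(10, F_SETLKW, {type=F_WRLCK, ...}) = 0 <84.998159>
LOCK_RE = re.compile(
    r"^(?P<pid>\d+)\s+fcntl\((?P<fd>\d+),\s*F_SETLKW,\s*"
    r"\{type=(?P<type>F_\w+).*?\}\)"
    r"\s*=\s*(?P<ret>-?\d+).*?<(?P<secs>\d+\.\d+)>\s*$"
)


def lock_waits(lines):
    """Yield (pid, fd, lock_type, seconds) for each blocking fcntl lock call."""
    for line in lines:
        m = LOCK_RE.match(line.strip())
        if m:
            yield int(m["pid"]), int(m["fd"]), m["type"], float(m["secs"])


def summarize(lines, threshold=1.0):
    """Summarize lock waits; `threshold` (seconds) is an arbitrary cutoff for 'slow'."""
    waits = list(lock_waits(lines))
    slow = [w for w in waits if w[3] >= threshold]
    return {
        "calls": len(waits),                      # blocking lock calls seen
        "slow": len(slow),                        # calls at or above the threshold
        "total_secs": sum(w[3] for w in waits),   # total time spent in those calls
        "worst": max(waits, key=lambda w: w[3], default=None),
    }
```

Passing the lines of a saved trace (for instance one written with
strace's -o option) to summarize() gives the number of blocking lock
calls, how many exceeded the threshold, and the single worst wait.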

We strongly suspect that these long waits on locks are what is causing
the slowness our users are reporting.
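
The trace identifies the lock target only as file descriptor 10. On
Linux, the file behind a descriptor can be read from the symlinks under
/proc/<pid>/fd, which may help establish which file is being contended.
A minimal sketch, assuming a Linux /proc filesystem and sufficient
privileges to inspect the process; the helper name fd_target is
hypothetical:

```python
import os


def fd_target(pid, fd):
    """Return the path that descriptor `fd` of process `pid` refers to.

    Linux only: reads the symlink /proc/<pid>/fd/<fd>. Raises OSError if
    the process or descriptor no longer exists or access is denied.
    """
    return os.readlink(f"/proc/{pid}/fd/{fd}")
```

For the trace above, the call would be fd_target(9817, 10), made while
process 9817 is still running and still holds descriptor 10 open.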

We are under the impression that a single instance of cyrus-imapd scales
well up to about 1000 users (with about 1MB active memory per 'imapd'
process), and so we are baffled as to what might be going on.

A non-standard aspect of our installation which may have something to do
with the problem is that we are running cyrus on an lvm2 partition that
itself is running on top of drbd. Thinking that the remote writes to the
drbd secondary might be causing delays, we put the primary in
stand-alone mode so that the drbd layer was not doing any network
activity (the drbd link is running at gigabit speed on its own crossover
cable to the secondary box) and saw no significant change in behavior.
Any issues due to locking and the lvm2 layer would, of course, still be
present even with drbd's activity reduced to just local writes.

Can anyone suggest what we might do next to debug the problem further?
Needless to say, our users get extremely unhappy when trivial operations
in their mail clients take over a minute to complete.

Thank you for any thoughts or advice.

Jeff Fookson

-- 
Jeffrey E. Fookson, PhD			Phone: (520) 621 3091
Support Systems Analyst, Principal	jfookson at as.arizona.edu
Steward Observatory
University of Arizona


