UC Davis Cyrus Incident September 2007
David Lang
david.lang at digitalinsight.com
Wed Oct 17 18:07:06 EDT 2007
>> ------------ Omen Wild (University of California Davis)
>> The root problem seems to be an interaction between Solaris' concept of
>> global memory consistency and the fact that Cyrus spawns many processes
>> that all memory map (mmap) the same file. Whenever any process updates
>> any part of a memory mapped file, Solaris freezes all of the processes
>> that have that file mmapped, updates their memory tables, and then
>> re-schedules the processes to run. When we have problems we see the load
>> average go extremely high and no useful work gets done by Cyrus. Logins
>> get processed by saslauthd, but listing an inbox either takes a long
>> time or completely times out.
>>
>> Apparently AIX also runs into this issue. I talked to one email
>> administrator that had this exact issue under AIX. That admin talked to
>> the kernel engineers at IBM who explained that this is a feature, not a
>> bug. They eventually switched to Linux which solved their issues,
>> although they did move to more Linux boxes with fewer users per box.
>
>Oh man... Horrible memories just flood right back... Wow. I was reading
>your e-mail and thinking to myself that this sounded like the same problem
>we had. Then I got to the above section and *bam*, there it was... We
>had significant problems with our e-mail last year (this year was a perfect
>start!) a week before students came back. We didn't resolve the problems
>until the end of September and we were dismayed at our final solution.
>
>We run Tru64 5.1b on a 4 member cluster. Tru64's kernel suffers from the
>same exact issue as described above. We regularly have 12,000 Cyrus procs
>running at any one time during the day, and that cluster also receives on
>average 300k-500k e-mails each day (that is after spam/virus work).
>
>What was finally identified was that the number of "processes" that were
>mapped to that single physical "executable" (/usr/cyrus/imapd) was causing
>a lot of lock contention in the kernel. The executable would have a linked
>list of all the processes running off of it in kernel memory. When one
>of the processes would go away, the kernel would start at the beginning
>of the list and search for the process in order to clean up its resources.
>During that time, the kernel would lock everything and execution would
>essentially stop for everything (basically, the whole system appeared to
>simply freeze on us). The kernel would reach a time threshold and stop
>in order to let other things happen (unfreeze). This time was very short,
>but if we had a lot of processes going away in a very short period of time,
>we would noticeably see the freeze, since the kernel was going into this
>lock-down mode a lot in a very short period of time. That is a simplified
>view of what really happened.
Could someone whip up a small test that could be used to check different
operating systems (and filesystems) for this concurrency problem?
It doesn't even need to use any Cyrus code (in fact, it would probably be better
if it didn't).
It sounds like there are a couple of different aspects to check:
1. A large number of copies of a single program running: find the impact of
   starting and stopping a process.
   1a. A single process that forks lots of copies.
   1b. A master process that execs lots of copies.
2. A large number of processes mmapping a single file.
   2a. The impact of adding or removing a process from this group.
   2b. The impact of modifying this file.
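A rough sketch of what a test for case 2 might look like, in Python and with no Cyrus code involved. Everything here is an assumption made for illustration: the function name mmap_contention, the scratch file, and the default process and write counts are made up, and the "attach" figure also includes the cost of fork itself.

```python
import mmap
import os
import signal
import tempfile
import time


def mmap_contention(nprocs=50, writes=2000, size=1 << 20):
    """Return timings (seconds) for case 2 with nprocs children mapping one file."""
    fd, path = tempfile.mkstemp()           # hypothetical scratch file
    os.ftruncate(fd, size)
    ready_r, ready_w = os.pipe()
    shared = mmap.mmap(fd, size)            # shared mapping (the default on Unix)

    # 2a (add): each child maps the file itself, touches it, reports, then idles.
    start = time.perf_counter()
    children = []
    for _ in range(nprocs):
        pid = os.fork()
        if pid == 0:
            try:
                view = mmap.mmap(fd, size)
                view[0]                     # fault in the first page
                os.write(ready_w, b"x")
                signal.pause()              # sleep until the parent kills us
            finally:
                os._exit(0)
        children.append(pid)
    mapped = 0
    while mapped < nprocs:
        mapped += len(os.read(ready_r, nprocs - mapped))
    attach = time.perf_counter() - start

    # 2b: modify one byte per page through the parent's mapping.
    start = time.perf_counter()
    for i in range(writes):
        shared[(i * mmap.PAGESIZE) % size] = i % 256
    shared.flush()
    write = time.perf_counter() - start

    # 2a (remove): terminate the children, which tears down their mappings.
    start = time.perf_counter()
    for pid in children:
        os.kill(pid, signal.SIGTERM)
        os.waitpid(pid, 0)
    detach = time.perf_counter() - start

    shared.close()
    os.close(ready_r)
    os.close(ready_w)
    os.close(fd)
    os.unlink(path)
    return {"attach": attach, "write": write, "detach": detach}
```

Running it once with nprocs=0 for a baseline and again with a few thousand processes, on each OS and filesystem of interest, should show whether the "write" time grows with the number of mappers.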
Personally, I expect 1a and 1b to be significantly different on different OSs.
Some OSs will gain huge memory savings in 1a due to copy-on-write (and
to partially account for this it may be worth making the program allocate a
chunk of RAM and write to it after the fork), while on other OSs the overhead of
multiple mappings of a page will dominate.
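For case 1, something along these lines might do as a starting point. It is only a sketch under assumptions: the name churn and its parameters are invented here, the exec variant simply re-runs the Python interpreter as a stand-in for a real daemon binary, and start and stop costs are timed together rather than separately.

```python
import mmap
import os
import sys
import time


def churn(nprocs=200, touch_bytes=1 << 20, use_exec=False):
    """Start nprocs short-lived children, reap them all, return elapsed seconds."""
    # Fill the buffer in the parent so its pages really exist before the fork;
    # the children's writes below then force actual copy-on-write copies.
    ballast = bytearray(b"x" * touch_bytes)
    start = time.perf_counter()
    pids = []
    for _ in range(nprocs):
        pid = os.fork()
        if pid == 0:
            try:
                if use_exec:
                    # 1b: replace the child with a fresh copy of one program.
                    os.execv(sys.executable, [sys.executable, "-c", "pass"])
                # 1a: dirty one byte per page of the inherited buffer.
                for off in range(0, touch_bytes, mmap.PAGESIZE):
                    ballast[off] = 1
            finally:
                os._exit(0)
        pids.append(pid)
    for pid in pids:
        os.waitpid(pid, 0)
    return time.perf_counter() - start
```

Comparing churn(use_exec=False) with churn(use_exec=True) at the same nprocs, and varying touch_bytes, would give a first idea of how much the copy-on-write effect matters on a given OS.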
David Lang