Problems with load balancing cluster on GFS

Jens Hoffrichter jens.hoffrichter at gmail.com
Thu Jun 5 16:03:11 EDT 2008


Hello everyone,

I hope this is the correct mailing list to post this problem on.

I'm seeing some weird behaviour with the pop3 daemon on a GFS HA
cluster with load balancing.

The general situation is as follows:

I have 3 servers here, each installed with CentOS 5.1 and the
latest RedHat cluster. Every server runs cyrus 2.3.12p2 from the
Invoca distribution.

The servers share two common partitions for data storage on a SAN,
one 1 GB partition mounted on /var/lib/imap, and one 1.2TB partition
mounted on /var/spool/imap. On the /var/lib/imap partition I have set
up the following directories so they point to individual directories
for each node: backup, proc and socket. The backup directory was made
per-node because some cron.daily jobs on the nodes locked each other
up during the night, rendering the cluster useless.

In front of the three backend servers is a load balancer, which
balances pop3, imap, lmtp and timsieved on a round robin basis to each
node.

The load balancer is used (or will be used ;) ) by two perdition
servers which connect to the pop or imap port on the LB, which
distributes them to a running node.

The idea behind this is that we can shut down any node without a
noticeable service interruption, and that we have only one backend
system instead of several. We want to migrate away from a murder based
setup, so any comments in that direction won't be very useful for me
at this stage ;)

The problematic behaviour I see at the moment:

I have migrated ~100 test mailboxes from the old backend system, and
I'm in the process of performing load tests on the new system to get
an impression of how it will perform, and whether we are on the right
track. Of the mailboxes, around 80 are empty, 10 are moderately filled
and 10 are filled to the maximum storage, which is about the
distribution we expect once the system goes live.

The load test is performed with jakarta-jmeter from apache.org, which
chooses one of the mailboxes and performs either a pop3 or an imap
login to the backend through the load balancer. The mix is roughly
5 pop3 logins for every imap login, at a rate of about 5 logins/sec.
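
For anyone who wants to approximate this mix without JMeter, here is a
minimal Python sketch of the 5:1 pop3/imap weighting and the pacing.
This is not my actual JMeter plan: the names pick_protocol and run_mix
are made up for illustration, and the login callable is a hypothetical
hook that would wrap poplib/imaplib pointed at the load balancer.

```python
import random
import time

POP3_WEIGHT, IMAP_WEIGHT = 5, 1  # 5 pop3 logins for every imap login
LOGINS_PER_SECOND = 5            # target rate from the test description


def pick_protocol(rng):
    """Choose "pop3" or "imap" with a 5:1 weighting."""
    return rng.choices(["pop3", "imap"], weights=[POP3_WEIGHT, IMAP_WEIGHT])[0]


def run_mix(login, mailboxes, total_logins, rng=None, sleep=time.sleep):
    """Call login(mailbox, protocol) repeatedly; return per-protocol counts.

    `login` is a hypothetical callable supplied by the caller, for example
    a thin wrapper around poplib/imaplib that connects to the load balancer.
    """
    rng = rng or random.Random()
    counts = {"pop3": 0, "imap": 0}
    for _ in range(total_logins):
        protocol = pick_protocol(rng)
        login(rng.choice(mailboxes), protocol)
        counts[protocol] += 1
        # Sequential pacing: this caps the rate at LOGINS_PER_SECOND, the
        # real rate is lower by however long each login takes.
        sleep(1.0 / LOGINS_PER_SECOND)
    return counts
```

The sketch runs the logins one after another, so a single hanging
login stalls the whole run, which is the same symptom I see in the
real test.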

Between 30 and 60 seconds into the test, the pop3d on one of the
backend servers, seemingly chosen at random, stops working. It still
accepts connections, but doesn't send a banner anymore. The load
balancer still considers it "working" (as the port is still open), so
one after another all my connections hit the malfunctioning server and
the test basically stalls.
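
To tell a hung pop3d apart from a healthy one, a check has to wait for
the greeting instead of only testing whether the port is open. Below
is a minimal Python sketch of such a check. It is not something the
load balancer runs; the function name pop3_banner_ok and the timeout
value are assumptions for illustration.

```python
import socket


def pop3_banner_ok(host, port=110, timeout=5.0):
    """Return True only if the server sends a "+OK" greeting in time.

    A pop3d that accepts the connection but stays silent makes recv()
    time out, which is reported as unhealthy.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.settimeout(timeout)
            banner = sock.recv(512)  # the greeting is a single short line
            if not banner.startswith(b"+OK"):
                return False
            # End the session cleanly instead of just dropping the socket.
            sock.sendall(b"QUIT\r\n")
            return True
    except OSError:  # refused, reset, or timed out
        return False
```

Run against each backend directly (not through the LB), this should
return False for exactly the node that has stopped sending banners.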

A restart of the cyrus service stops the problem for another 30 - 60
seconds. If I just stop the one offending server, so it won't be used
by the LB anymore, the test usually finishes without a problem.

At first I thought that this was a problem related to entropy, but it
persisted even after I turned off "allowapop" and removed everything
relating to TLS from the configuration (SSL/TLS will be handled
completely by perdition, so we don't need it).

My personal guess is that it is somehow related to the port tests by
the load balancer, as a connection from the load balancer is normally
the last thing I see in the log of the offending backend server. The
port tests are easily distinguishable, as the LB just opens a TCP
connection and instantly resets it before it reads any data from the
pop3d, not even waiting for a banner. After this happens, there are no
more log entries from pop3d, and none from the master about spawning
new pop3 processes.
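
To test this guess without the load balancer, the port test can be
imitated from any machine. The Python sketch below opens a TCP
connection and closes it with a reset (RST) instead of a normal FIN,
without reading the banner. It is my approximation of what the LB
does, not its actual code; the function name is made up, and the
struct layout of SO_LINGER used here assumes Linux.

```python
import socket
import struct


def tcp_probe_with_reset(host, port=110, timeout=5.0):
    """Connect, read nothing, and abort the connection with a TCP RST."""
    sock = socket.create_connection((host, port), timeout=timeout)
    # SO_LINGER with l_onoff=1 and l_linger=0 makes close() discard the
    # connection and send RST instead of going through the FIN handshake.
    # Assumption: struct linger is two C ints, as on Linux.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER,
                    struct.pack("ii", 1, 0))
    sock.close()
```

Calling this in a loop against a single backend while the load test
runs against the same node should show whether the resets alone are
enough to make pop3d go silent.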

My second guess was that it is related to locking, but the IMAP server
just continues to run fine, and doesn't have a problem.

At the moment, I'm running out of ideas where to look, and my
knowledge about cyrus debugging is quite limited (never had such a
problem before ;) ), so any ideas or pointers on how to debug the
problem would be appreciated.

Oh yes, I tried to strace the pop3d: the pop3d which generates the
last log entry normally receives a SIGPIPE, as the other end of the
connection is no longer connected to it.

It looks a bit like master doesn't recognize that there is a problem
with spawning new children, and assigns new connections to a
dysfunctional pop3d.

Any ideas, hints or questions will be greatly appreciated. If
information is missing, I will provide what I can :)

Thanks in advance!

Regards,
Jens

