mailboxes.db locking problem after updating from 2.4 to 2.5.9

Tue Nov 29 10:57:31 EST 2016

Well, I made another attempt to move to the 2.5 branch.
I did some preparation this time:

1. took 2.5.10 and applied the twoskip & skiplist patches from master:
[PATCH] cyrusdb: add "CYRUSDB_NOCOMPACT" open flag to avoid
[PATCH] twoskip: release the readlock in foreach every 256 misses
2. put mailboxes.db in /dev/shm
3. booted with the lazytime mount option and did some disk tuning
4. used the latest kernel and gcc to compile cyrus

I managed to update cyrus on Friday evening (using twoskip for mailboxes.db
and skiplist for the other tables, improved_mboxlist_sort:1) and it ran until
Monday morning.

It did not run well: the LIST command took up to 3 sec, where it takes ~0.5 sec
with skiplist on 2.4. At least there were no long locks.

CPU usage was strange: a few cores were loaded up to 100% from time to time
while the others sat at around 25-30%, with ~40 simultaneous IMAP connections,
mostly short ones from the webmail client. Overall load was 2-3 on an 8-core CPU.

Here is strace output from one of the imapd processes:
https://justpaste.it/10waf
It looks like there is an issue with finding the last record in mailboxes.db:
that took a lot of time and many locking attempts.

Things went really bad on Monday at 10am, when the number of connections
increased to 120 and kept growing.
imapd processes started to lock up with 100% CPU usage.
I tried limiting the number of imapd processes to 100; that did not help.
I tried to convert to skiplist, without success:
Nov 28 12:04:29 srv1 imap[15926]: skiplist: longlock
/var/imap/mailboxes.db for 259.4 seconds
Nov 28 12:04:29 srv imap[15779]: skiplist: longlock
/var/imap/mailboxes.db for 263.1 seconds
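
For anyone who wants to summarise such longlock messages from syslog, a
minimal Python sketch follows. It assumes each message sits on a single line
in the log file (the lines above are wrapped only by the mail client) and
that the format matches the two samples shown; the function name is
illustrative, not part of cyrus.

```python
import re

# Pattern modelled on the two sample lines above; other cyrus versions
# may word the message differently (assumption).
LONGLOCK = re.compile(
    r"(?P<proc>\w+)\[(?P<pid>\d+)\]: (?P<backend>\w+): longlock "
    r"(?P<path>\S+) for (?P<seconds>\d+(?:\.\d+)?) seconds"
)


def longlocks(lines):
    """Yield (pid, backend, path, seconds) for every longlock line found."""
    for line in lines:
        match = LONGLOCK.search(line)
        if match:
            yield (
                int(match["pid"]),
                match["backend"],
                match["path"],
                float(match["seconds"]),
            )
```

Feeding it an open log file, e.g. `max(s for *_, s in longlocks(f))`, gives
the longest lock observed.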

output of atop:  http://pastebin.com/raw/QVm4hkK8
vmstat: http://pastebin.com/raw/geP3NnqL

So I went back to 2.4 once again. Load dropped to ~2 on the 8-core CPU within
half an hour.

I ran a test on the same hardware with the same mailboxes.db file, running the
following commands in a loop, in parallel, with 400 concurrent sessions for
random users, for a few days without any performance degradation:
0 login
0 list "" "*"
0 CREATE $fldr
0 SELECT $fldr
0 logout

LIST mostly completed in 0.001 sec
load average: 49.70, 41.49, 38.72
# netstat -anp | grep EST | grep 143 | wc -l
248
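
A test loop like the one described above could be sketched in Python roughly
as follows. This is not the script that was actually used; the function
names, the "INBOX.loadtest" folder naming and the injectable `connect`
callable are all assumptions made for illustration. Only the IMAP command
sequence (login, LIST "" "*", CREATE, SELECT, logout) comes from the test
above.

```python
import imaplib
import random
import time
from concurrent.futures import ThreadPoolExecutor


def run_session(connect, user, password, folder):
    """Run one login/LIST/CREATE/SELECT/logout cycle.

    `connect` is any callable returning an imaplib.IMAP4-like object.
    Returns the wall-clock time of the LIST command in seconds.
    """
    conn = connect()
    try:
        conn.login(user, password)
        start = time.monotonic()
        conn.list('""', "*")
        list_seconds = time.monotonic() - start
        conn.create(folder)  # result is not checked in this sketch
        conn.select(folder)
    finally:
        conn.logout()
    return list_seconds


def run_load(connect, users, password, sessions, workers):
    """Run `sessions` cycles on `workers` threads; return all LIST timings."""

    def one(i):
        user = random.choice(users)
        # Folder name is hypothetical; adjust to the server's namespace.
        return run_session(connect, user, password, f"INBOX.loadtest{i}")

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(one, range(sessions)))
```

Calling `run_load(lambda: imaplib.IMAP4("localhost", 143), users, password,
sessions=4000, workers=400)` with a list of test accounts would approximate
400 concurrent sessions; the returned timings can then be compared between
2.4 and 2.5.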

It looks like the mailboxes.db locking problem is somewhere in the code, not in
the LIST command itself.
Maybe the new mailboxes.db traversal code has some pitfalls?

Bron, what are the major differences in mailboxes.db usage since 2.4?
I would like to do more tests; can you point me in the right direction?

Deniss

On 2016.11.18. 2:07, Bron Gondwana wrote:
> On Fri, 18 Nov 2016, at 10:51, Wolfgang Breyha via Info-cyrus wrote:
>> On 17/11/16 14:00, Deniss via Info-cyrus wrote:
>>> Any ideas or suggestion for investigation ?
>>
>> I already filed a bug
>> https://github.com/cyrusimap/cyrus-imapd/issues/43
>> but no response so far. I directly asked Bron, but no response as well.
> 
> Sorry, I really don't have a clue.  2.5 does have a different mailboxes.db format, so it's a bit more CPU intensive.  The real massive win for CPU usage is going to come with reverse ACLs:
> 
> https://blog.fastmail.com/2015/12/05/reverse-acls-making-imap-list-fast/
> 
> But to get there, we need to solve reverse ACLs for groups.  I did ask about it here:
> 
> https://lists.andrew.cmu.edu/pipermail/info-cyrus/2015-November/038628.html
> 
> But then didn't follow up to add group reverse ACL support in Cyrus, so reverse ACLs are broken if you're using groups.
> 
> Bron.
>