switch to cyrus murder (aggregator) feedback

Bron Gondwana brong at fastmail.fm
Mon Sep 22 08:51:57 EDT 2014


On Mon, Sep 22, 2014, at 09:20 PM, Michael Menge wrote:
> Hi,
> 
> 3 weeks ago we changed our cyrus imap servers from standalone
> systems to a cyrus murder cluster. We have ~44000 accounts,
> ~457000 mailboxes, and 2x6.5 TB of mail.
> 
> In our previous setup we had 6 cyrus imap 2.4.17 servers running as KVM
> VMs with 8 GB memory and 4 cores each, on an HP Blade center (G7 Blades).
> Each server was running 2 cyrus instances: one master system and one replica
> of one of the other servers. We used DNS cnames to distribute our users to
> our servers. The filesystems are stored on two Infortrend iSCSI RAIDs, so
> that the replica is not on the same iSCSI system as the master.
> 
> In our new setup each server is running 3 - 4 cyrus instances:
> one frontend, one backend, one replica and, on one of the servers,
> the cyrus mupdate master. ClusterIP is used to distribute the access
> to our frontend instances. The backends and replicas are only listening
> on private IPs.
> 
> If one server goes down, we will switch that ClusterIP bucket to one
> of the other servers, and we will restart the replica as backend by changing
> the config and swapping the IP of the replica with the IP of the backend. This
> is much faster than updating the mailbox location of all the affected
> mailboxes.
> 
> If the mupdate master is down we start it on one of the other servers,
> using the mailboxdb of the frontend and running "ctl_mboxlist -m -a"
> on all backend instances.
> 
> Since the migration we discovered some small issues and some bugs.
> 
> 1. Usually Cyrus is not CPU bound. One exception is the mupdate master:
>     keeping encrypted connections to all frontends and establishing
>     new encrypted connections from the backends for every mailbox creation,
>     rename and removal was too much for the 4 cores, so we added 4 additional
>     cores to the VMs.

This sounds like another argument for making STARTTLS configurable, as in number 2...

> 2. Our frontend instances use IMAPS and POP3S and don't allow STARTTLS.
>     But we had to use IMAP and POP3 with STARTTLS on our backends, as
>     the frontends will always use STARTTLS over IMAP and POP3 to proxy
>     the connection.

That kind of sucks.  It should be a configuration option to control whether
frontends use STARTTLS.  I wonder if they're smart enough to not try it if
the server doesn't advertise it?  You could use suppress_capabilities in
that case.
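
To make the question concrete, here is a rough Python sketch of the check a
proxying frontend would need: only attempt STARTTLS when the backend's
CAPABILITY response lists it. This is an illustration of the idea, not the
actual Cyrus proxy code; parse_capabilities and should_starttls are
hypothetical names.

```python
def parse_capabilities(line):
    """Return the upper-cased capability atoms from an IMAP response line.

    Handles both the untagged form ("* CAPABILITY ...") and the
    greeting form ("* OK [CAPABILITY ...] text").
    """
    tokens = line.strip().split()
    upper = [t.strip("[]").upper() for t in tokens]
    if "CAPABILITY" not in upper:
        return set()
    start = upper.index("CAPABILITY") + 1
    caps = set()
    for raw, tok in zip(tokens[start:], upper[start:]):
        caps.add(tok)
        if raw.endswith("]"):
            # End of a bracketed response code; the rest is free text.
            break
    return caps


def should_starttls(caps):
    """Only try STARTTLS if the server actually advertises it."""
    return "STARTTLS" in caps
```

If the frontends behave like this, hiding STARTTLS on the backends via
suppress_capabilities would stop them from attempting it; if they issue
STARTTLS unconditionally, it would not help.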

> 3. We see more IOERRORs in our cyrus logs. In the standalone
>     cyrus imap, an IOERROR indicated a corruption in one of the cyrus files,
>     but that is not the case for the new errors we have found:
> 
>     a) "reading message: unexpected end of file". As far as I can tell,
>        this is triggered by the imap append command. I suspect it happens
>        when the connection between frontend and backend is lost or the
>        frontend dies during upload of the message.

Yeah, I'm not a fan of this.  It happens in non-murder too.

>     b) "opening index %s: Invalid mailbox name". The mailbox name seems to
>        be fine in most cases. I have only figured out why the mailbox
>        name was considered invalid in one case (the string "Posteingang"
>        was translated by the client, and the name "INBOX" is reserved).

If you give me some other names, I might be able to see why...

>     It would help if the string IOERROR were not used in these cases,
>     and if the mailbox name were always logged consistently with the
>     unixhierarchysep option.

The mailbox names will not be changed.  The format you see is the internal format,
and it also includes the domain!user. prefix if you are domain split.  If anything,
I would change more of the tools to use that format, because it's exact.  The
algorithm to convert between them is easy enough if you need it for display
purposes, but for low-level debugging, exactness matters more.
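
As a minimal sketch of that conversion in Python, under stated assumptions:
"." is the internal hierarchy separator, "^" stands for a literal dot inside
a name when unixhierarchysep is on, a leading "domain!" marks the domain, and
the external form puts the domain last as "@domain" (the 2.4-style admin
view). The function names and the example mailbox are made up for
illustration.

```python
def internal_to_external(name, unixhierarchysep=True):
    """Convert e.g. 'example.com!user.jane.Sent' to 'user/jane/Sent@example.com'."""
    domain, bang, local = name.rpartition("!")
    if unixhierarchysep:
        # Internal '.' is the separator; internal '^' is a literal dot.
        local = local.replace(".", "/").replace("^", ".")
    return f"{local}@{domain}" if bang else local


def external_to_internal(name, unixhierarchysep=True):
    """Inverse of internal_to_external, under the same assumptions."""
    local, at, domain = name.rpartition("@")
    if not at:
        # No domain part: the whole string is the mailbox name.
        local, domain = name, ""
    if unixhierarchysep:
        # Protect literal dots first, then map '/' to the internal separator.
        local = local.replace(".", "^").replace("/", ".")
    return f"{domain}!{local}" if domain else local
```

For example, the internal name user.LoginID.Mail.drafts from the logs would
display as user/LoginID/Mail/drafts.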

> 4. Deleting a mailbox with delete_mode: delayed can create a corrupt
>     mailbox in the DELETED tree. In the logs we found the following:
> 
>     be/beimap[62020]: Rename: user.LoginID.Mail.drafts -> DELETED.user.LoginID.Mail.drafts.5416CD11
> 
>     be/beimap[62020]: MUPDATE: can't commit mailbox entry for 'DELETED.user.LoginID.Mail.drafts.5416CD11'
>     be/beimap[62020]: Deleted mailbox DELETED.user.LoginID.Mail.drafts.5416CD11

OOh - so that's a problem with murder specifically.  I wonder why it can't be committed.

>     and on the next cyr_expire run
> 
>     be/cyr_expire[144388]: IOERROR: opening index DELETED.user.LoginID.Mail.drafts.5416CD11: System I/O error

Yeah, makes sense.  Maybe it's being created with the wrong type fields.

>     in the filesystem DELETED/user/LoginID/Mail/drafts was an empty directory.
>     I couldn't find any hints why the mupdate master couldn't commit the
>     mailbox entry, but as "5416CD11" is the timestamp of the action, I am
>     certain that the mailbox did not exist in the mailboxdb before. And as
>     this only happens in some rare cases I suspect a race condition.

Smells like it from your description.
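
For anyone correlating these entries with other logs, the hex suffix can be
decoded as a Unix timestamp. A small Python sketch, assuming the suffix is
seconds since the epoch in hex (as the report says) and that the prefix is
the default "DELETED"; parse_deleted is a hypothetical helper name.

```python
from datetime import datetime, timezone


def parse_deleted(name, prefix="DELETED"):
    """Split 'DELETED.<mailbox>.<hex time>' into (mailbox, UTC datetime)."""
    head, _, stamp = name.rpartition(".")
    if not head.startswith(prefix + "."):
        raise ValueError(f"not a delayed-delete name: {name!r}")
    original = head[len(prefix) + 1:]
    # The suffix is assumed to be hex seconds since the Unix epoch.
    when = datetime.fromtimestamp(int(stamp, 16), tz=timezone.utc)
    return original, when


mailbox, when = parse_deleted("DELETED.user.LoginID.Mail.drafts.5416CD11")
print(mailbox, when.isoformat())
# user.LoginID.Mail.drafts 2014-09-15T11:27:13+00:00
```

That gives a one-second window to look for concurrent operations on the same
mailbox in the frontend, backend and mupdate logs.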

> 5. Some frontend imapd processes receive a SIGSEGV.
>     As this seems to happen in the libopenssl I asked on their mailing list,
>     but didn't receive an answer yet. At the end you will find a BT of the
>     core dump.

Ok...

> I would be glad if changes regarding the logging of IOERRORs
> and mailbox names were included in Cyrus 2.5.

The IOERROR for append could certainly be changed.  It's tricky because it's
deep in a library layer, but since it's the ONLY place I've ever seen that error
message, it could at least be changed to be more descriptive about probable
cause.
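
Until the messages change, a log filter can separate the IOERRORs described
in this thread from the ones that may indicate real corruption. A rough
Python sketch; the labels are made up, and the first two sample lines in the
usage are hypothetical, since the report only quotes the message text and
not full log lines.

```python
from collections import Counter

# Message fragments taken from the report, mapped to made-up labels.
PATTERNS = {
    "reading message: unexpected end of file": "aborted-append",
    "Invalid mailbox name": "invalid-name",
}


def classify(line):
    """Return a label for an IOERROR log line, or None for other lines."""
    if "IOERROR" not in line:
        return None
    for needle, label in PATTERNS.items():
        if needle in line:
            return label
    return "other"


def tally(lines):
    """Count IOERROR lines per label, ignoring non-IOERROR lines."""
    return Counter(label for label in map(classify, lines) if label)


sample = [
    # Hypothetical full lines built from the quoted message text.
    "be/beimap[1]: IOERROR: reading message: unexpected end of file",
    "be/beimap[2]: IOERROR: opening index user.x.INBOX: Invalid mailbox name",
    # Quoted in the report.
    "be/cyr_expire[144388]: IOERROR: opening index "
    "DELETED.user.LoginID.Mail.drafts.5416CD11: System I/O error",
    "be/beimap[62020]: Deleted mailbox DELETED.user.LoginID.Mail.drafts.5416CD11",
]
print(tally(sample))
```

Anything that lands in "other" is what still deserves a closer look.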

> Regarding 4. and 5.: are these known bugs? I could not find any matching
> entries in the bug tracker. If they are not known I would add them to
> the bug tracker.

I don't know about them.  Go ahead and add them; worst case we find a duplicate
and merge it.

Thanks for the report!  Lots of good detail there.

>     #1  0x00007fe5a839334f in ssl3_get_message (s=0x80e430, st1=8347825, stn=-1470427072, mt=<optimized out>, max=102400, ok=0x7fffcc974d08) at s3_both.c:522
>     #2  0x00007fe5a838ba0d in ssl3_get_key_exchange (s=0x0) at s3_clnt.c:1103
>     #3  0x00007fe5a838dff8 in ssl3_connect (s=0x80e430) at s3_clnt.c:316
>     #4  0x000000000046a177 in tls_start_clienttls (readfd=16, writefd=16, layerbits=0x7fffcc975104, authid=0x7fffcc975108, ret=0x7e1fa0, sess=0x7e1fa8) at tls.c:1311
>     #5  0x00000000004669f4 in do_starttls (s=0x7e16a0, tls_cmd=0x78a4d0 <imap_protocol+208>) at backend.c:201
>     #6  0x0000000000467217 in backend_authenticate (s=0x7e16a0, prot=0x78a400 <imap_protocol>, mechlist=0x7fffcc976468, userid=0x7f5c90 "REPLACED_LOGINID", cb=0x80de30, status=0x7fffcc976460) at backend.c:378

This all looks sane.  I can't see anything to suggest that it's passing NULL pointers
or anything, so it's probably an SSL library issue.  I don't know my way around libssl
well enough to be sure, though.

Bron.

-- 
  Bron Gondwana
  brong at fastmail.fm

