Potential for data loss by running concurrent RENAMEs in cyrus-imapd 2.4

Sebastian Hagedorn Hagedorn at uni-koeln.de
Thu Feb 6 10:49:05 EST 2020


Hi,

let me preface the following by saying that I'm an idiot, and I don't blame cyrus-imapd in the slightest. However, maybe there is potential for improved error handling here, so I'm going to report what happened last night.

We are (still) running cyrus-imapd 2.4.20. I have written a script that moves users from one partition to another. It's been working fine for a long time. The script lists the users on the source partition, checks if they are active, and if they aren't, it executes "rename user/USER user/USER TARGETPARTITION".
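
For illustration only, the selection logic described above could be sketched roughly like this in Python. This is not the actual script: the function name `build_rename_commands`, the `is_active` callback, and the partition name used in the usage note are hypothetical, and the sketch only builds the command strings rather than sending them to the server.

```python
def build_rename_commands(users, is_active, target_partition):
    """Build one partition-move command per inactive user.

    `users` is the list of user names found on the source partition,
    `is_active` is a caller-supplied check (hypothetical), and
    `target_partition` is the name of the destination partition.
    """
    commands = []
    for user in users:
        if is_active(user):
            # Active users are left where they are.
            continue
        mailbox = f"user/{user}"
        # Same source and destination name, plus the target partition.
        commands.append(f"rename {mailbox} {mailbox} {target_partition}")
    return commands
```

Called with users "aaa" and "bbb", a check that reports only "bbb" as active, and an assumed partition name "imap4", this would yield the single command "rename user/aaa user/aaa imap4".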

Yesterday, I was confused and accidentally ran two instances of the script simultaneously overnight. This caused many error messages and the loss of three mailboxes. In most cases no harm appears to have been done. There are some "Mailbox already exists" messages where the first instance of the script had already begun to move a mailbox, but as far as I can tell those didn't do any damage. In the case of the three lost mailboxes, there are the following messages in the logfile:

Feb  6 02:03:06 xxx.rrz.uni-koeln.de imapv6[95539]: IOERROR: opening /var/spool/imap3/K/user/aaa/SOffice/Writer/172.: No such file or directory
Feb  6 02:03:06 xxx.rrz.uni-koeln.de imapv6[95539]: IOERROR: opening /var/spool/imap3/K/user/aaa/SOffice/Writer/172.: No such file or directory
Feb  6 02:36:39 xxx.rrz.uni-koeln.de imapv6[122884]: IOERROR: opening /var/spool/imap3/P/user/bbb/sent-mail/1.: No such file or directory
Feb  6 02:36:39 xxx.rrz.uni-koeln.de imapv6[122884]: IOERROR: opening /var/spool/imap3/P/user/bbb/sent-mail/1.: No such file or directory
Feb  6 04:49:39 xxx.rrz.uni-koeln.de imapv6[77903]: IOERROR: opening /var/spool/imap3/S/user/ccc/Templates/1.: No such file or directory
Feb  6 04:49:39 xxx.rrz.uni-koeln.de imapv6[77903]: IOERROR: opening /var/spool/imap3/S/user/ccc/Templates/1.: No such file or directory

/var/spool/imap3 is the source partition. My best guess is that for those three mailboxes the two concurrent renames ran at exactly the same time and hit some race condition in locking the mailbox(?). Anyway, these three mailboxes were just gone this morning. They were on neither the old partition nor the new one, but interestingly mailboxes.db lists them on the new partition. I have recreated the mailboxes on disk, so all is fine now.

Again, this is clearly my fault, but perhaps better error handling could avoid such scenarios anyway? Perhaps it's already fixed in newer releases? If so that's one more incentive to finally upgrade :-)

I have now added a locking mechanism to my script so that it shouldn't be possible to run two instances anymore.
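
One common way to build such a guard is an exclusive, non-blocking lock on a lock file. The following is a minimal sketch under that assumption, not the mechanism actually added to the script; the function name `acquire_single_instance_lock` and the lock file path are hypothetical, and `fcntl.flock` is only available on Unix-like systems.

```python
import fcntl


def acquire_single_instance_lock(lock_path):
    """Try to take an exclusive lock on `lock_path` without blocking.

    Returns the open file object holding the lock, or None if another
    holder already has it. The lock is released when the file is closed
    or the process exits.
    """
    lock_file = open(lock_path, "w")
    try:
        # LOCK_NB makes flock fail immediately instead of waiting.
        fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        lock_file.close()
        return None
    return lock_file
```

A script would call this once at startup, exit if it gets None, and otherwise keep the returned file object open for as long as it runs.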

Cheers
Sebastian
-- 
    .:.Sebastian Hagedorn - Weyertal 121 (Gebäude 133), Zimmer 2.02.:.
                 .:.Regionales Rechenzentrum (RRZK).:.
   .:.Universität zu Köln / Cologne University - ✆ +49-221-470-89578.:.