Cyrus with a NFS storage. random DBERROR

Sun Jun 10 04:42:01 EDT 2007

> I suspect that the problem is with mailbox renames, which are not atomic
> and can take some time to complete with very large mailboxes.

I think there are some other issues as well. For instance, we still see 
skiplist seen-state databases get corrupted every now and then. It seems 
certain kinds of corruption can cause the skiplist code to call abort(), 
which terminates the sync_server and causes the sync_client to bail out. I 
had a backtrace from one of these the other day, but the stack frames were 
all wrong, so it didn't seem that useful.

> HERMES_FAST_RENAME:
>   Translates mailbox rename into filesystem rename() where possible.
>   Useful because sync_client chdir()s into the working directory.
>   Would be less useful in 2.3 with split metadata.

It would still be nice to do this anyway, to make renames faster. If you did:

1. Add new mailboxes to mailboxes.db
2. Filesystem rename
3. Remove old mailboxes

you would still end up with a race condition, but the window is far shorter 
than the mess you can end up with at the moment if a restart occurs during a 
rename.
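
To make the ordering concrete, here is a rough sketch of those three steps. 
It is only an illustration, not Cyrus code: a plain Python dict stands in for 
mailboxes.db, the function name fast_rename is made up, and each mailbox is 
assumed to be a single directory whose name is the mailbox name. I am also 
reading step 3 as "remove the old entry from mailboxes.db".

```python
import os


def fast_rename(mailboxes_db, old_name, new_name):
    """Rename a mailbox using the add / rename / remove ordering.

    mailboxes_db is a dict mapping mailbox name -> directory path.
    It is a hypothetical stand-in for mailboxes.db, not the real API.
    """
    old_path = mailboxes_db[old_name]
    new_path = os.path.join(os.path.dirname(old_path), new_name)

    # 1. Add the new mailbox entry before touching the filesystem.
    mailboxes_db[new_name] = new_path

    # 2. One filesystem rename of the whole directory, instead of
    #    moving the messages one by one.
    os.rename(old_path, new_path)

    # 3. Remove the old entry. Between steps 1 and 3 both names are
    #    listed, which is the short race window mentioned above.
    del mailboxes_db[old_name]
    return new_path
```

If a restart lands between steps 1 and 3, the worst case in this sketch is a 
stale or premature entry in the dict, which a startup check could detect by 
comparing each entry against what is actually on disk.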

> Together with my version of delayed expunge this pretty much guarantees 
> that things aren't moving around under sync_client's feet. Its been an 
> awful long time (about a year?) since I last had a sync_client bail out.
>
> We are moving to 2.3 over the summer (initially using my own original 
> replication code), so this is something that I would like to sort out.
>
> Any suggestions?

I can try to keep a closer eye on bailouts and see if I can get some more 
details. It would be nice if there was more logging about why the bailout 
code path was actually called!

Rob