Cyrus with a NFS storage. random DBERROR

Sat Jun 9 00:19:05 EDT 2007

> I run it directly, outside of master.  That way when it crashes, it
> can be easily restarted.  I have a script that checks that it's
> running, that the log file isn't too big, and that there are no log-
> PID files that are too old.  If anything like that happens, it pages
> someone.

Ditto, we do almost exactly the same thing.
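
A check like that can be sketched roughly as follows. This is not the actual
script from either site. The directory layout (a live "log" file plus
"log-<PID>" work files in one sync directory), the thresholds, and the
function name find_problems are all assumptions for illustration, and how you
detect the running process is left to the caller.

```python
import time
from pathlib import Path


def find_problems(sync_dir, is_running, max_log_bytes=10_000_000,
                  max_stale_seconds=600, now=None):
    """Return a list of human-readable problems; an empty list means healthy.

    sync_dir   -- directory assumed to hold "log" and "log-<PID>" files
    is_running -- callable returning True if sync_client is alive
    """
    now = time.time() if now is None else now
    sync_dir = Path(sync_dir)
    problems = []

    # 1. Is the process there at all?
    if not is_running():
        problems.append("sync_client is not running")

    # 2. Has the live log grown too big (replication falling behind)?
    log = sync_dir / "log"
    if log.exists() and log.stat().st_size > max_log_bytes:
        problems.append(f"{log} is larger than {max_log_bytes} bytes")

    # 3. Are there work files that have been sitting around too long?
    for work in sorted(sync_dir.glob("log-*")):
        age = now - work.stat().st_mtime
        if age > max_stale_seconds:
            problems.append(f"{work} is {int(age)}s old")

    return problems
```

If the returned list is non-empty, that is the point where you would page
someone. For is_running, something like
lambda: subprocess.run(["pgrep", "-x", "sync_client"]).returncode == 0
would do on systems that have pgrep.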

Also, when we switch master/replica roles, our code stops the master, looks 
for any incomplete log files, and runs those first to ensure that 
replication is completely up to date.
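
The idea, as a minimal sketch rather than our production code: once the
master is stopped, replay every leftover log file, oldest first, before
swapping roles. The function name replay_leftover_logs and the file naming
("log-*" work files plus a live "log") are assumptions. The replay command is
passed in by the caller so nothing about sync_client's options is baked in.

```python
import subprocess
from pathlib import Path


def replay_leftover_logs(sync_dir, replay_cmd):
    """Run replay_cmd once per leftover log file, oldest first.

    replay_cmd is a list of arguments; the log file path is appended to it.
    Returns a list of (file name, exit code) pairs in the order they ran.
    """
    sync_dir = Path(sync_dir)

    # Interrupted work files first (oldest first), then the live log.
    leftovers = sorted(sync_dir.glob("log-*"),
                       key=lambda p: p.stat().st_mtime)
    live = sync_dir / "log"
    if live.exists():
        leftovers.append(live)

    results = []
    for path in leftovers:
        proc = subprocess.run([*replay_cmd, str(path)])
        results.append((path.name, proc.returncode))
    return results
```

A call might look like
replay_leftover_logs("/var/imap/sync", ["sync_client", "-r", "-f"]),
where both the path and the exact options are assumptions to check against
your own installation. Only go ahead with the role switch if every exit code
in the result is 0.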

Unfortunately, it seems anyone seriously using replication has to do these 
things manually at the moment. Replication just isn't reliable enough: we 
see sync_client bail out quite regularly, and there's not enough logging to 
pinpoint exactly why each time. I think there are certain race conditions 
that still need ironing out, because rerunning sync_client on the same log 
file that caused a bail out usually succeeds the second time. It would be 
nice if some code were made part of the core Cyrus distribution to make this 
all work properly, including switching master/replica roles.
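
Until then, the "run it again" workaround can be wrapped up along these
lines. This is only a sketch: run_with_retry is a made-up helper, the retry
count and delay are arbitrary, and it simply reruns whatever command it is
given while logging the exit code and stderr of each failed attempt.

```python
import logging
import subprocess
import time

log = logging.getLogger("sync_retry")


def run_with_retry(cmd, attempts=2, delay=1.0):
    """Run cmd until it exits 0 or the attempts are used up.

    Returns the number of the attempt that succeeded, or None if all failed.
    """
    for attempt in range(1, attempts + 1):
        proc = subprocess.run(cmd, capture_output=True, text=True)
        if proc.returncode == 0:
            return attempt
        # Keep a record of every bail out so there is something to go on.
        log.warning("attempt %d/%d of %s exited %d: %s",
                    attempt, attempts, cmd, proc.returncode,
                    proc.stderr.strip())
        if attempt < attempts:
            time.sleep(delay)
    return None
```

A return value of None is the case worth paging someone about, since a
second failure on the same log file is probably not just a race.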

Rob