delayed delete

Bron Gondwana brong at fastmail.fm
Fri Nov 17 01:46:27 EST 2006


On Thu, Nov 16, 2006 at 04:44:48PM -0500, Wesley Craig wrote:
> I'm happy to supply C code to  
> implement these functions, once I've gotten some acknowledgment of  
> the current work.

I don't know about official Cyrus acknowledgement, but I'm very happy
to acknowledge anyone doing the heavy lifting to make this possible.
It's one of the few really annoying things we're seeing with Cyrus
replication.

As soon as I get some tuits I'm going to post all our patches again.
We have a couple more, including one to cyr_expire that checks whether
the cyrus.expunge file exists before mmapping the index and header
files, and that also adds an "-a" flag ("ignore annotations") so you
can skip having to stat the annotations DB multiple times per mailbox
when you don't have any annotations set.

> In our risk analysis, I think the lack of differentiation in the sync  
> protocol is actually a virtue.  If the sync protocol allowed the  
> primary backend to "really delete" data from the replica, then  
> operator error on the primary backend would be more likely to cause  
> unrecoverable data loss.  As it stands, unrecoverable data loss can  
> only be caused by operator malfeasance.  In all likelihood, tape  
> backups would be vulnerable to the same sort of malfeasance.  I'm  
> considering the sync protocol issue closed.

Definitely.  I much prefer the cleanups happening "per slot", as we call
it in our system.  We have between 8 and 18 "slots" per machine, each of
which is associated with a single store.  Every store has two slots,
which should (theoretically) be on different machines.  At any time one
of them is the master (configured via a database table), and the init
scripts use that information to decide whether to bring the slot up with
the master or replica config.  We then use IpAddr2 from the heartbeat
package to bind the master and replica IP addresses appropriately within
the init scripts and start or stop cyrus.  That way the failover process
is just "shut both down, change DB, start both up".

The shutdown code for a failover also scans the sync directory and runs
any logs it finds.  It refuses to cut over unless that's successful.  Of
course there's a "disaster mode" for when a machine is unavailable, but
usually we're using this so we can take a machine down for upgrades.

Also, all slots are on external SCSI-attached RAID units, each with two
partitions: data on large disks in RAID5 and meta on smaller high-speed
disks in RAID1.  That seems to balance the I/O out pretty well.

It turns out that once you let subversion, Template Toolkit and a
database do all the heavy lifting, running 18 separate instances of
Cyrus on a machine, with data partitions of at most 300GB per instance,
gives much better recoverability than anything else we've tried, because
a disk corruption or similar failure takes down only one instance - and
even a total machine failure doesn't hurt too badly, because each
machine's replicas "fan out" to 4 or 5 other machines, each with a
couple of slots that switch to master status.  It's much easier to
absorb the load hit that way.

Once we do something like you're talking about above, that's the last
gap that I'm concerned about where data loss is likely.

Bron.


More information about the Cyrus-devel mailing list