Implement Cyrus IMAPD in High Load Environment

Tue Sep 29 01:42:38 EDT 2009

On Mon, Sep 28, 2009 at 03:33:44PM -0700, Vincent Fox wrote:
> Bron Gondwana wrote:
> >I assume you mean 500 gigs!  We're switching from 300 to 500 on new
> >filesystems because we have one business customer that's over
> >150GB now and we want to keep all their users on the one partition
> >for folder sharing.  We don't do any murder though.
> >
> Oops yes.  I meant 500 gigs.  The potential downside of
> running an fsck on  terabyte+ filesystems is not worth
> the risks IMO.  The tremendous speed & efficiency of
> Cyrus is in its small files and the indexes.  However you
> have to keep that in mind when estimating not just backups
> and other daily/weekly items but more serious items.

For sure.

> Really I've looked at fsck too many times in my life and
> don't ever want to again.  Anyone who tells me "oh yes but
> journalling solved all that long ago...." will get an earful
> from me about how they haven't run a big enough setup
> with enough stress on it to SEE real problems.  I have seen
> both journalled Linux and logged Solaris filesystems turn up
> with data corruption and ended up staring at that fsck
> prompt wondering how many hours until it's done.....

Yep.  Which is why we treat filesystems as disposable :)
There are multiple real-time replicated copies of anything
we care about, so we can blow away a filesystem and just
recreate it.  Even after a successful fsck I might just
decide it's cheaper to recreate it than run a full sha1
checking audit_slot on the contents!
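
For anyone wondering what a full sha1 check of the contents might
look like, here is a minimal sketch.  It assumes you already have a
mapping of file name to expected sha1 hex digest from somewhere (how
you get that out of the index is not shown); the function names
sha1_of_file and audit_directory are made up for illustration and
are not part of Cyrus.

```python
import hashlib
from pathlib import Path


def sha1_of_file(path, chunk_size=1 << 20):
    """Return the sha1 hex digest of a file, reading it in chunks."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def audit_directory(directory, expected):
    """Return the sorted names that are missing or whose sha1 differs.

    `expected` maps file name -> expected sha1 hex digest.  It is a
    hypothetical input, not something Cyrus hands you in this form.
    """
    bad = []
    for name, digest in expected.items():
        path = Path(directory) / name
        if not path.is_file() or sha1_of_file(path) != digest.lower():
            bad.append(name)
    return sorted(bad)
```

Every byte on the partition gets read once, which is why this can
cost more than recreating the filesystem from a replica.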

> The antiquated filesystems that 99% of admins tolerate and
> work with every day should be lumped under some kind of
> Geneva provision against torture.  It's a mystery to me why
> it wasn't resolved years ago and why there isn't a big push
> for it from anyone.

Patents I suspect, at least partially.

> "It doesn't matter how fast it is, if it isn't CORRECT!" should
> be some kind of mantra for a production data center but it
> still seems the majority of my colleagues talk the same as in
> the 1980s about how if we turn off this or that safety feature
> we can make the filesystem faster.

Everything's a tradeoff, hey.  With enough checksums and
replication, I'm willing to treat every layer as less than
100% reliable, because that's reality.  I haven't heard too
many horror stories of ZFS recently, but we certainly hit
a bug where we needed a software update before we could
replace a failed disk, because ZFS refused to consider
anything plugged into the same controller again, even after
a reboot.  That was odd.

> OK stepping off my soapbox now.

It's an interesting one.  For real reliability, I want to
have multiple replication targets supported cleanly.  It's
not even that hard.  Basically you would chain sync_client
instances, such that there was an initial task that just
reads $conf/sync/log and appends the contents to both
$conf/sync/stream1/log and $conf/sync/stream2/log, then
a separate sync_client instance that operates in each of
$conf/sync/stream1 and $conf/sync/stream2, replicating to
separate backends.  This would involve minimal code changes
I suspect, and allow a replica to be offline while the
other two are up-to-date, and still know what needed syncing
when you turned it back on!
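
A rough sketch of that initial fan-out task, just to show the shape
of it.  This is not existing Cyrus code.  It assumes writers reopen
$conf/sync/log for each append, so renaming the file aside gives a
stable snapshot; the function name fan_out_sync_log and the work
file name log-<pid> are my own choices for illustration.

```python
import os
from pathlib import Path


def fan_out_sync_log(sync_dir, streams=("stream1", "stream2")):
    """Copy pending entries from sync_dir/log into each stream's log.

    Returns the number of log lines that were fanned out.
    """
    sync_dir = Path(sync_dir)
    source = sync_dir / "log"
    if not source.exists():
        return 0  # nothing pending

    # Rename the log aside so new entries go to a fresh file
    # (assumption: writers open the log by name for every append).
    work = sync_dir / f"log-{os.getpid()}"
    os.rename(source, work)
    data = work.read_bytes()

    for name in streams:
        stream_dir = sync_dir / name
        stream_dir.mkdir(parents=True, exist_ok=True)
        with open(stream_dir / "log", "ab") as out:
            out.write(data)
            out.flush()
            os.fsync(out.fileno())  # make sure it hit the disk

    # Only drop the snapshot once every stream has its copy.
    work.unlink()
    return data.count(b"\n")
```

Each stream directory then looks like an ordinary sync directory to
the sync_client pointed at it, and a stream whose replica is offline
just accumulates log entries until it comes back.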

Then we'd be able to bring up a new replica BEFORE removing
the old one.  It's like RAID1 with three disks :)  Add a new
one, remove the old.  Always 2 up-to-date copies.

Then add management tools to make that easy to start and stop!
It's an ongoing task to improve reliability.

I actually wonder if it's possible to have multiple Cyrus
instances running in a mesh.  Each one running a sync_server
and with sync_client instances running on every other one.
In THEORY so long as you only wrote to one at any one time
you could read from any of them, or even if you only had
connections for a single user happening to one at any one
time you'd be OK.  You could hash users amongst them to 
balance the load.
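
Hashing users amongst the instances could be as simple as the
sketch below.  The function name backend_for_user and the host names
in the usage line are hypothetical, and the choice of sha1 plus
modulo is just one easy option, not anything Cyrus does today.

```python
import hashlib


def backend_for_user(user, backends):
    """Pick the instance that takes all connections for `user`.

    The same user always maps to the same instance for a given
    backend list, so only one instance is written to per user.
    """
    if not backends:
        raise ValueError("need at least one backend")
    digest = hashlib.sha1(user.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(backends)
    return backends[index]


# Usage, with made-up host names:
# backend_for_user("vincent", ["imap1", "imap2", "imap3"])
```

Plain modulo hashing moves most users whenever the list of instances
changes, so a real setup would probably want a lookup table or
consistent hashing instead.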

Then - well, I already have checksums coded into index files,
just awaiting code review from Ken to push that upstream.
Along with sha1s, that's 99% of the data covered by
checksums.  Flat files (quota and the like) I don't think are
viable, but it might be possible to add checksums to skiplist
as well, at the expense of a format change.  Not sure about
BDB.  I'm not a giant fan of it anyway - at least how it's
being used in Cyrus.  All our DBs are skiplist now, and we're
pretty happy with it :)

Bron.