Future plans for replication

Fri Aug 12 07:37:39 EDT 2011

On Fri, 12 Aug 2011 11:16:30 +0200, Julien Coloos <julien.coloos at atos.net>  
wrote:

> Hi there,
>
> Are there any known plans for the future of replication inside cyrus ?
> There are a few things we dream of having, and that we consider  
> implementing. But if some of those do overlap with cyrus roadmap, maybe  
> we could help you by sharing the work ?
>
> In no particular order:
>
> 1) Master-master replication
> There are many difficulties to overcome for this feature, as well as  
> many ways to achieve it:
>    - 2-servers only, multi-master, ... ?

N servers - mesh, chain, whatever.  Ideally any layout will work.

My plan (which is partially done) is named channels for sync, where a
channel is basically the name of the server you're connecting to.  When
you make a change, it injects a record into the sync log for each channel
you have configured.  The sync_client for that channel then runs a
reconciliation of every listed item.

If sync_server makes a change, it should suppress the sync_log for the
channel on which the change came in - so it doesn't try to replicate
back.  But it doesn't hurt too badly, because the data will be in sync
anyway, in theory...
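
A rough sketch of the channel idea in Python, not the actual Cyrus
code: the SyncLog class, its method names and the channel names are
all made up for illustration.  It shows a change being queued once per
configured channel, with the originating channel skipped when the
change came in through sync_server.

```python
class SyncLog:
    """Hypothetical per-channel sync log: one pending list per channel."""

    def __init__(self, channels):
        self.pending = {name: [] for name in channels}

    def log_change(self, item, came_from=None):
        # Queue the item on every configured channel except the one the
        # change arrived on, so it is not replicated straight back.
        for name, queue in self.pending.items():
            if name != came_from and item not in queue:
                queue.append(item)

    def drain(self, channel):
        # What the sync_client for this channel would reconcile next run.
        items, self.pending[channel] = self.pending[channel], []
        return items


log = SyncLog(["replica1", "replica2"])
log.log_change("user.alice")                      # local change
log.log_change("user.bob", came_from="replica1")  # applied by sync_server
```

Here "user.bob" is queued only for replica2.  Dropping the came_from
check would still converge, it would just cost one redundant
reconciliation on replica1.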

>    - optimistic replication ?

There are some things that I want to add to this, in particular a
"replication cache" which remembers the state of the mailbox at
last sync, allowing a sync_client to optimistically generate a
"change request" to the sync_server, listing all the changes that
would bring the target mailbox up-to-date, and also listing the
current state it EXPECTS on the server.  That way we can check
up-front if anything on the replica side has already changed, and
fall back to a full replication run.

The "full replication run" would first reconcile everything past the
last known "in sync" modseq, and only if that failed would it
send the entire mailbox contents.
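
To make the optimistic path concrete, here is a minimal Python sketch.
Every name in it (Mailbox, ChangeRequest, build_request, apply_request,
sync) is hypothetical, the "replication cache" is reduced to a single
cached modseq, and the fallback is only a stand-in: it skips the
"reconcile past the last in-sync modseq" stage and copies everything.

```python
from dataclasses import dataclass, field


@dataclass
class Mailbox:
    highestmodseq: int = 0
    messages: dict = field(default_factory=dict)  # uid -> (modseq, flags)


@dataclass
class ChangeRequest:
    expected_modseq: int  # state the client EXPECTS on the replica
    changes: dict         # uid -> (modseq, flags) newer than the last sync


def build_request(master, cached_modseq):
    # Only records changed since the cached "in sync" modseq are listed.
    changes = {uid: rec for uid, rec in master.messages.items()
               if rec[0] > cached_modseq}
    return ChangeRequest(cached_modseq, changes)


def apply_request(replica, req):
    # Replica side: refuse up-front if anything changed since last sync.
    if replica.highestmodseq != req.expected_modseq:
        return False
    replica.messages.update(req.changes)
    replica.highestmodseq = max(
        [replica.highestmodseq] + [m for m, _ in req.changes.values()])
    return True


def sync(master, replica, cached_modseq):
    if apply_request(replica, build_request(master, cached_modseq)):
        return "optimistic"
    # Stand-in for the full replication run (last resort only).
    replica.messages = dict(master.messages)
    replica.highestmodseq = master.highestmodseq
    return "full"
```

The point of the expected state is that the replica can reject the
request cheaply, before any message data has been sent.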

>    - since most meta-data do not (yet ?) have some kind of revision  
> control, how to reconciliate when concurrent changes happen ?

Add revision control to everything.  In particular:

a) modseq to all annotations
b) keep DELETED entries in mailboxes.db with the UIDVALIDITY of
    the deleted mailbox (also on moves)
c) version numbers and "deleted markers" for all sieve scripts

>      -> actually, should data be allowed to diverge naturally ? that is,  
> are users allowed to hit any master for the mailbox, or should they be  
> directed to *the* master reference of the mailbox (other masters being  
> hit only when the reference is down)
>    - ...

They should be directed to the primary copy - IMAP and eventual
consistency are not friends.  You can make it work with UID promotion
logic, but flag merging will always be an inexact science unless we
keep MODSEQ and LASTMODIFIED separately for every single flag, and
the UID promotion still sucks when it happens to you.
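
For what per-flag tracking would buy, here is a small Python sketch.
It is an assumption-heavy illustration, not Cyrus behaviour: the
(modseq, lastmodified, is_set) tuple layout and merge_flags are
invented, and it assumes modseqs from both copies can be compared
directly, with lastmodified as the tie-breaker.

```python
def merge_flags(a, b):
    """Merge two maps of flag -> (modseq, lastmodified, is_set)."""
    merged = {}
    for flag in a.keys() | b.keys():
        candidates = [s for s in (a.get(flag), b.get(flag)) if s is not None]
        # Newest change wins per flag: compare modseq, then lastmodified.
        merged[flag] = max(candidates, key=lambda s: (s[0], s[1]))
    return merged


copy_a = {"\\Seen": (5, 100, True), "\\Flagged": (7, 120, False)}
copy_b = {"\\Seen": (6, 110, False), "$Junk": (4, 90, True)}
merged = merge_flags(copy_a, copy_b)
```

With a single modseq per message, one whole flag set would have to
win; here the later \Seen change from copy_b and the \Flagged change
from copy_a both survive.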

> 2) Dedicated (list of) replication server(s) per mailbox
> For example there is the default server-wide replication configuration,  
> but a new per-mailbox annotation could override that global setting.
> One of the advantages of this feature is to allow to put the load on  
> more than one server (which is the replica in current situation) when a  
> master is down.

I want to do this in mailboxes.db.  It's a lot of work, but extending the
murder syntax to allow multiple "locations" to be specified for a mailbox
including listing the preferred primary would allow all this to fit in
naturally with murder.  And it would ROCK for XFER, because you would just
add the mailbox to a new server, replicate it, set the other end as the
preferred master, and then once all clients had finished their business
and been moved over, you could clean up the source server by removing the
mailbox from its list.
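
The XFER sequence above, sketched in Python.  MailboxEntry, xfer and
cleanup are hypothetical names, the record layout is not the real
mailboxes.db format, and replicate is whatever actually copies the
mailbox between servers.

```python
from dataclasses import dataclass, field


@dataclass
class MailboxEntry:
    locations: list = field(default_factory=list)  # servers holding a copy
    primary: str = ""                              # preferred primary


def xfer(entry, dest, replicate):
    source = entry.primary
    entry.locations.append(dest)  # 1. add the mailbox to the new server
    replicate(source, dest)       # 2. replicate it
    entry.primary = dest          # 3. make the new server the preferred master
    return source


def cleanup(entry, old):
    # 4. once all clients have moved over, drop the source from the list
    entry.locations.remove(old)


entry = MailboxEntry(["imap1"], "imap1")
copied = []
old = xfer(entry, "imap2", lambda src, dst: copied.append((src, dst)))
```

Between steps 3 and 4 both servers are listed, so clients still
connected to the old server keep working until cleanup(entry, old)
is run.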

This is kind of the holy grail of replication.

> 3) Handling messages partition that actually do not need replication
> For example:
>    - meta-data files are stored on local partitions, for faster access,  
> and do need to be replicated
>    - messages are stored on partitions shared with other masters and do  
> not need to be replicated; think NFS access (or similar), with data  
> already secured (e.g. RAID or proper filesystem replication)
> That's a peculiar scenario, and we know that feature may be a little  
> tricky (even more if trying to manage it in a generic way, taking into  
> account each kind of data/meta-data). But maybe it would benefit to  
> people other than us ?

Goodness me.  I hadn't even thought of this.  Yes, it could be done.
Actually the easiest way would be to check if the message already
exists in the spool before replicating the data contents across.

I can see the use-case.  Very interesting.  Mmm.  It should be fairly
easy to add a flag (possibly per partition/channel pair) for this.
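
Something like the following Python sketch, where replicate_message,
the shared flag and the "<uid>." file naming are assumptions made for
illustration rather than existing sync code.  With the flag set, the
message body is only fetched when the file is not already visible in
the spool.

```python
import os
import tempfile


def replicate_message(spool_dir, uid, fetch_body, shared=False):
    """Write a message file unless shared storage already provides it."""
    path = os.path.join(spool_dir, f"{uid}.")
    if shared and os.path.exists(path):
        return False  # already in the spool, skip the data transfer
    with open(path, "wb") as f:
        f.write(fetch_body())
    return True


with tempfile.TemporaryDirectory() as spool:
    first = replicate_message(spool, 42, lambda: b"body", shared=True)
    second = replicate_message(spool, 42, lambda: b"body", shared=True)
```

The first call writes the file and the second finds it and skips the
copy.  The metadata would still be replicated as usual.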

> In our case such a scenario would actually make sense, notably  
> considering the number of mailboxes (tens of millions).
> Different people have different needs for their architecture, and the  
> more possibilities offered to cyrus users, the better. But not  
> everything can be implemented, and we guess choices will have to be made.
>
> Comments and inputs are welcomed :)

Let me know if you need anything clarified in what I wrote there - that's
kind of "off the cuff", but it's stuff I've been thinking about for a
long time.

Bron.
