Future plans for replication

Julien Coloos julien.coloos at atos.net
Fri Aug 12 11:59:02 EDT 2011


Le 12/08/2011 13:37, Bron Gondwana a écrit :
> On Fri, 12 Aug 2011 11:16:30 +0200, Julien Coloos 
> <julien.coloos at atos.net> wrote:
>
>> 1) Master-master replication
>> There are many difficulties to overcome for this feature, as well as 
>> many ways to achieve it:
>>    - 2-servers only, multi-master, ... ?
>
> N servers - mesh, chain, whatever.  Ideally any layout will work.
>
> My plan (which is partially done) is named channels for sync - basically
> the name of the server you're connecting to.  When you make a change,
> it injects a record into the sync log for each channel you have 
> configured.
> The sync_client for that channel then runs a reconciliation of every
> listed item.
>

Nice, no wonder I though that this channel stuff I saw in the source 
code could be used to separate logs for different servers, since that 
was your intent :)
And proceeding that way allows to create mesh/chain topologies depending 
on the configuration.

> If sync_server makes a change, is should suppress the sync_log for the
> channel on which the change came in - so it doesn't try to replicate
> back.  But it doesn't hurt too badly, because the data will be in sync
> anyway, in theory...
>

Yeah, thinking about it it's not easy to prevent unnecessary 
reconciliations when topologies get complex. Fortunately propagation 
between servers quickly ends once they see they are in sync. And people 
who do care could use a chain topology to limit this.

>>    - optimistic replication ?
>
> There are some things that I want to add to this, in particular a
> "replication cache" which remembers the state of the mailbox at
> last sync, allowing a sync_client to optimistically generate a
> "change request" to the sync_server, listing all the changes that
> would bring the target mailbox up-to-date, and also listing the
> current state it EXPECTS on the server.  That way we can check
> up-front if anything on the replica side has already changed, and
> fall back to a full replication run.
>
> The "full replication run" would first reconcile everything past the
> last known "in sync" modseq, and only if that failed would it
> send the entire mailbox contents.
>

Would that cache be per-server, or shared on the platform (kinda like 
the MUPDATE server) ?

>>    - since most meta-data do not (yet ?) have some kind of revision 
>> control, how to reconciliate when concurrent changes happen ?
>
> Add revision control to everything.  In particular:
>
> a) modseq to all annotations
> b) keep DELETED entries in mailboxes.db with the UIDVALIDITY of
>    the deleted mailbox (also on moves)
> c) version numbers and "deleted markers" for all sieve scripts
>

Fair enough.
b is indeed useful to determine whether the mailbox was created on 
replica while master was down (needs to be copied back), or if it was 
deleted on master and was not yet pushed to replica.
A little bit curious about those "deleted markers" :) What do you have 
in mind ?

>>      -> actually, should data be allowed to diverge naturally ? that 
>> is, are users allowed to hit any master for the mailbox, or should 
>> they be directed to *the* master reference of the mailbox (other 
>> masters being hit only when the reference is down)
>>    - ...
>
> They should be directed to the primary copy - IMAP and eventual
> consistency are not friends.  You can make it work with UID promotion
> logic, but flag merging will always be an inexact science unless we
> keep MODSEQ and LASTMODIFIED separately for every single flag, and
> the UID promotion still sucks when it happens to you.
>

I also think that this would be easier to handle than trying to manage 
concurrent modifications on all replicas as the nominal case (thinking 
about it already give me headaches).
Guess MODSEQ and LASTMODIFIED for every flag would be a little overkill 
to achieve better merging.

Talking about UID promotion (great idea by the way :)), we wonder if 
there could be some gain using a centralised service (ala MUPDATE; hence 
my previous question about the "replication cache") which would be used 
to store and give the next UID of mailboxes. This could prevent UIDs 
collisions, replacing UID promotion need in general, and should have 
less impact on IMAP clients with agressive caching.

>> 2) Dedicated (list of) replication server(s) per mailbox
>> For example there is the default server-wide replication 
>> configuration, but a new per-mailbox annotation could override that 
>> global setting.
>> One of the advantages of this feature is to allow to put the load on 
>> more than one server (which is the replica in current situation) when 
>> a master is down.
>
> I want to do this in mailboxes.db.  It's a lot of work, but extending the
> murder syntax to allow multiple "locations" to be specified for a mailbox
> including listing the preferred primary would allow all this to fit in
> naturally with murder.  And it would ROCK for XFER, because you would 
> just
> add the mailbox to a new server, replicate it, set the other end as the
> preferred master, and then once all clients had finished their business
> and been moved over, you could clean up the source server by removing the
> mailbox from its list.
>
> This is kind of the holy-grail of replication.
>

Eh, didn't think about murder. Then yes it would be more logical to have 
it in mailboxes.db.

>> 3) Handling messages partition that actually do not need replication
>> For example:
>>    - meta-data files are stored on local partitions, for faster 
>> access, and do need to be replicated
>>    - messages are stored on partitions shared with other masters and 
>> do not need to be replicated; think NFS access (or similar), with 
>> data already secured (e.g. RAID or proper filesystem replication)
>> That's a peculiar scenario, and we know that feature may be a little 
>> tricky (even more if trying to manage it in a generic way, taking 
>> into account each kind of data/meta-data). But maybe it would benefit 
>> to people other than us ?
>
> Goodness me.  I hadn't even thought of this.  Yes, it could be done.
> Actually the easiest way would be to check if the message already
> exists in the spool before replicating the data contents across.
>

Yes, and maybe also when receiving a new message on the replica while 
the master is down: if a file corresponding to the UID is already there, 
we could assume it is actually already used and will come later from the 
concerned server (unless UIDs are given by a central service). Well, 
that would certainly be harder than that to handle, but those are 
already a few pointers.

> I can see the use-case.  Very interesting.  Mmm.  It should be fairly
> easy to add a flag (possibly per partition/channel pair) for this.
>
>> In our case such a scenario would actually make sense, notably 
>> considering the number of mailboxes (tens of millions).
>> Different people have different needs for their architecture, and the 
>> more possibilities offered to cyrus users, the better. But not 
>> everything can be implemented, and we guess choices will have to be 
>> made.
>>
>> Comments and inputs are welcomed :)
>
> Let me know you need anything clarified in what I wrote there - that's
> kind of "off the cuff", but it's stuff I've been thinking about for a
> long time.
>
> Bron.
>

That's indeed really well though :) Thanks for your answers.
And as explained in the initial mail we would be happy to help you where 
possible, since we do have similar goals here.


Regards
Julien


More information about the Cyrus-devel mailing list