Funding Cyrus High Availability
dpc22 at cam.ac.uk
Fri Sep 17 05:20:54 EDT 2004
On Fri, 17 Sep 2004, Paul Dekkers wrote:
> Isn't it possible to have equal roles? If all changes are put in some
> backlog, and a synchroniser process runs on both machines and pushes the
> backlog (as soon as there is any) to another machine... then you can
> have the some process on both (equal) servers... Of course there needs
> to be some more intelligence, but that's basicly what I would expect.
We have 16 servers: half the accounts on each system are master copies and
half are replicas. Each machine has a small database (a CDB lookup file)
to tell it whether a given account is master or slave. The replication
engine (which runs independently from the normal master spawned jobs)
bails out rapidly if the replica copy of an account is updated: it would
proceed to transform the master into a copy of the replica, but that's
probably not what you wanted :). I have a tool which allows me to switch
the master and replica copy for any (inactive) account without having to
shut anything down. This tool also lets me migrate data off onto a third
system and immediately create a replica of that. This makes upgrading
operating systems a much less fraught task.
> In my sketch above (really not sure if it works of course) where both
> have something like a backlog you can like "tail" that backlog and push
> the update as soon as possible to the second machine. You solve the
> thing you mention with delays while pushing updates to two servers at
> the same time.
Yes, that's exactly how my code works. Asynchronous replication (which Ken
called lazy replication) is fairly easy to do in Cyrus. Synchronous
replication, where you only get a response to an IMAP/POP/LMTP command
when the data is safely committed to the replica, would involve a much
more substantial rewrite of the Cyrus code.
That's where block based replication schemes like DRDB have a big
advantage: the state that they have to track is much less involved.
I'm currently running with a replication cycle of one second on my live
servers for "rolling" replication (that's just a name I made up, its not
an official term), so on average we would lose of half a second of update
traffic for 1/16th of our user base if a single system failed. Further
safeguards are possible by keeping copies of incoming mail for a short
time on the MTA systems, but that's not really a Cyrus concern.
We also replicate to a tape backup spooling engine overnight. The
replication engine is rather useful for fast incremental updates.
>>> If one server is down it should mean that all tasks can be performed
>>> at the other one. I 'm curious how this would look if both servers are
>>> still running but cannot reach eachother. If there is indeeed a UUID:
>>> what if there are doubles... but I guess that has been taken into
UUIDs are just a convenient representation of message text, so that you
can pass messages by reference rather than value. Duplicates don't matter
(though I don't believe that they actual occur given my allocation scheme)
so long as the message text is the same. I maintain databases of MD5
checksums for messages and cache text just to be on the safe side.
UUIDs were originally just Mailbox UniqueID + Message UID. Unfortunately,
UniqueID isn't very Unique: its just a simple hash of the mailbox name. I
ended up allocating UUIDs in large chunks from the master process on each
machine. If a process runs out of UUIDS (which would take some going as
they are allocated in chunks of 2**24), it falls back to call by value.
David Carter Email: David.Carter at ucs.cam.ac.uk
University Computing Service, Phone: (01223) 334502
New Museums Site, Pembroke Street, Fax: (01223) 334679
Cambridge UK. CB2 3QH.
Cyrus Home Page: http://asg.web.cmu.edu/cyrus
Cyrus Wiki/FAQ: http://cyruswiki.andrew.cmu.edu
List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html
More information about the Info-cyrus