Distributed File Systems

Tue Oct 22 17:15:01 EDT 2002

> The easiest way to have fault tolerance would be to 
> match up your IMAP servers in an active/active setup 
> where each IMAP server has another server that's 
> willing to take over if a failure occurs. 

As I mentioned earlier in this thread, this seems a
rather costly approach for what little redundancy you
get.  The only things you are protecting yourself
against are the disk drives and server components
inside the primary server box, namely the power supply,
CPU, motherboard, SCSI controller, and network card.
Your RAID array, your power source, and your network
paths remain the same in terms of failure points.

The power supply and network card can easily be
made redundant, and there is no requirement for
external RAID to get disk redundancy, so for all
practical purposes the only redundancy you are truly
adding with this setup is in the CPU, motherboard,
and SCSI controller.

Of all the things in the system, these three are the
ones I am the least concerned about.

Granted, I personally favor geographic distribution
using separate servers at physically separate sites
because I'm looking for fault tolerance in the form
of multi-master replication and failover.

So fault tolerance to me means being able to use a server
from a different site because the server at my primary
site went away (note that "site" could simply mean a
network segment).  I also always prefer to be able to
take advantage of any hardware actually turned on.  So
there wouldn't be any "spare" servers, just tolerant
"live" servers.  Hence the multi-master definition.

The folks working on the "Spread" toolkit (WAN group
communication services) had an excellent article on
creating a very efficient multi-master postgres DB.
The paper and concepts can be found here:
http://www.cnds.jhu.edu/pub/papers/cnds-2002-1.pdf
What I found extraordinary was that they were able to
maintain a very high level of throughput while
simultaneously guaranteeing correctness of the DB, in
that all transactions happened in the same order across
all locations.  So it should be possible to implement
a multi-master lock synchronization tool along the same
lines, to abstract Cyrus away from relying on file
system locking.
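
To make the idea concrete, here is a rough sketch of what I
mean, not anything Cyrus or Spread actually ships.  It assumes
the group communication layer hands every site the same lock
messages in the same order; a plain Python list stands in for
that delivery.  LockTable, replay, and the server and mailbox
names are all made up for illustration.

```python
from collections import deque


class LockTable:
    """One site's view of who holds which lock (hypothetical sketch)."""

    def __init__(self):
        self.holder = {}   # resource -> owner currently holding the lock
        self.waiters = {}  # resource -> deque of owners waiting, in FIFO order

    def apply(self, msg):
        """Apply one message from the totally ordered stream.

        msg is (op, owner, resource) with op "acquire" or "release".
        The update is deterministic, so every site that applies the
        same messages in the same order ends in the same state.
        """
        op, owner, resource = msg
        if op == "acquire":
            if resource not in self.holder:
                self.holder[resource] = owner
            elif self.holder[resource] != owner:
                queue = self.waiters.setdefault(resource, deque())
                if owner not in queue:
                    queue.append(owner)
        elif op == "release":
            # Ignore a release from anyone who is not the current holder.
            if self.holder.get(resource) == owner:
                queue = self.waiters.get(resource)
                if queue:
                    # Hand the lock to the longest-waiting owner.
                    self.holder[resource] = queue.popleft()
                    if not queue:
                        del self.waiters[resource]
                else:
                    del self.holder[resource]
        else:
            raise ValueError("unknown operation: %r" % (op,))


def replay(messages):
    """Build a lock table by applying messages in delivery order."""
    table = LockTable()
    for msg in messages:
        table.apply(msg)
    return table


# The same ordered stream, as it would be delivered to two sites.
ordered = [
    ("acquire", "imap1", "user.michael"),
    ("acquire", "imap2", "user.michael"),
    ("release", "imap1", "user.michael"),
]
site_a = replay(ordered)
site_b = replay(ordered)
print(site_a.holder == site_b.holder)   # True
print(site_a.holder["user.michael"])    # imap2
```

The point is that no site ever asks the file system for a
lock: each one just applies the agreed stream, so the hard
part (ordering, membership, partitions) is pushed down into
the group communication layer.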

What would be involved in making either the Cyrus
"master" process, or some other daemon, responsible
for the file locking?  Specifically, to support multiple
Cyrus processes operating on the same shared file
structure (via NFS, or a multi-server mount)?
Has a good description of the problem already been
posted to the list?

-- Michael --