Distributed File Systems

Michael Fair michael at daclubhouse.net
Sun Oct 20 01:51:46 EDT 2002


Not to ruffle any feathers, but this approach isn't
philosophically any different from a DFS.  The only
difference here is that the DFS is created by both
servers accessing the same physical RAID drive rather
than by constantly sharing the FS data.

The extremely significant downside to this approach
is that the systems MUST be physically near each
other (i.e. only as far apart as the SCSI connection
allows) and you don't really get that much redundancy.

In my experience (all mileage varies) I end up dealing
with network outages far more often than I do server
failure.  So I consider a network failure my number
one priority.  Behind that are the disk drives, followed
by the power supplies, then lastly the motherboard and
its components.  This approach shares an external RAID
array across two machines.  So in other words, the
only things you are protecting yourself against are
the failure of the server components inside the box
(power supply, CPU, motherboard, SCSI controller, and
network card) and the disk drives.  These are only the
second-to-least likely things to fail.

Further, you've introduced yet another server and power
supply failure risk with the RAID array itself.  While
I am in favor of external RAID arrays for large drive
capacity scaling and high I/O throughput, I'm also
aware that I am usually introducing a single point of
failure.  You are lost if the array itself goes, or if
any of the other components connecting it to your
network fail.



The power supply in the server itself (the third most
likely component to fail) can be protected by simply
getting a dual redundant hot-swappable power supply on
the server.  I've bought several 2U rackmounts with
this feature, and if you're really desperate, Dell
makes a 1U with dual redundant power (the only ones
I've ever seen).  Of course, dual redundant power is
most effective when you can put each supply on a
separate circuit; it doesn't do as much good if both
are on the same circuit, because then you're only
protected against supply failure, not power loss, but
that's a separate issue.  There are also ways to solve
the source-of-power issue that I feel are beyond the
scope of this email, so I'll just stick to failures of
the supply itself for the rest of this document.



Redundant network interfaces are pretty standard on
server motherboards, so let's assume the box we bought
with the dual power has two interfaces, set up so that
if one network interface fails the other takes over.
A better approach is to point the two interfaces at
different physical network topologies, but that creates
other routing and fail-over problems, so again, I'll
just stick to physical interface failure for this email.
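
For what it's worth, here is a rough sketch of how that
kind of active/backup interface fail over can be set up
on a Linux box with the stock channel bonding driver.
The interface names and address below are made-up
examples, not part of the setup described above:

    # load the bonding driver in active-backup mode
    # (mode 1), checking link state every 100 ms
    modprobe bonding mode=1 miimon=100

    # bring up the logical interface and enslave both
    # physical NICs; if eth0 loses link, eth1 takes over
    ifconfig bond0 192.168.1.10 netmask 255.255.255.0 up
    ifenslave bond0 eth0 eth1

However it's done, the point is only that one logical
interface survives the loss of one physical port.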



Instead of the external SCSI RAID array the other
solution proposed, I'm going to use an internal
hot-swappable RAID 1 setup.
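
(If you'd rather not pay for a hardware controller, a
two-disk mirror can also be built with Linux software
RAID.  A minimal sketch, where the device names are
assumed examples:

    # create a two-disk RAID 1 mirror out of two partitions
    mdadm --create /dev/md0 --level=1 --raid-devices=2 \
        /dev/sda1 /dev/sdb1

    # watch the initial sync / rebuild status
    cat /proc/mdstat

You lose the easy hot-swap of a hardware drive cage, but
the protection against a single drive failure is the same.)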


So let's compare:

- 2 servers w/ dual power, dual CPU, and dual network +
  external RAID 5 also with dual power.
appx cost: $10,000

Redundancy level:
Protected against all failures except failure of the
RAID array's internal components.


- 1 server w/ dual power, dual CPU, and dual network +
  internal hot swappable SCSI RAID 1
appx cost: $3,000 (I've built these with IDE for $2,500)

Redundancy level:
Protected against all failures except failure of the
server's internal components.


Essentially you end up with the same amount of risk
for about 30% of the cost.  If anyone sees something
different and would like to correct me, please do.
Uptime is something I think we all take seriously, and
I would appreciate being corrected.  There are hot-swap
PCI technologies and card-slot servers that can push
the fault tolerance even higher, but I have not had
the pleasure of working on or pricing them.



Now I don't know about you, but all the data I've gathered
says that you are more likely to purchase a faulty CPU,
motherboard, or SCSI controller than you are to have one
fail on you.  For most organizations, that's an acceptable
risk.  For those that can't accept that risk: purchase two
of the servers I just described and add a third and fourth
network interface to act as a back channel between the two
machines, used for real-time drive updates and for
notification of primary server failure.  For $6,000 you've
then exceeded the redundancy of the original RAID 5 setup.
I leave setting up the pair for drive updates and fail
over as an exercise for the reader.
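
Since I'm leaving the details as an exercise, here is
just one possible starting point.  The host names and
paths are placeholders, and a periodic rsync is not
truly real time, but it shows the shape of the thing:

    # push the mail spool to the standby over the
    # private back channel
    rsync -a --delete /var/spool/imap/ \
        standby-backchannel:/var/spool/imap/

    # crude failure detection: alert the admin if the
    # primary stops answering on the back channel
    ping -c 3 primary-backchannel > /dev/null \
        || mail -s "primary imap down" admin@example.com \
           < /dev/null

A real deployment would want something smarter for both
the replication and the detection, but those are exactly
the two jobs the back channel is there to do.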



Now let's look at how much fault tolerance we've really
gotten.  Best case, we've spent the $6,000 for the
dual-server setup.  That leaves us protected against
everything except the number one most likely component
to fail, which is the network.  Now we could go spend
the rest of that original $10,000 on creating a fully
redundant network, but I believe a better solution is
geographic dispersion and hot fail over.


The problem with geographic dispersion is making sure
your hot fail over A) has as much recent data as
possible, B) can detect that it needs to fail over, and
C) can be taken advantage of by your end users.

A and B are exactly where distributed file systems
become useful.

But unfortunately, using a DFS (assuming a suitable
one can be found) solves only 2/3 of the problem.  The
other 1/3 is making sure that clients know how to get
to the new server once a failure occurs.

One less-than-optimal solution is to set your DNS TTL
to some insanely low number like 5 minutes (or 1 minute
for the ultra paranoid).  Then when the primary server
fails, you update the DNS and within a few minutes
everyone is running again.  This, however, doesn't let
you take any advantage of all that redundant hardware
you've invested in; it simply waits until it's called
upon.
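
For the curious, dropping the TTL is just a zone file
change.  A minimal BIND-style sketch, where the name and
addresses are made-up examples:

    $TTL 300    ; 5 minutes, so record changes propagate fast
    imap    IN  A   192.0.2.10    ; primary server
    ; after a failure, repoint the record and reload:
    ; imap  IN  A   192.0.2.20    ; standby server

The price you pay is a constant stream of DNS queries,
since nobody gets to cache the answer for very long.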

Ideally what you want is something where you can read
and write to any server and be guaranteed that all
others will reflect any changes made.  A DFS can help
with this problem, since the Cyrus server itself is
written in such a way as to expect multiple processes
trying to access the same set of files (the fact that
the other processes are on a different server is hidden
by the DFS).  But it can't solve the end-user fail-over
problem.

So at best what we get is that any end user could be
tied to any server in the set of available servers.
But there is no way for a client to automatically
switch to a secondary server without user intervention.
(The problem is the same whether it is a frontend
server or a backend server.  While theoretically a
frontend server could be made smart enough to search
a set of backend servers, you still have the same
problem if a frontend server becomes inaccessible.)

I'm not certain how to solve the end-user client dilemma.
Ideally I'd like to just give the client N IP addresses
in response to its DNS query and expect it to choose one
it can actually get to, failing only if it couldn't
contact any of them.  One way to simulate this would be
to make something like perdition smart enough to
understand a set of possible sources, and then put the
perdition server as close as possible to the end users
to maximize the amount of network failure it can
tolerate.  Another way (though not as easy) would be to
put software on the client itself and have it be the
smarts.  But that to me just seems like an unmanageable
solution, and it certainly won't work for ISPs, which
are frowned upon when they force their clients to
install software on their computers.
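
To make that last idea concrete, here is a rough sketch
of what the "smarts" would have to do, whether they live
in a proxy like perdition or on the client itself:
resolve every published address and walk the list until
one answers.  The host name and port are placeholders:

    import socket

    def connect_to_any(hostname, port=143, timeout=5):
        """Try every address published for hostname and
        return the first socket that connects; fail only
        if none of the addresses are reachable."""
        last_error = None
        for family, socktype, proto, _, addr in socket.getaddrinfo(
                hostname, port, socket.AF_UNSPEC, socket.SOCK_STREAM):
            s = socket.socket(family, socktype, proto)
            s.settimeout(timeout)
            try:
                s.connect(addr)
                return s              # first reachable server wins
            except OSError as err:
                last_error = err
                s.close()
        raise last_error or OSError("no usable address for " + hostname)

    # e.g. sock = connect_to_any("imap.example.com")

Nothing stops a proxy from doing exactly this today; the
hard part, as noted above, is getting that behavior into
the stock mail clients end users already run.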


Assuming the client side can be solved, I see hope in the
Cyrus Murder project being extended to allow backend
servers to be mirrors of each other.  Either that, or
testing and integration with a setup like CODA, which has
addressed many of these issues in detail.  Unfortunately,
CODA ultimately relies on user interaction to resolve
double-write conflicts, though it does support that
all-important disconnected operation.  Unfortunately
again, I don't think IMAP has anything in its protocol
that lets it ask an end user a question to help resolve
a conflict like that.



So to sum up, I think the single biggest hurdle in this
fault tolerance game is the end user's ability to be
redirected when the server they are used to talking to
fails.  Behind that is ensuring that multiple servers
have a consistent copy of the mail store.  Then behind
that is fault tolerance of the server itself, in the
order of disk drives, power, then motherboard components.



While I'm sure others may put things in a slightly
different order than I have, I'm pretty sure I hit
all the necessary points.  Please speak up if I've
missed one.


-- Michael --

----- Original Message -----
From: "Sebastian Hagedorn" <Hagedorn at uni-koeln.de>
To: "David Chait" <davidc at bonair.stanford.edu>
Cc: <info-cyrus at lists.andrew.cmu.edu>
Sent: Saturday, October 19, 2002 8:21 AM
Subject: Re: Distributed File Systems

-- David Chait <davidc at bonair.stanford.edu> is rumored to have mumbled on
Friday, 18 October 2002, 23:23 -0700, regarding Distributed File
Systems:

Hi,

>     Has anyone here looked into or had experience with Distributed File
> Systems (AFS, NFS, CODA, etc.) applied to mail partitions to allow for
> clustering or fail over capability of Cyrus IMAP machines? I have seen
> docs for splitting the accounts between machines, however this doesn't
> seem like the best fault tolerant solution.

distributed file systems don't work. Look here for a different approach:

<http://asg.web.cmu.edu/archive/message.php?mailbox=archive.info-cyrus&msg=17132>
--
Sebastian Hagedorn M.A. - RZKR-R1 (Flachbau), Zi. 18, Robert-Koch-Str. 10
Zentrum für angewandte Informatik - Universitätsweiter Service RRZK
Universität zu Köln / Cologne University - Tel. +49-221-478-5587




