Distributed File Systems

Michael Fair michael at daclubhouse.net
Sun Oct 20 02:58:13 EDT 2002


> Michael,
>     What about using Red Hat's clustering manager
> to mitigate the client side decision process. If
> the cluster is being managed server side, and a
> failure is noticed, in theory the cut-over would
> be made without affecting the virtual ip that
> clients are using to connect.

I've never used Red Hat's cluster product and only
recently heard about it.  My question is how wide can
the cluster be and still get good performance?

If you are limited to a single site and a single
physical network, then this type of technology
should absolutely work: no matter how "chatty"
the cluster is, there is a way to get the bandwidth
needed, and the "virtual IP" of the cluster will
always be successfully routed to the proper physical
location.

What I'm talking about though is geographic dispersion.

IP addresses, for all intents and purposes (with two
exceptions I'll talk about in a moment), are tied to
a relatively small geographic location.  When someone
on the internet asks for IP address 111.111.111.111
or address 222.222.222.222, you can track that IP
to a physical place.  So let's say your virtual IP
is 111.111.111.111 and all of a sudden the Internet
access to that IP fails.  Your wonderful cluster is
no longer accessible to any remote users because
your virtual IP can no longer be routed.  The only
way around that is multiple redundant internet
connections and BGP peering, which is not a typical
proposition, and they don't let just anyone become
a BGP peer (you can really FSCK things up when you
screw up).  However, in most situations you still end
up tying address 111.111.111.111 to the same physical
site, so if your site fails for whatever reason you're
still screwed.  It is possible using BGP to detect
site failure and redirect to a different site.
So multiple connections and BGP peering is one
way to get around a routing failure for address
111.111.111.111.


Another method I've just begun drawing up is using a
VPN relationship with a Tier 1 provider: you use one
of their IP addresses, which is almost assuredly going
to be available, and then have it redirected to one
of a set of backend servers.  This is essentially the
same as colocating a frontend server at the Tier 1
provider, just without the server.  This doesn't get
around the physical location problem, it just
circumvents it by creating a layer of indirection
at a place that is extremely unlikely to fail.
So if you could make this Tier 1 IP the "virtual ip"
for your cluster, then I'd say you're in pretty good
shape.  It just brings up the question of ensuring
that there is sufficient bandwidth to each
of the nodes located at each remote site.  It also
requires that the cluster be smart enough to use
the available servers efficiently, which it may very
well do, but methodologies have to change when you
can't guarantee a high bandwidth connection between
cluster nodes.


In my world email should be just as robust as DNS:
here are four servers, try each one until you succeed.
I haven't read all the specs regarding mail protocols,
so I can't be sure whether one of them already demands
that behavior.  If not, I would recommend its inclusion
in IMAPv5 and any other mail retrieval protocol.  The
client should expect to be given multiple IP addresses
and try each of them in succession.
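
Just to make that concrete, here is a minimal sketch of the
client behavior I mean, in Python (the host name, port, and
timeout are placeholders): resolve the name, then walk every
address DNS hands back until one connects.

    import socket

    def connect_any(host, port, timeout=10):
        """Try every address DNS returns for host until one accepts."""
        for family, socktype, proto, _, addr in socket.getaddrinfo(
                host, port, type=socket.SOCK_STREAM):
            sock = socket.socket(family, socktype, proto)
            sock.settimeout(timeout)
            try:
                sock.connect(addr)
                return sock          # first reachable server wins
            except OSError:
                sock.close()         # this address failed, try the next
        raise OSError("no server reachable for %s:%s" % (host, port))

    # Usage: imap_sock = connect_any("imap.example.com", 143)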


If that becomes true, then fully fault-tolerant email
becomes a solvable problem, as SMTP can already guarantee
delivery to one of the available sites, and once that's done
that site can provide the rest of the architecture
with the necessary information.  It's not an easy problem
to solve, but it is a solvable problem.
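
For what it's worth, the SMTP half of that guarantee is just
the MX priority list: a sending host works through the
exchanges lowest-preference first.  A rough illustration
using the third-party dnspython package (the domain is a
placeholder):

    import dns.resolver   # third-party package: dnspython

    def mail_exchangers(domain):
        """Return the MX hosts for a domain, lowest preference first."""
        answers = dns.resolver.resolve(domain, "MX")
        return [str(rr.exchange).rstrip(".")
                for rr in sorted(answers, key=lambda rr: rr.preference)]

    # A sender tries each host in this order until one accepts the message.
    # print(mail_exchangers("example.com"))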

-- Michael --

> ----- Original Message -----
> From: "Michael Fair" <michael at daclubhouse.net>
> To: "Sebastian Hagedorn" <Hagedorn at uni-koeln.de>; "David Chait"
> <davidc at bonair.stanford.edu>
> Cc: <info-cyrus at lists.andrew.cmu.edu>
> Sent: Saturday, October 19, 2002 10:51 PM
> Subject: Re: Distributed File Systems
>
>
> Not to ruffle any feathers, but this approach isn't
> any different philosophically from a DFS.  The only
> difference here is that the DFS is created by both
> servers accessing the same physical RAID drive rather
> than by constantly sharing the FS data.
>
> The extremely significant downside to this approach
> is that the systems MUST be physically near each
> other (AKA only as far as the SCSI connection allows)
> and you don't really get that much redundancy.
>
> In my experience (all mileage varies) I end up dealing
> with network outages far more often than I do server
> failure.  So I consider a network failure my number
> one priority.  Behind that are the disk drives, followed
> by the power supplies, then lastly the motherboard and
> its components.   This approach shares an external RAID
> array across two machines.  So in other words, the
> only thing you are protecting yourself against is
> the failure of the server components inside the box
> (power supply, CPU, motherboard, SCSI controller, and
> network card) and the disk drives.  These are only the
> second least likely things to fail.
>
> Further, you've introduced yet another server and power
> supply failure risk with the RAID array itself.  While
> I am in favor of external RAID arrays for large drive
> capacity scaling and high I/O throughput, I'm also
> aware that I am usually introducing a
> single point of failure.  You are lost if the array
> itself goes or there is a failure of any of the other
> components connecting this thing to your network.
>
>
>
> The power supply in the server itself (the third most
> likely component to fail) can be protected by simply
> getting a dual redundant hot-swappable power supply on
> the server.  I've bought several 2u rackmounts with
> this feature, and if you're really desperate Dell makes
> a 1u with dual redundant power (Dell sells the only ones
> I've ever seen).  Of course, dual redundant power is most
> effective when you can put each supply on a separate
> circuit; if both supplies are on the same circuit
> you're only protected against supply failure, not power
> loss, but that's a separate issue.  There are also ways
> to solve the source-of-power issue that I feel are beyond
> the scope of this email, so I'll just stick to failures
> of the supply itself for the rest of this email.
>
>
>
> Redundant network interfaces on the motherboard are
> pretty standard, so let's assume the same box we bought
> with the dual power has them and has been set up so
> that if one network interface fails the other takes
> over.  A better approach is to point the two interfaces
> at different physical network topologies, but that creates
> other routing and failover problems, so again, I'll just
> stick to physical interface failure for this email.
>
>
>
> Instead of the external SCSI RAID array the
> other solution proposed, I'm going to use an internal
> hot-swappable RAID 1 setup.
>
>
> So let's compare:
>
> - 2 servers w/ dual power, dual CPU, and dual network +
>   external RAID 5 also with dual power.
> appx cost: $10,000
>
> Redundancy level:
> Protected against all failures except RAID array internal
> components failure.
>
>
> - 1 server w/ dual power, dual CPU, and dual network +
>   internal hot-swappable SCSI RAID 1
> appx cost: $3,000 (I've built these with IDE for $2,500)
>
> Redundancy level:
> Protected against all failures except server internal
> components failure.
>
>
> Essentially you end up with the same amount of risk
> for about 30% of the cost.  If anyone sees something
> different and would like to correct me, please do.
> Uptime is something I think we all take seriously and
> I would appreciate being corrected.  There are hot-swap
> PCI technologies and card-slot servers that can push
> the fault tolerance even further, but I have
> not had the pleasure of working on or pricing them.
>
>
>
> Now I don't know about you, but all the data I've gathered
> says that you are more likely to purchase a faulty CPU,
> motherboard, or SCSI controller than you are to have one
> fail on you.  For most organizations, that's an acceptable
> risk.  For those that can't accept that risk: if you
> purchase two of the servers I just described and add in
> a third and fourth network interface to act as a back
> channel between the two machines for real-time drive updates
> and for notification of primary server failure,
> then for $6,000 you've exceeded the redundancy of the
> original RAID 5 setup.  I leave setting up the pair for
> drive updates and failover as an exercise for the reader.
>
>
>
> Now let's look at how much fault tolerance we've really
> gotten.  Best case scenario is we've spent the $6,000
> for the dual server setup.  That leaves us protected
> against everything except the number one most likely
> component to fail, which is the network.  Now we can
> go spend the rest of that original $10,000 on creating
> a fully redundant network, but I believe a better solution
> is geographic dispersion and hot failover.
>
>
> The problem with geographic dispersion is making sure
> your hot failover A) has as much recent data as
> possible, B) can detect that it needs to fail over, and
> C) can be taken advantage of by your end users.
>
> A and B are exactly where distributed file systems
> become useful.
>
> But unfortunately, using a DFS (assuming a suitable
> one can be found) only solves 2/3 of the problem.  The
> other 1/3 is making sure that clients know how to get
> to the new server once a failure occurs.
>
> One less-than-optimal solution is to put your DNS TTL
> at some insanely low number like 5 minutes (or 1 for
> the ultra paranoid).  Then when the primary server fails,
> you update the DNS and within a few minutes everyone is
> running again.  This, however, doesn't let you take
> any advantage of all that redundant hardware you've
> invested in.  It simply waits until it's called upon.
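
A rough illustration of the detection half of that, as a
Python sketch: probe the primary's service port and flag a
failover when it stops answering.  The host name and port
are placeholders, and the actual DNS update is whatever your
name server or provider gives you, so it stays a comment.

    import socket
    import time

    PRIMARY = ("imap1.example.com", 143)   # placeholder primary server

    def port_answers(host, port, timeout=5):
        """True if a TCP connection to host:port succeeds in time."""
        try:
            socket.create_connection((host, port), timeout=timeout).close()
            return True
        except OSError:
            return False

    failures = 0
    while True:
        failures = 0 if port_answers(*PRIMARY) else failures + 1
        if failures >= 3:
            # Primary looks dead: push the standby's address into DNS
            # here (zone edit, nsupdate, provider API -- whatever you
            # run), then wait out the short TTL while clients re-resolve.
            print("primary unreachable, repoint DNS at the standby")
            failures = 0
        time.sleep(60)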
>
> Ideally what you want is something where you can read
> and write to any server and be guaranteed that all others
> will reflect any changes made.  A DFS can help with this
> problem, since the Cyrus server itself is written in such
> a way as to expect multiple processes trying to access the
> same set of files (the fact that the other processes are on a
> different server is hidden by the DFS).  But it can't
> solve the end user failover problem.
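
To make that concrete: the kind of multi-process coordination
in question is ordinary file locking, roughly like the Python
sketch below (the path is made up for illustration).  For a
DFS to be safe here, that lock has to mean the same thing when
the other process holding it sits on another node.

    import fcntl

    # Illustrative mailbox index path, not a real deployment.
    path = "/var/spool/imap/user/michael/cyrus.index"

    with open(path, "r+b") as index:
        fcntl.flock(index, fcntl.LOCK_EX)   # exclusive advisory lock
        try:
            pass   # ... read-modify-write the index here ...
        finally:
            fcntl.flock(index, fcntl.LOCK_UN)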
>
> So at best what we get is that any end user could be
> tied to any server in the set of available servers.
> But there is no way for a client to automatically
> switch to a secondary server without user intervention.
> (The problem is the same whether it is a frontend
> server or a backend server.  While theoretically a
> frontend server could be made smart enough to search
> a set of backend servers, you still have the same
> problem if a frontend server becomes inaccessible.)
>
>
> I'm not certain how to solve the end user client dilemma.
> Ideally I'd like to just give the client N IP addresses
> in response to its DNS query and expect it to choose one
> it can actually get to, failing only if it couldn't
> contact any of them.  One way to simulate this would be
> to make something like perdition smart enough to
> understand a set of possible sources and then put
> the perdition server as close as possible to the end
> users to maximize the amount of network failure it can
> tolerate.  Another way (though not as easy) would be to
> put software on the client itself and have it be the
> smarts.  But that to me just seems like an unmanageable
> solution, and it certainly won't work for ISPs, which
> get frowned upon for forcing their clients to install
> software on their computers.
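
A sketch of that perdition-style idea, in Python: a small
relay near the users that, for each incoming connection,
picks the first backend it can actually reach and shuttles
bytes both ways.  The host names and ports are invented for
illustration, and a real proxy would still need the IMAP
login and mailbox-location smarts that perdition exists for.

    import socket
    import threading

    LISTEN = ("0.0.0.0", 143)
    BACKENDS = [("imap-a.example.com", 143),   # placeholders, in order
                ("imap-b.example.com", 143)]

    def first_reachable(backends, timeout=5):
        """Connect to the first backend that answers, or return None."""
        for host, port in backends:
            try:
                return socket.create_connection((host, port),
                                                timeout=timeout)
            except OSError:
                continue
        return None

    def pump(src, dst):
        """Copy bytes one way until either side closes."""
        try:
            while True:
                data = src.recv(4096)
                if not data:
                    break
                dst.sendall(data)
        except OSError:
            pass
        finally:
            src.close()
            dst.close()

    def handle(client):
        backend = first_reachable(BACKENDS)
        if backend is None:
            client.close()          # every backend is down
            return
        t1 = threading.Thread(target=pump, args=(client, backend))
        t2 = threading.Thread(target=pump, args=(backend, client))
        t1.daemon = t2.daemon = True
        t1.start()
        t2.start()

    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    listener.bind(LISTEN)
    listener.listen(64)
    while True:
        conn, _ = listener.accept()
        handle(conn)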
>
>
> Assuming the client side can be solved, I see hope in the
> Cyrus Murder project being extended to allow backend
> servers to be mirrors of each other.  Either that or
> testing and integration with a setup like CODA, which has
> addressed many of these issues in detail.  Unfortunately,
> CODA ultimately relies on user interaction to resolve
> double-write conflicts, but it does support that
> all-important disconnected operation.  Unfortunately again,
> I don't think that IMAP has anything in its protocol to
> ask an end user a question to help it resolve a conflict
> like that.
>
>
>
> So to sum up, I think the single biggest hurdle in this
> fault tolerance game is the end user's ability to be
> redirected when the server they are used to talking to
> fails.  Behind that is ensuring that multiple servers
> have a consistent copy of the mail store.  Then behind
> that is fault tolerance of the server itself, in the
> order of disk drives, power, then motherboard components.
>
>
>
> While I'm sure others may put things in a slightly
> different order than I have, I'm pretty sure I hit
> all the necessary points.  Please speak up if I've
> missed one.
>
>
> -- Michael --
>
> ----- Original Message -----
> From: "Sebastian Hagedorn" <Hagedorn at uni-koeln.de>
> To: "David Chait" <davidc at bonair.stanford.edu>
> Cc: <info-cyrus at lists.andrew.cmu.edu>
> Sent: Saturday, October 19, 2002 8:21 AM
> Subject: Re: Distributed File Systems
>
> -- David Chait <davidc at bonair.stanford.edu> is rumored to have mumbled on
> Friday, 18 October 2002 23:23 -0700 regarding Distributed File
> Systems:
>
> Hi,
>
> >     Has anyone here looked into or had experience with Distributed File
> > Systems (AFS, NFS, CODA, etc.) applied to mail partitions to allow for
> > clustering or failover capability of Cyrus IMAP machines? I have seen
> > docs for splitting the accounts between machines, however this doesn't
> > seem like the best fault-tolerant solution.
>
> distributed file systems don't work. Look here for a different approach:
>
>
> <http://asg.web.cmu.edu/archive/message.php?mailbox=archive.info-cyrus&msg=17132>
> --
> Sebastian Hagedorn M.A. - RZKR-R1 (Flachbau), Zi. 18, Robert-Koch-Str. 10
> Zentrum für angewandte Informatik - Universitätsweiter Service RRZK
> Universität zu Köln / Cologne University - Tel. +49-221-478-5587