Distributed File Systems
Jeremy Rumpf
jrumpf at heavyload.net
Sun Oct 20 17:12:43 EDT 2002
On Sunday 20 October 2002 04:07 pm, David Chait wrote:
> Jeremy,
> While that would resolve the front end problem, in the end your mail
> partition would still be a single point of failure. I'm trying to find a
> way to do a real time replication of the mail partition between both
> machines to allow for a complete failover.
Yes, that's true. In that case, I believe you're looking for a truly geographically
distributed system for failover. For me, this would blow my current cost
margin :(. Some good reading can be found at:
http://lists.canonical.org/pipermail/kragen-tol/2002-January/000655.html
http://www.usenix.org/publications/library/proceedings/usits97/full_papers/christenson/christenson_html/christenson.html
http://www.dsg.cs.tcd.ie/~vjcahill/sigops98/papers/saito.ps
http://www.hpl.hp.com/personal/Yasushi_Saito/pubs.html
Most of the above describes transaction-based geographic mail distribution
that doesn't exist yet. For now, we're forced to come up with workarounds to the
issue. I personally was considering looking into using InterMezzo to attempt
to accomplish this (from a purely experimental standpoint).
http://www.linuxplanet.com/linuxplanet/reports/4368/1/
For production use though, I've chosen to harden the heck out of my
installation. Perhaps other folks would be interested in sharing what their main
sources of downtime are? For the systems I run they would be:
1> Downtime due to OS-level software patches/upgrades.
2> Power supply failures (yes, even one of my "carrier grade" Sun 200Rs just
lost a power supply last month).
3> Dead disk drives (the brain-dead admins before me didn't understand disk
mirroring).
4> Facility power maintenance (both the UPS and building power). Two months
ago we had a power outage because they had to lubricate the contactors in the
main panel.
5> Network outages.
I've been working to resolve most of the above issues.
1> The failover solution I previously described allows me to alleviate this
issue.
2> I've got dual power supplies for almost all of my hardware now. One supply
is plugged into the UPS on one circuit, the other is plugged into building
power (through a filtering surge suppressor).
3> I'm using hardware RAID where I can; other areas are using software RAID
across separate SCSI controllers (on machines where hardware RAID isn't
directly cost effective). A rough sketch of that setup follows this list.
4> When they have to service the UPS, I switch over to the power supplies on
building power. When they have to service building power, I try to
coordinate things such that the UPS will hold long enough to cover the
outage.
5> Network outages are hard to compensate for because our backbone feed into
the main data facility isn't designed for this. We have one main core router
with all external fiber feeds running into it :(. I've been talking with
our network engineers about putting in a backup router and having two routers
on each network segment. The servers (on the simple side) would then carry two
equal-cost default routes, one to each router on the Ethernet segment
(possibly across two switches as well). This would allow the network admins
to do IOS upgrades without directly affecting the servers' link to the
outside world. A sketch of the routing side also follows this list.
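
To make item 3 a bit more concrete, here's a rough sketch of the kind of
mirror I mean, using mdadm on a Linux box (the device names and mount point
are just placeholders; pick partitions that actually sit on different SCSI
controllers):

    # Mirror two partitions that live on different SCSI controllers,
    # e.g. /dev/sda1 on the first HBA and /dev/sdc1 on the second.
    mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdc1

    # Build a filesystem on the mirror and mount it wherever your
    # mail spool lives (adjust the path for your installation).
    mke2fs -j /dev/md0
    mount /dev/md0 /var/spool/imap

The point is simply that losing either a drive or an entire controller
leaves the mirror degraded but the spool still serving.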
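
And for item 5, a minimal sketch of the dual default route idea, assuming a
Linux server with iproute2 and multipath routing compiled into the kernel
(the addresses are made up, and I haven't actually tested this against our
gear yet):

    # One multipath default route with two equal-cost next hops,
    # one pointing at each router on the Ethernet segment.
    ip route replace default scope global \
        nexthop via 192.168.1.1 dev eth0 weight 1 \
        nexthop via 192.168.1.2 dev eth0 weight 1

    # Verify what the kernel actually installed.
    ip route show

With something like that in place, taking one router down for an IOS upgrade
should leave the other next hop carrying the traffic, though how cleanly it
fails over depends on how quickly the kernel notices the dead gateway.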
If anyone has done any work or testing in this area, it would be highly
beneficial to me to hear the results :).
Thanks,
Jeremy