Distributed File Systems
Jeremy Rumpf
jrumpf at heavyload.net
Sun Oct 20 17:12:43 EDT 2002
On Sunday 20 October 2002 04:07 pm, David Chait wrote:
> Jeremy,
> While that would resolve the front end problem, in the end your mail
> partition would still be a single point of failure. I'm trying to find a
> way to do a real time replication of the mail partition between both
> machines to allow for a complete failover.
Yes, that's true. In that case, I believe you're looking for a truly geographically
distributed system for failover. For me, this would blow my current cost
margin :(. Some good reading can be found at:
http://lists.canonical.org/pipermail/kragen-tol/2002-January/000655.html
http://www.usenix.org/publications/library/proceedings/usits97/full_papers/christenson/christenson_html/christenson.html
http://www.dsg.cs.tcd.ie/~vjcahill/sigops98/papers/saito.ps
http://www.hpl.hp.com/personal/Yasushi_Saito/pubs.html
Most of the above describes transaction-based geographic mail distribution
that doesn't exist yet. For now, we're forced to come up with workarounds to the
issue. I personally was considering looking into using InterMezzo to attempt
to accomplish this (from a purely experimental standpoint).
http://www.linuxplanet.com/linuxplanet/reports/4368/1/
For production use though, I've chosen to harden the heck out of my
installation. Perhaps other folks would be interested in sharing what their main
sources of downtime are? For the systems I run they would be:
1> Downtime due to OS-level software patches/upgrades.
2> Power supply failures (yes, even one of my "carrier grade" Sun 200Rs just
lost a power supply last month).
3> Dead disk drives (the brain-dead admins before me didn't understand disk
mirroring).
4> Facility power maintenance (both the UPS and building power). Two months
ago we had a power outage because they had to lubricate the contactors in the
main panel.
5> Network outages.
I've been working to resolve most of the above issues.
1> The failover solution I previously described allows me to alleviate this
issue.
2> I've got dual power supplies for almost all of my hardware now. One supply
is plugged into the UPS on one circuit, the other is plugged into building
power (through a filtering surge suppressor).
3> I'm using hardware RAID where I can; other areas are using software RAID
across separate SCSI controllers (on machines where hardware RAID isn't
directly cost effective). A rough sketch of that setup follows this list.
4> When they have to service the UPS, I switch over to the power supplies on
building power. When they have to service building power, I try to
coordinate things such that the UPS will hold long enough to cover the
outage.
5> Network outages are hard to compensate for because our backbone feed into
the main data facility isn't designed for this. We have one main core router
with all external fiber feeds running into it :(. I've been talking with
our network engineers about putting in a backup router and having two routers
on each network segment. The servers (on the simple side) would then carry two
equal-cost default routes, one to each router on the Ethernet segment
(possibly across two switches as well). This would allow the network admins
to do IOS upgrades without directly affecting the servers' link to the
outside world. A sketch of the routing side also follows this list.
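
To make item 3 a bit more concrete, here's a rough sketch of the kind of
mirror I mean, using mdadm on a Linux box (the device names and mount point
are just placeholders; pick partitions that actually sit on different SCSI
controllers):

    # Mirror two partitions that live on different SCSI controllers,
    # e.g. /dev/sda1 on the first HBA and /dev/sdc1 on the second.
    mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdc1

    # Build a filesystem on the mirror and mount it wherever your
    # mail spool lives (adjust the path for your installation).
    mke2fs -j /dev/md0
    mount /dev/md0 /var/spool/imap

The point is simply that losing either a drive or an entire controller
leaves the mirror degraded but the spool still serving.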
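
And for item 5, a minimal sketch of the dual default route idea, assuming a
Linux server with iproute2 and multipath routing compiled into the kernel
(the addresses are made up, and I haven't actually tested this against our
gear yet):

    # One multipath default route with two equal-cost next hops,
    # one pointing at each router on the Ethernet segment.
    ip route replace default scope global \
        nexthop via 192.168.1.1 dev eth0 weight 1 \
        nexthop via 192.168.1.2 dev eth0 weight 1

    # Verify what the kernel actually installed.
    ip route show

With something like that in place, taking one router down for an IOS upgrade
should leave the other next hop carrying the traffic, though how cleanly it
fails over depends on how quickly the kernel notices the dead gateway.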
If anyone has done any work or testing in this area, it would be highly
beneficial to me to hear the results :).
Thanks,
Jeremy