
Should my application be highly available or fault tolerant?

Blogger Brad Bird describes a clustering solution that provides both high availability and fault tolerance for applications.

To remain competitive in today's marketplace, your business applications and services must be highly available. Yet, like death and taxes, server downtime is all but guaranteed.

Servers become unavailable for a variety of reasons, but the major causes include:

1. Maintenance

2. Upgrade

3. Update (patch)

4. Accident

5. Power outage

6. Disaster

For these reasons, you want the services and applications your business depends on to be highly available, so that no interruption of service occurs, your customers continue to rely on you, and revenue keeps flowing.

Hardware failure is another possible cause of a service interruption. When designing your infrastructure, ideally you want the solution to be both fault tolerant and highly available. A system designed to withstand a hardware failure is fault tolerant. Typically, a highly available system is also fault tolerant. That said, it is possible to have a system that is highly available but NOT fault tolerant.

Consider the example of round robin DNS. Here we have three host computers, Host A, Host B, and Host C, configured to provide DNS service in a round robin fashion for high availability. As DNS queries arrive, they are handled in turn by Host A, Host B, and Host C. Now suppose Host B fails. In this situation, one out of every three DNS queries fails: in the round robin, the request to Host A succeeds, the request to Host B fails, and the request to Host C succeeds. Requests continue to be processed this way until either Host B returns to service or the round robin is reconfigured to use only the two remaining hosts.
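To make that failure pattern concrete, here is a minimal Python sketch of the rotation described above. The host names and the hard-coded failure of Host B are illustrative assumptions, not a real resolver.

```python
# Minimal sketch of round robin rotation across three DNS hosts.
# The host names and the hard-coded failure of "Host B" are illustrative only.
from itertools import cycle

hosts = {"Host A": True, "Host B": False, "Host C": True}  # False means the host is down
rotation = cycle(hosts)

def resolve(query: str) -> str:
    """Hand the query to the next host in the rotation, whether it is up or not."""
    host = next(rotation)
    if hosts[host]:
        return f"{query}: resolved by {host}"
    return f"{query}: FAILED (no answer from {host})"

# With Host B down, every third query fails until B recovers
# or the rotation is reconfigured to skip it.
for i in range(6):
    print(resolve(f"query-{i}"))
```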

The example above illustrates a highly available solution: if maintenance were needed on one host, the round robin DNS service could be reconfigured to use only the other two. Because the solution cannot withstand a hardware failure without dropping queries, however, it is not fault tolerant.

That is not to say nothing can be done to make this solution fault tolerant. One easy fix is at the client level. When a DNS query fails, the client receives a response indicating that the name or IP address could not be resolved. If the client is configured with a secondary DNS server, then despite that failure a new query is sent to the secondary server, with the potential of being resolved. This is one example of a fault tolerant approach.
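As a rough illustration of that client-side fallback, here is a short Python sketch. The query_dns() helper and the server addresses are hypothetical placeholders for whatever resolver call and configuration a real client would use.

```python
# Sketch of client-side fallback: try the primary DNS server first,
# then the secondary if the primary fails.

PRIMARY = "192.0.2.10"    # example addresses from the RFC 5737 documentation range
SECONDARY = "192.0.2.11"

class ResolutionError(Exception):
    """Raised when a server cannot resolve the name."""

def query_dns(server: str, name: str) -> str:
    """Hypothetical stand-in for a real DNS lookup against a specific server."""
    # A real client would send the query to `server` here; this placeholder
    # simply fails so the fallback path below is exercised.
    raise ResolutionError(f"{server} did not answer for {name}")

def resolve_with_fallback(name: str) -> str:
    """Try the primary DNS server first, then the secondary."""
    last_error = None
    for server in (PRIMARY, SECONDARY):
        try:
            return query_dns(server, name)
        except ResolutionError as err:
            last_error = err  # this server failed; fall through to the next one
    raise ResolutionError(f"no configured server could resolve {name}") from last_error
```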

On the server side, hardware can be configured to eliminate single points of failure. Candidates to look at include:

1. Power supplies: Have enough to handle the workload, plus one spare to withstand a failure

2. Network cards: Team network cards so that if one fails, the other continues processing

3. CPUs: Multiple CPUs

4. Memory: Hot spare parts work well here

5. Planar or main board: A spare server is really the only answer here

In the case of a main board failure, the only recourse is a spare server. Two or more servers handling the same workload to maintain high availability are referred to as a cluster.

Used in this manner, a cluster provides both highly available and fault tolerant service for an application. Since servers configured for round robin DNS can also be referred to as a cluster, care must be taken when applying this terminology. The specific type of cluster used for DNS can be referred to as a network load balancing (NLB) cluster; I have also heard these called "front-end" clusters. NLB clusters are highly available but not always fault tolerant.
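To illustrate the distinction, here is a minimal Python sketch contrasting a plain round robin, which keeps handing work to a dead member, with one that health-checks members and routes around the failure. The hosts and the is_healthy() probe are illustrative assumptions, not how any particular NLB product works.

```python
# Sketch: plain round robin keeps sending work to a failed host (highly available
# overall, but individual requests fail), while a health-checked rotation skips it.
from itertools import cycle

class NLBCluster:
    def __init__(self, hosts):
        self.hosts = hosts                  # host name -> is_up flag (illustrative)
        self.rotation = cycle(list(hosts))

    def is_healthy(self, host) -> bool:
        """Stand-in for a real probe (ping, TCP connect, HTTP check, ...)."""
        return self.hosts[host]

    def next_host(self, health_checked: bool):
        """Return the next host; optionally skip members that fail the health check."""
        for _ in range(len(self.hosts)):
            host = next(self.rotation)
            if not health_checked or self.is_healthy(host):
                return host
        return None  # every member has failed

cluster = NLBCluster({"Host A": True, "Host B": False, "Host C": True})
print([cluster.next_host(health_checked=False) for _ in range(3)])  # Host B still gets traffic
print([cluster.next_host(health_checked=True) for _ in range(3)])   # Host B is skipped
```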

What is your experience with clustering? How would you plan for a fault tolerant or highly available solution? Please share your thoughts.


About

Brad Bird is a lead technical consultant and MCT certified trainer based in Ottawa, ON. He works with large organizations, helping them architect, implement, configure, and customize System Center technologies, integrating them into their business processes.

14 comments
Jaqui

It's not an either / or thing. A good developer will make an application BOTH, if that is in the requirements and they are allowed to by the twits that keep screwing them over.

alexisgarcia72

I believe the best solution now, and even more so in the future, for fault tolerance and availability / load balancing will be the use of virtualization technology. Not only is VMware king with ESX 3.x and years of experience, but Citrix and Microsoft now have similar solutions (Xen & Hyper-V). We run 2 boxes with VMware, and with VMotion we are able to provide maintenance, fault tolerance, patching, and great performance / redundancy, and the users never notice when we have problems or issues.

john.jelks

On our SQL server, we have server-side configurations 1-5 mentioned in the article to aid in fault tolerance (and RAID 5). We use XOsoft's WANSynch to mirror the SQL db on another server. My concern about mirroring is that any botched queries or data corruption will faithfully (and immediately) be replicated to the mirror server. We run a nightly SQL dump, but that is a backup procedure, not fault tolerance. Any other tools or ideas on keeping SQL available? (It's MSSQL 2005.) I view this as a network engineer and wonder if I'm overlooking another angle. BTW, I'd have thought this topic would be burning up the board. ~outsourcing~ (there, that should fish 'em in). Thanks, JJ

jhoward

For WAN fault tolerance, expensive networking equipment makes a huge difference and is money well spent. On the public application side (DNS, HTTP, etc.) it makes more sense to use something like Linux Virtual Server (Linux Director) and Linux-HA, or some other load balancing software, across redundant commodity hardware to manage clusters of application servers behind it. Today's commodity hardware has grown to include 1U servers with dual NICs and hot-swappable terabyte drives for less than $700 US. The OS the applications run on does not matter, since the apps are accessed via network protocols and ports which are managed by the load balancer. With some careful planning, each individual server becomes less important and fault tolerance becomes less of an issue.

ken.donoghue

It's great that the discussion about fault tolerance is on the rise. JV711 does help to elevate the discussion, but does not go far enough. FT UNIX systems aren't the only ones in the data center, and high availability and fault tolerance are different technologies. Right up front, I'm with Stratus, which has been making fault-tolerant servers for nearly 30 years. Today they are Intel-based and support Windows, Linux, and VMware. These are not your father's fault tolerant servers.

A fault tolerant architecture is designed to prevent hardware failures, unplanned downtime, and data loss. It delivers the availability gold standard of 99.999 percent uptime or better. High availability solutions are "recovery" solutions, i.e., returning to service after the failure has occurred. Clusters are not fault tolerant; they experience failover and data loss. With a tremendous amount of work and constant attention, an HA solution may reach 99.99 percent. The difference is significant when the average cost of downtime for a company is about $150K. For mission-critical apps where no amount of downtime is acceptable because of data loss, compliance issues, customer service, or lost productivity, this should be a concern to users.

There are software-based fault-tolerant solutions (i.e., five nines). They have a number of limitations, the most significant of which is that they do not support SMP, which obviously limits workload and utility. A true fault tolerant hardware architecture is the equivalent of two standard servers in a single box. Everything is duplicated except for the system clock. The two "halves" run in complete lockstep, doing the same thing at the same time, all the time. They are so tightly coupled that the OS and application see only one logical server (and only one license is required for most apps). If a component fails within the system, its mate is already on the job ... there is no hand-off, failover, restart, whatever. The application and user see no impact whatsoever. Everything continues to run uninterrupted, and continues that way even during replacement of the failed component and the resynch of the two physical server halves.

Those are the FT basics, without going into diagnostics, operational simplicity, root cause analysis, protection against transient errors, and a bunch of other stuff that distinguishes FT from HA. Do all applications need FT? Certainly not. There is a premium to be paid for this level of availability, but far less than most would have you believe. HA clusters are actually not much cheaper, a whole lot more temperamental and, in the end, a lot more than most people bargain for.

JV711

Brad, decent post that touches on the Windows perspective of HA/fault tolerance, but let's step up the ladder to the higher levels of fault tolerance found in datacenter Unix servers.

Brad >> In the case of a main board failure, the only recourse is to have a spare server. To maintain high availability, more than one server handling the same workload is referred to as a cluster.

HA/fault tolerance means to me "some hardware components can fail and the system will keep on functioning." My motherboards contain 2 or 4 (real physical) CPUs, but the system chassis itself contains 1, 2, or more (sometimes 6) motherboards. I can have a motherboard or CPU fail, and the system will automatically reboot, note in its console log that it put "cpu/mem board # 1" on the blacklist, and ignore it and continue on without skipping a beat. Human intervention was not required to continue functioning. In the running, active Solaris OS, an individual component, like a gigabit NIC in slot #2 or the PCI-X card in slot #7, can be marked out and the operating system will ignore it and not use it. You can yank and hot-plug redundant power supplies, cooling fans, DVD drives, PCI cards, and especially hard disks. Of course these components will trigger an email as they slowly die or have died already (no complex SNMP infrastructure required).

For instance, the Dell R900 and the Sun M4000 of similar specs both cost about US $29K list price off their respective web sites, but it's my wish that the predominantly Windows admins and IT managers of the world begin to see that Unix sets you free to create greater levels of availability for your users and applications. I pick on Dell because you mention them in your blog at http://owsug.ca/blogs/brad/default.aspx

In the (much) more expensive class of Unix servers like the Sun 6800, you can, WITHOUT halting the operating system, use "cfgadm" to offline a system board and evacuate processes from the memory on that board ("attention Oracle database: get out of the building and go to the cpu/mem board next door"). Then the offlined cpu/mem board will go into low-power mode and indicate with an orange LED that it's "safe to yank" from the running system. The Oracle database in this example was never unavailable to the end user, not even for a second.

So HA/fault tolerance is elevated to a science and an art form in Solaris. Folks - try the Unix, you might like it. JV

john.jelks

This article starts in a good direction, but focuses on app access via LAN only. We have customers accessing apps from outside our network. This requires redundant Internet access. Two circuits are great but you also need circuit load balancing with true local DNS hosting (so one does not have to wait 1-6 hours for DNS changes made on the external hosting service to resolve). We use Radware's LinkProof boxes, one on each circuit. When a circuit drops (we are in a poor service area) the DNS transfer is instantaneous. One also needs enough public IPs on each circuit. Expensive hardware? Yes. But it's even more expensive when the apps are down and we get billed for missed commitments. John Jelks

josephg

Virtual machines will not give you fault tolerance! Read Ken's response above. With VMware and VMotion you have a very elegant way to move your application among servers, but that technology does nothing to protect the "in-flight transaction." To do that you need something like Tandem's (now HP's) Guardian OS and its "NonStop" programming paradigm, or alternatively machines like the Stratus Computer family or the FT VAX, which provide(d) hardware-based fault tolerance. What is the cost of losing that in-flight transaction? I did work for Stratus in the 90's. Joe

JV711

Never knew about Stratus, but I did have a Tandem in the early 90s with its "lock-step dual cpu" architecture. Too bad Compaq massacred Tandem when it acquired them in the late 90s. Also, clustering with commodity hardware requires bizillions of other fault tolerant pieces of infrastructure, like a load balancer that is a $100K Cisco Director (highly fault tolerant equipment). If your load balancer ITSELF is a $700 1U rackmount, "whatchugoingdo" when the load balancer is the target of a DoS attack, the rack power cord gets unplugged by the cleaning lady (I've had this happen), or you simply get slashdotted/dugg/way popular and suddenly you've got 100,000 incoming requests on a TCP/IP stack that is rated to handle a max of 1024 simultaneous requests and will run out of RAM and crash when it gets the 1025th? (Try it sometime!!!) Also, I want to emphasize that point about "in-flight transactions" in a multi-tiered web/java/php/sqlplus/oracle transaction. I always roll my eyes or yawn when the cluster software salesman tries to pass off "restarting an IIS webserver that serves up static webpages" or "NFS" as an example of a highly available application service. When it takes *10 minutes* for the 1TB database to perform an orderly shutdown and an equal amount of time to restart, you ain't fooling nobody that your shizznit is either fault tolerant or highly available or clustered. True fault tolerance costs $$$, no way around it.

saint_p50

Drop back ten yards and punt, brother! P.St.G.

Photogenic Memory

Your DNS comment was dead on! That's an interesting solution. Thanks for posting.

thomas.nilsen

It's true that VMware does not currently offer true fault tolerance, but rather high availability. With ESX 4.x scheduled for release sometime in 2009, it will be possible to have true fault tolerance on VMware guests, where everything is replicated from one server to another, creating a hot standby - including memory/CPU etc. More info at: http://download3.vmware.com/vdcos/demos/FT_Demo_800x600.html Thomas

ken.donoghue

A five-nines server at about $13K doesn't seem an exorbitant amount for making sure a critical app doesn't go down. A full-blown data center server goes for around $45K. Tandems and Stratus' own Continuum systems are of a different era (hundreds of thousands to over a million bucks) -- although after all these years both lines are still keeping serious applications ticking.
