To remain competitive in today’s marketplace, high availability of business applications and services is essential. Yet, like death and taxes, server downtime is all but guaranteed.

Servers become unavailable for a variety of reasons, but the major causes include:

1. Maintenance

2. Upgrade

3. Update (patch)

4. Accident

5. Power outage

6. Disaster

For these reasons, you want the services and applications your business depends on to be highly available, so that no interruption of service occurs and your customers continue to rely on you, which in turn keeps revenue flowing.

Hardware failure can also cause a service interruption. When designing your infrastructure, ideally you want the solution to be both fault tolerant and highly available. A system designed to withstand a hardware failure is fault tolerant. Typically, a highly available system is also fault tolerant. That said, it is possible to have a system that is highly available but NOT fault tolerant.

Consider the example of round robin DNS. Here, three host computers (Host A, Host B, and Host C) are configured to provide DNS service in a round robin fashion for high availability. As DNS queries arrive, they are handled in turn by Host A, Host B, and Host C. Now suppose Host B fails. In this situation, one out of every three DNS queries would fail: the request to Host A would work, the request to Host B would fail, and the request to Host C would work. Requests would continue to be processed this way until either Host B returns to service or the round robin is reconfigured to use only the two remaining hosts.

The above example illustrates a highly available solution: if maintenance is needed on one host, the round robin DNS service can be configured to use only the other two. Since this solution cannot withstand a hardware failure without dropping queries, it is not fault tolerant.
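The round robin behavior described above can be sketched in a few lines of Python. This is only a simulation under simple assumptions: no real DNS traffic is sent, queries are handed out in strict rotation, and the function names (`simulate_round_robin`, `failure_rate`) are made up for illustration.

```python
from itertools import cycle, islice


def simulate_round_robin(hosts, healthy, num_queries):
    """Hand queries to hosts in strict rotation.

    Returns one (host, succeeded) pair per query. A query succeeds
    only if the host it lands on is in the `healthy` set.
    """
    rotation = cycle(hosts)
    return [(host, host in healthy) for host in islice(rotation, num_queries)]


def failure_rate(results):
    """Fraction of queries that landed on a failed host."""
    failed = sum(1 for _, ok in results if not ok)
    return failed / len(results)


healthy = {"Host A", "Host C"}  # Host B has failed

# Host B is down but still in the rotation: every third query fails.
with_failed_host = simulate_round_robin(["Host A", "Host B", "Host C"], healthy, 9)

# The rotation is reconfigured to use only the two remaining hosts.
reconfigured = simulate_round_robin(["Host A", "Host C"], healthy, 9)

print(failure_rate(with_failed_host))
print(failure_rate(reconfigured))
```

With Host B left in the rotation, one third of the simulated queries fail; once the rotation is reduced to Host A and Host C, none do. The service stays available through planned changes, but nothing in the rotation itself reacts to an unplanned failure.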

That is not to say that nothing can be done to make this solution fault tolerant. One easy fix is at the client level. When a DNS query fails, the client reports that the name or IP address could not be resolved. If the client is configured with a secondary DNS server, however, it sends a new query to that server after the first one fails, and that query has a chance of being resolved. This is one example of a fault tolerant solution.
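The client-side fallback can be illustrated with a small Python sketch. It does not talk to real DNS servers: each "server" is a plain function standing in for a resolver, and the names `resolve_with_fallback`, `failed_primary`, and `working_secondary`, along with the record for `www.example.com`, are assumptions made for the example.

```python
class ResolutionError(Exception):
    """Raised when a (simulated) DNS server cannot answer a query."""


def resolve_with_fallback(name, servers):
    """Try each configured server in order and return the first answer."""
    last_error = None
    for server in servers:
        try:
            return server(name)
        except ResolutionError as exc:
            last_error = exc  # remember the failure, then try the next server
    raise ResolutionError(f"all servers failed for {name!r}") from last_error


def failed_primary(name):
    """Stand-in for a primary DNS server that is down."""
    raise ResolutionError("primary server unreachable")


def working_secondary(name):
    """Stand-in for a healthy secondary DNS server with one known record."""
    records = {"www.example.com": "192.0.2.10"}  # documentation-range address
    if name not in records:
        raise ResolutionError(f"no record for {name!r}")
    return records[name]


# The primary fails, so the client falls through to the secondary.
print(resolve_with_fallback("www.example.com", [failed_primary, working_secondary]))
```

The point of the sketch is that the failure of the first server is absorbed by the client rather than surfaced to the user, which is what makes the overall arrangement tolerant of that one fault.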

On the server side, hardware can be configured to eliminate single points of failure. Candidates to look at are:

1. Power supplies: Have enough to carry the workload, plus one spare to withstand a failure

2. Network cards: Team network cards so that if one fails, the other continues processing

3. CPUs: Multiple CPUs

4. Memory: Hot spare parts work well here

5. Planar or main board: Spare server is really the only answer here
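As a rough illustration of the "enough for the workload plus one spare" rule for power supplies, here is a minimal Python sketch. The wattage figures and the function names (`supplies_needed`, `survives_one_failure`) are hypothetical; real sizing should follow the hardware vendor's guidance.

```python
import math


def supplies_needed(load_watts, supply_watts, spares=1):
    """Supplies required to carry the load, plus the requested number of spares."""
    return math.ceil(load_watts / supply_watts) + spares


def survives_one_failure(load_watts, supply_watts, installed):
    """True if the remaining supplies can still carry the load after one fails."""
    return (installed - 1) * supply_watts >= load_watts


# Hypothetical server drawing 900 W, fitted with 500 W supplies.
needed = supplies_needed(900, 500)
print(needed)
print(survives_one_failure(900, 500, installed=needed))
print(survives_one_failure(900, 500, installed=needed - 1))
```

The same check applies to any component that can be made redundant within a single chassis; it does not help with the main board, which is why that item calls for a spare server.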

In the case of a main board failure, the only recourse is to have a spare server. A group of two or more servers handling the same workload to maintain high availability is referred to as a cluster.

When used in this manner, a cluster provides both highly available and fault tolerant service for an application. Since servers configured as round robin DNS servers can also be referred to as a cluster, care must be taken when applying this terminology. The specific type of cluster used for DNS can be referred to as a network load-balancing (NLB) cluster. I have also heard these referred to as “front-end” clusters. NLB clusters are highly available but are not always fault tolerant.

What is your experience with clustering? How would you plan for a fault tolerant or highly available solution? Please share your thoughts.
