Whether you are providing Web services to your company or customers, hosting critical applications and data, or needing a bulletproof e-mail infrastructure, load balancing and cluster services can be the key tools for providing reliability and fault tolerance. Microsoft offers several solutions to help you build a server topology that scales well as workload increases and provides fault tolerance and failover capabilities for your mission-critical services. In this Daily Drill Down, I’ll explain Microsoft’s clustering and load balancing technologies to help you understand and apply them wisely in your enterprise.

Fault tolerance and redundancy—not just buzzwords
If you provide Web-based services for your customers, you no doubt already understand that your customers expect those services to be available 24 hours a day, seven days a week, whether they need a driver, access to an online catalog or database, or just an answer from an FAQ. But companies that host services on the Web aren’t the only organizations that need fault tolerance and redundancy. Your company might not even have a public Internet presence, but you still provide file sharing, e-mail services, and other critical services to your users. Having those services go down can have quite a negative impact on your business. For example, what would happen if your order processing system suddenly went down and was offline for a few days because you hosted it on a single server and that server died? What happens if a critical file server goes down and users can’t get to their documents for a day? Whose head will roll when the SQL server where all the customer records are kept goes down because of a fried motherboard and you have to wait two days for parts before you can get it back online?

All of these scenarios are certainly possible. However, many companies don’t realize how vulnerable they would be if their key servers went down. Having your e-mail server or key database offline for a few hours doesn’t always sound like a big deal until you realize that you might be idling an entire department for that amount of time. Multiply that by a day and you can begin to understand the consequences of a lack of fault tolerance and redundancy on your network. Even slowdowns can have an impact—the longer your users wait for a server to respond, the less time they spend getting their jobs done, which reduces productivity and profits and ultimately affects your paycheck in one way or another.

How you increase reliability and availability for services across the enterprise depends in part on the types of services you offer, whether those services are Web related or not, and the criticality of those services. If improving performance for IP-based services is your main concern, network load balancing is a good solution that can reduce overall load on individual servers and thereby improve response time, which will make your users and customers happier. Network load balancing can also provide a limited amount of redundancy to ensure that services remain available even when a server in a load-balanced group goes offline.

Network load balancing doesn’t serve the same purpose as clustering, although there is some overlap between the two services. Load balancing is targeted at increasing performance, while clustering services are targeted at improving redundancy and therefore reliability and availability. For more information about network load balancing, see the Daily Drill Down “Understanding Windows 2000 network load balancing.”

Server clustering is another option for services that are not IP based or where you need additional failover capability, such as for file and database servers. A cluster is a group of servers that run a common application set and present a single logical presence for a given service. For example, you might set up a cluster of two servers running SQL Server to ensure that if one server fails, the other will continue to function and provide services to users while you repair the failed one.

In some ways, a server cluster is a little like a redundant array of independent disks (RAID). With RAID, multiple physical drives work together to contain a file system. In a RAID 5 array, for example, data is striped across multiple drives, with parity information distributed across those drives as well. If a drive in the array fails, the array keeps running on the remaining drives, and it rebuilds itself once you replace the failed drive. If the array contains a hot spare, the array can switch to the spare and rebuild itself without any intervention. The result is that the logical volume and data remain available.
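To make the RAID analogy concrete, here is a minimal Python sketch of the XOR parity idea behind RAID 5. It is an illustration only: the block contents and the helper name xor_blocks are made up for this example, and a real array works on disk stripes in the controller or driver, not on Python byte strings.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# One stripe: three data blocks plus a parity block computed from them.
data_blocks = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data_blocks)

# Simulate losing the drive that held the second block,
# then rebuild that block from the surviving blocks and the parity.
survivors = [data_blocks[0], data_blocks[2], parity]
rebuilt = xor_blocks(survivors)

print(rebuilt)  # b'BBBB'
```

Because XOR is its own inverse, any single missing block in a stripe can be recomputed from the others. That is the same "lose one member, keep the service" property that a server cluster aims for at the server level.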

In a server cluster, an active server can fail or be taken offline without affecting the service being provided by the cluster. In our SQL Server example, you might take the active server in the cluster offline for maintenance for several hours, but because the other server in the cluster remains online and takes over the task of serving SQL requests, customers and/or users never know the server is offline. So a cluster provides fault tolerance by allowing other servers in the cluster to take over the workload for a failed server.

Microsoft Cluster Services
Microsoft Cluster Services (MSCS) is included with Windows 2000 Advanced Server and Windows 2000 Datacenter Server. MSCS supports clustering of up to two nodes under Windows 2000 Advanced Server and four nodes under Datacenter Server. It’s also available as an add-on service for Windows NT Server for up to two nodes. Microsoft Cluster Services started as a product code-named Wolfpack for Windows NT. Microsoft changed the name to Microsoft Cluster Server and then renamed it to Microsoft Cluster Services in its current implementation. Applications must be cluster aware to function under MSCS.

In an MSCS cluster, servers each have their own file system for the operating system but share a storage system for clustered applications and data. The primary function that MSCS provides is to ensure application availability through failover. Failover is the ability of the cluster to move application processing from one server in the cluster to another when a hardware or application failure occurs. For example, if one of the servers in our fictitious SQL Server cluster fails, the transactions being handled by the failed server can migrate to a healthy server in the cluster. When a server comes back online in the cluster, the application can fail back to the original server. So, although failback might seem like a negative action, it’s actually a positive one—it’s a server returning to work in the cluster after downtime, whether planned or unplanned.
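The failover and failback behavior described above can be sketched as a toy model in Python. This is not the MSCS API. The Cluster class, its method names, and the node names are all hypothetical, and the sketch assumes each application simply moves to the first healthy node.

```python
class Cluster:
    """Toy model: each application has a preferred node and a current owner."""

    def __init__(self, nodes):
        self.online = {node: True for node in nodes}
        self.preferred = {}   # application -> node it normally runs on
        self.owner = {}       # application -> node running it right now

    def add_application(self, app, preferred_node):
        self.preferred[app] = preferred_node
        self.owner[app] = preferred_node

    def fail(self, node):
        """Mark a node offline and fail its applications over to a healthy node."""
        self.online[node] = False
        healthy = [n for n, up in self.online.items() if up]
        if not healthy:
            raise RuntimeError("no healthy node left in the cluster")
        for app, current in self.owner.items():
            if current == node:
                self.owner[app] = healthy[0]

    def restore(self, node):
        """Bring a node back online and fail its applications back to it."""
        self.online[node] = True
        for app, preferred in self.preferred.items():
            if preferred == node:
                self.owner[app] = node


cluster = Cluster(["node-a", "node-b"])
cluster.add_application("sql", "node-a")

cluster.fail("node-a")
print(cluster.owner["sql"])   # node-b (failover)

cluster.restore("node-a")
print(cluster.owner["sql"])   # node-a (failback)
```

The point of the sketch is the bookkeeping: the application keeps a preferred home, and failback is just the application returning to that home once it is healthy again.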

MSCS provides stateful clustering. When a server goes offline, its applications fail over to another server in the cluster without losing the data associated with each failed application. In other words, the cluster maintains the user and application state during a failover, and that state moves to the other server along with the application. This means that users who access an Exchange server cluster will not lose access to their mailboxes or other Exchange features if their active server in the cluster goes down, even if they have an open connection to the server when the failure occurs.

As with NLB, an important advantage to MSCS is your ability to perform rolling upgrades without taking a service offline. For example, assume you’re running SQL Server in a cluster and need to upgrade the software to a new version. You take one server out of the cluster and any current connections fail over to the remaining nodes. You upgrade the server and then restore it to the cluster, where it resumes its duties. You then sequentially upgrade the other servers in the cluster in the same way. At the end of the process, you have upgraded all your servers while still providing uninterrupted service to your users or customers.
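The rolling upgrade procedure can be sketched as a simple loop. The function below is a hypothetical illustration in Python, not a Microsoft tool. It assumes the only requirement is that at least one other node keeps serving while each node is upgraded, and the version strings are placeholders.

```python
def rolling_upgrade(versions, new_version):
    """Upgrade nodes one at a time; return the serving nodes at each step.

    versions: dict mapping node name -> installed version (mutated in place).
    """
    history = []
    for node in list(versions):
        serving = [n for n in versions if n != node]
        if not serving:
            raise RuntimeError("cannot take the only node offline")
        history.append((node, serving))   # these nodes carry the load meanwhile
        versions[node] = new_version      # upgrade, then rejoin the cluster
    return history


nodes = {"node-a": "old", "node-b": "old"}
steps = rolling_upgrade(nodes, "new")
print(steps)  # [('node-a', ['node-b']), ('node-b', ['node-a'])]
print(nodes)  # {'node-a': 'new', 'node-b': 'new'}
```

At every step the list of serving nodes is non-empty, which is what lets the service stay up for the whole upgrade.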

Although MSCS is a great tool for ensuring high availability for critical applications and data, it isn’t really geared toward providing scalability for adding users. MSCS doesn’t provide dynamic load balancing, and unless an application can be configured to handle a subset of users or data, only one server is active for a given clustered application. For example, in the case of Exchange Server, you can’t have two Exchange servers in a cluster serving the same group of users at the same time to provide load balancing. You can, however, have two Exchange servers running in a cluster, each handling a specific subset of users that you specify when you set up the server. Here’s an example: Assume that you manage e-mail for two different companies. You want to provide clustering to ensure that each company’s e-mail service is always operational. You create a cluster of two Exchange servers, with server A managing company A’s users and server B managing company B’s users. If server A fails, server B takes over for it, serving company A. When A comes back online, those users migrate back to server A.

However, assume that you have a cluster-aware application that can’t divide its users or data. In such a case, the application is active on only one server in the cluster. If that primary server fails, another server that you have designated as a secondary server for the application takes over from the primary. In this scenario, only one server at a time is active for the clustered application.
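Both arrangements, the two-company Exchange layout and the single-active-server case, come down to a preference order per application. The Python sketch below illustrates that idea under stated assumptions: the server and application names are invented, and real MSCS placement involves more than picking the first online node.

```python
def active_node(preference_order, online):
    """Return the first online node in the application's preference order."""
    for node in preference_order:
        if node in online:
            return node
    return None  # no node available: the application is down


# Hypothetical layout mirroring the two-company Exchange example.
preferences = {
    "company-a-mail": ["server-a", "server-b"],
    "company-b-mail": ["server-b", "server-a"],
}

def placement(online):
    """Map each application to the node that is active for it right now."""
    return {app: active_node(order, online) for app, order in preferences.items()}


print(placement({"server-a", "server-b"}))
# {'company-a-mail': 'server-a', 'company-b-mail': 'server-b'}

print(placement({"server-b"}))
# {'company-a-mail': 'server-b', 'company-b-mail': 'server-b'}
```

An application that can't divide its users is simply a single entry in this table, with a primary followed by its designated secondary.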

Component Load Balancing
Another option for clustering in the Windows 2000 environment is Component Load Balancing (CLB), which is a component of Microsoft’s Application Center 2000 that provides load balancing for COM+ applications. COM+ combines Microsoft’s Component Object Model (COM) with Microsoft Transaction Server (MTS) to provide a mechanism for deploying and supporting component-based applications. In effect, COM+ allows you to create modular and distributed applications. For example, you might develop an order processing system that uses COM+ objects to handle various aspects of the order entry, lookup, and processing functions. If those COM+ objects are complex, they can often impose a significant load on a server, particularly if the server is handling other COM+ objects and other services such as IIS, IAS, and so on.

CLB allows you to scale your COM+ application across the servers in a cluster to a maximum of 16 nodes. CLB distributes COM+ objects across multiple servers in a cluster to provide load balancing—each COM+ object can be running on multiple servers in the cluster and therefore handling a percentage of the overall load. CLB also provides failover support, because if a server in the cluster fails, the remaining servers can take up the slack for the missing server.

CLB regularly polls the servers in the cluster to determine their response time, which is an indirect indication of how much load each server is under at that particular time. CLB orders the server list based on each server’s response time to the poll and sends COM+ activation requests to the servers based on that order. At the next polling interval, CLB reorders the list and the process continues. CLB is network intensive, so if you decide it might be an option for your company, you should deploy CLB on its own 100-Mbps or higher network segment to maximize performance and reduce the impact on other network services (and vice versa).
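The polling and ordering cycle can be sketched in a few lines of Python. This is a simplified illustration, not CLB's actual implementation. The server names and response times are invented, and the way requests are spread over the ordered list (a simple walk through the list starting with the fastest server) is an assumption made for the example.

```python
from itertools import cycle

def order_by_response_time(response_times_ms):
    """Return server names sorted fastest first."""
    return sorted(response_times_ms, key=response_times_ms.get)

def dispatch(ordered_servers, request_count):
    """Assign requests by walking the ordered list, starting with the fastest server."""
    servers = cycle(ordered_servers)
    return [next(servers) for _ in range(request_count)]


# Hypothetical results of one polling interval, in milliseconds.
poll = {"web1": 42.0, "web2": 17.5, "web3": 88.1}

routing_list = order_by_response_time(poll)
print(routing_list)               # ['web2', 'web1', 'web3']
print(dispatch(routing_list, 4))  # ['web2', 'web1', 'web3', 'web2']
```

At the next polling interval you would call order_by_response_time again with fresh measurements, so a server that has slowed down drops toward the end of the list.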

CLB is not available as a component separate from Application Center 2000. Unlike NLB and MSCS, which both require Advanced Server or Datacenter Server, CLB can run on Windows 2000 Server because Application Center 2000 supports Windows 2000 Server. Also unlike MSCS, CLB does not require shared storage.

Application Center 2000
The final option for load balancing and clustering under Windows 2000 is Microsoft Application Center 2000, which is designed primarily for creating and managing Web farms. A Web farm is a collection of Web servers functioning as a cluster to present a common Web presence such as one or more Web sites, e-commerce sites, etc. Application Center runs on Windows 2000 Server, Advanced Server, and Datacenter Server. Under Windows 2000 Server, however, Application Center requires the use of a third-party IP load balancing service because NLB is not included with Server. Application Center supports clusters of up to 16 nodes and does not require that server applications be written to support it.

Application Center uses NLB to provide general load balancing and also includes CLB, discussed previously, for load balancing COM+ applications. With MSCS, servers can be configured independently and run different applications, such as Exchange Server on one and SQL Server on another. Application Center takes a different approach. You create an initial cluster controller that manages the cluster configuration information and contains all of the content for the Web services. When you add other servers to the cluster, Application Center essentially clones the cluster controller’s configuration and content to the other servers. The result is a cluster of Web server clones all functioning as a logical unit, providing load balancing and failover.
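The cloning model can be pictured with a small Python sketch. It is only a conceptual illustration: the WebCluster class, its method names, and the configuration keys are hypothetical and do not correspond to the Application Center object model.

```python
import copy

class WebCluster:
    """Toy model of a controller whose configuration is copied to each member."""

    def __init__(self, controller_config):
        self.controller_config = controller_config
        self.members = {}

    def add_member(self, name):
        # A new member starts as an independent copy of the controller's state.
        self.members[name] = copy.deepcopy(self.controller_config)

    def synchronize(self):
        # Push the controller's current state to every existing member.
        for name in self.members:
            self.members[name] = copy.deepcopy(self.controller_config)


farm = WebCluster({"sites": ["catalog"], "content_version": 1})
farm.add_member("web2")

farm.controller_config["content_version"] = 2   # new content lands on the controller
print(farm.members["web2"]["content_version"])  # 1 (member not yet updated)

farm.synchronize()
print(farm.members["web2"]["content_version"])  # 2
```

The contrast with MSCS is visible in the data structure: there is one source of truth, and every member is a copy of it instead of an independently configured server.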

One of the primary benefits of using Application Center is the ease with which you can add nodes to the cluster, manage existing nodes, deploy new content, and manage the cluster overall. Another benefit is that Application Center handles all of the configuration for NLB for you, which saves a considerable amount of work. You can run the Application Center Administrator on all Windows 2000 platforms, as well as on Windows NT 4.0 Server or Workstation, to manage the cluster, which gives you a good range of options for managing your Application Center clusters.

Before you jump on the Application Center bandwagon and start planning your deployment, however, you need to think about the potential costs. Microsoft uses per-CPU licensing for Application Center at a retail cost of $2,999 per CPU. This means each dual-processor server in a cluster will require two licenses at a cost of almost $6,000. Multiply that by the number of servers in your Web farm, and you can begin to see that Application Center requires a considerable financial commitment. Given its flexibility and power, however, it can be a bargain if server availability is critical to your operation.
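The licensing arithmetic is easy to check with a short Python calculation. The per-CPU price is the retail figure quoted above. The eight-server farm is a hypothetical size chosen only to show how the total scales.

```python
PRICE_PER_CPU = 2999  # retail price per CPU quoted in this article, in US dollars

def license_cost(servers, cpus_per_server, price_per_cpu=PRICE_PER_CPU):
    """Total license cost when every CPU in every server needs a license."""
    return servers * cpus_per_server * price_per_cpu


print(license_cost(1, 2))   # 5998  (one dual-processor server)
print(license_cost(8, 2))   # 47984 (hypothetical farm of eight dual-processor servers)
```

Because the cost is linear in both the server count and the CPU count, doubling either one doubles the licensing bill.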

Conclusion
A network isn’t very useful if users can’t access the data that you’re storing on your servers. Although you can provide redundancy built into the server with things like RAID, sometimes that’s not enough. Sometimes it’s useful to have completely redundant servers. Fortunately, you can provide such redundancy by using the clustering services that are available for your Windows servers.