Providing uninterrupted application and resource availability is a key requirement for many businesses, and server clustering is a primary means for achieving that goal. In my article “How to set up an MSCS cluster,” I explained how to install the Microsoft Cluster Service (MSCS) and get a server cluster off the ground. Setting up the cluster nodes and forming the cluster is just the first part of the process, however. You also need to configure clustered services and applications, and then test and monitor the cluster.
What does MSCS do again?
The MSCS, which is included with Windows 2000 Advanced Server and Datacenter Server, allows you to build server clusters to provide high availability for applications and other network resources, such as file shares. Advanced Server supports clusters of two nodes (servers), and Datacenter Server supports clusters of up to four nodes. MSCS can use the Network Load Balancing (NLB) service, which is also included with both platforms, to provide load balancing for IP-based services. MSCS does not provide dynamic application load balancing, however.
Administering the cluster
The Cluster Administrator, found in the Administrative Tools folder, is your control center for configuring and managing the cluster. Although it looks like an MMC console, it’s actually a stand-alone application, so it can’t be integrated with your other MMC-based management tools. Even so, you can perform almost all cluster-management tasks from the Cluster Administrator.
When you open the Cluster Administrator, you’ll see a two-pane, MMC-like interface. If you open Cluster Administrator from a node in the cluster, the application displays that cluster by default, but you can connect to and manage other clusters on the network as well. Cluster Administrator runs under Windows NT and all Windows 2000 platforms.

You’ll find the Cluster Administrator in the %systemroot%\cluster\ folder on each cluster node. You can also copy the contents of that folder to a workstation and run the Cluster Administrator from there by executing Cluadmin.exe. To connect to a cluster, open Cluster Administrator and choose File | Open Connection. Cluster Administrator prompts you for the cluster name or the name of a node in the cluster; you’ll need connectivity to, and name resolution for, the network in which the cluster resides.
One of the tasks you’ll need to perform after you install the cluster nodes and form the cluster is to configure the services and applications that you want made available through the cluster. For example, you might want to set up clustering for a distributed file system (Dfs) share, print server, Web server, or application such as Exchange Server or SQL Server. You accomplish much, if not all, of this configuration through the Cluster Administrator. Before you jump in, however, consider the following issues.
MSCS does not by itself make applications cluster-capable. Applications that are not cluster-aware can’t take advantage of the high-availability features afforded by clustering, but they can still run on a node in the cluster. For example, you might have a server application that does not support clustering or failover, but you can still run it on a server node. If the server fails, however, the nonclustered application will be unavailable for client access until the server comes back online. However, running nonclustered applications alongside cluster-aware applications on a node lets you take advantage of server resources and potentially reduce hardware investments.
Cluster-aware applications are those that recognize the cluster and can react to cluster events, such as a node shutting down. For example, when the primary node for an application goes offline, the instance of the application running on the secondary node receives notification of the failure, retrieves the current state data from the quorum disk, and takes over for the failed server. In most cases, users see at most a slight delay in processing.
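The takeover sequence just described can be sketched in Python. This is a purely illustrative model, not the Cluster API; the class names and the quorum object are hypothetical stand-ins for the real notification and shared-storage mechanisms:

```python
class QuorumDisk:
    """Stand-in for the shared quorum disk that holds checkpointed state."""
    def __init__(self):
        self.checkpoint = None

    def write_state(self, state):
        self.checkpoint = state

    def read_state(self):
        return self.checkpoint


class ClusterAwareApp:
    """Illustrative secondary-node instance of a cluster-aware application."""
    def __init__(self, quorum):
        self.quorum = quorum
        self.active = False
        self.state = None

    def on_node_down(self, failed_node):
        # React to the cluster event: recover the last checkpointed
        # state from the quorum disk, then take over client service.
        self.state = self.quorum.read_state()
        self.active = True
        return f"took over from {failed_node}"


quorum = QuorumDisk()
quorum.write_state({"sessions": 42})      # written by the primary node
standby = ClusterAwareApp(quorum)
print(standby.on_node_down("NODE1"))      # prints "took over from NODE1"
```

Because the state was checkpointed to shared storage before the failure, the standby instance resumes from the last checkpoint rather than from scratch, which is why clients see only a brief delay.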
Some cluster-aware applications, such as Cluster Administrator, allow you to view and manage the cluster; others extend and monitor cluster services. The Cluster Automation Server, for example, provides a set of COM objects you can use to build cluster-aware applications and management tools. All cluster-aware applications sit above the Cluster API and the Cluster Service and interact with underlying components such as resource monitors, the cluster database, and the cluster network driver. Applications that are not cluster-aware don’t use these interfaces, which is why they can’t take advantage of the benefits offered by the Cluster Service.
To begin managing a cluster, you first need to understand the objects with which you’ll be working. I covered two cluster objects—network interfaces and network connections—in “Preparing to install an MSCS Cluster.” The networks provide for internal communication between nodes and external communication with clients. One of the tasks you can perform through the Cluster Administrator is to manage the network interfaces and related settings for the cluster.
Nodes are another type of cluster object. As you’re probably already aware, a server node is a computer running Windows 2000 Advanced Server or Datacenter Server with the Cluster Service installed, and it is, by definition, a member of a cluster. All of the nodes in a cluster share a common cluster name, although each has its own computer name. Nodes communicate with one another to share state data and can detect the cluster resources running on other nodes. Cluster Administrator lists each node in one of the following states:
- Up—The node is online and functioning normally in the cluster.
- Down—The node is not participating in the cluster. (Note that the node could still be online.)
- Joining—The node is in the process of joining the cluster to become an active node.
- Paused—The node is online and active but can’t take ownership of resource groups or bring resources online.
- Unknown—Cluster Administrator can’t determine the node’s condition, possibly because of a server or communications failure.
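These states, and the hosting rule attached to them, can be modeled with a simple enumeration. This is a conceptual sketch in Python, not the Cluster API; it encodes the rule described above that only a node in the Up state can take ownership of resource groups:

```python
from enum import Enum

class NodeState(Enum):
    """Node states as reported by Cluster Administrator."""
    UP = "Up"
    DOWN = "Down"
    JOINING = "Joining"
    PAUSED = "Paused"
    UNKNOWN = "Unknown"

def can_host_groups(state):
    """Only an Up node can take ownership of resource groups; a Paused
    node is active in the cluster but can't host groups or bring
    resources online."""
    return state is NodeState.UP

print(can_host_groups(NodeState.UP))      # prints True
print(can_host_groups(NodeState.PAUSED))  # prints False
```

The Paused case is the interesting one for administration: you can pause a node before maintenance so that it stays in the cluster but stops accepting groups.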
A cluster resource is a physical or logical cluster component, such as a physical disk, a DHCP or WINS service, a print spooler, a file share, or an application. The active node can bring resources online or take them offline, but a resource can be hosted and managed by only one node at a time. In addition, one resource can be dependent on another. For example, an application that needs to use a clustered disk depends on a physical disk resource; the application is therefore the dependent resource. The Cluster Service takes dependent resources offline before it takes offline the resources on which they depend.
In a similar fashion, the Cluster Service brings dependent resources online after bringing up the resources on which they depend. For example, a clustered application that stores its data on a shared disk is dependent on that disk. So, the Cluster Service brings the disk online first, then the application. This ensures that the required resources are available when the dependent resource comes online.
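This ordering rule amounts to a topological sort of the resource dependency graph. Here is a minimal Python sketch of the idea; the resource names are hypothetical examples, and this is not how the Cluster Service itself is implemented:

```python
def online_order(depends_on):
    """Return an order for bringing resources online: every resource
    appears after the resources it depends on. The offline order is
    simply the reverse (dependents are taken down first)."""
    order, visiting, done = [], set(), set()

    def visit(res):
        if res in done:
            return
        if res in visiting:
            raise ValueError(f"circular dependency at {res}")
        visiting.add(res)
        for dep in depends_on.get(res, []):
            visit(dep)          # bring dependencies up first
        visiting.discard(res)
        done.add(res)
        order.append(res)

    for res in depends_on:
        visit(res)
    return order

# A hypothetical file-share group: the share depends on a network name
# and a disk, and the network name depends on an IP address resource.
deps = {
    "FileShare": ["NetworkName", "PhysicalDisk"],
    "NetworkName": ["IPAddress"],
    "IPAddress": [],
    "PhysicalDisk": [],
}
print(online_order(deps))
```

For this graph, the disk and IP address come up before the network name, and the file share comes online last, exactly as the dependency rule requires.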
A group is another type of cluster object. Groups are collections of cluster resources and define failover behavior for those resources. If a resource in a group fails, MSCS moves the entire group to which the resource belongs to another node. A resource is owned by only one group at a time, and a group is owned by only one node at a time. Usually, a group contains related and/or dependent resources, but you can create groups that contain unrelated resources to balance resource usage and server load and to simplify administration. Each group maintains a prioritized list of the nodes that can host the group. This list allows the Cluster Service to determine which node should host a group and which node should take over if the active node goes offline.
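Selection from that prioritized list can be sketched as follows. The node names and state strings here are illustrative, and the function is a conceptual model rather than actual Cluster Service code:

```python
def next_host(preferred_nodes, node_states):
    """Walk the group's prioritized preferred-owner list and return
    the first node that is currently able to host the group."""
    for node in preferred_nodes:
        if node_states.get(node) == "Up":   # Paused/Down nodes can't host
            return node
    return None                             # no eligible host; group stays offline

states = {"NODE1": "Down", "NODE2": "Paused", "NODE3": "Up"}
print(next_host(["NODE1", "NODE2", "NODE3"], states))  # prints NODE3
```

Note that the list is ordered by priority, not by availability: NODE3 is chosen only because the two higher-priority nodes can't host the group.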
The Cluster Service relies on virtual servers, another type of cluster object, to enable services and applications to fail over to other nodes. A virtual server is a group that contains the information clients need to reach the service, including a network name resource, an IP address resource, and the applications served by the virtual server.
Understanding failover and failback
Failover is the process by which a clustered application or resource is transferred to another node in the event the active node goes offline or fails. Failback is the process of restoring the application or resource to its original node when that node comes back online.
If an application fails on the active node but the node itself is healthy, the Cluster Service usually attempts to restart the application on the original node. Failing that, the Cluster Service fails over the application, moving its resources to an alternate node and restarting the application there. The Cluster Service initiates a failover if the active node becomes inactive (for example, by going offline or hanging), if a resource failure causes its group to fail, or if an administrator forces a failover, as you might when you need to perform an upgrade on the active node. You can configure the properties that determine when the Cluster Service attempts a failover and the criteria it uses to accomplish it.
When a failover does occur, the Cluster Service takes offline all the resources in the affected group, in an order dictated by their dependencies. This allows functioning resources to complete operations such as disk writes before they are taken down. When the Cluster Service has successfully taken all of the resources offline, it transfers the group to the node specified in the prioritized list of preferred nodes for the group.
When the resources have been transferred, the Cluster Service brings them online, again in an order determined by their dependencies. The failover process is complete when all resources are back online on the alternate node and clients can access the application. The Cluster Service attempts the failover a given number of times, as defined by the failover policy for the group. Through the Cluster Administrator, you can control the properties that define that failover policy.
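The restart-then-fail-over behavior described above can be modeled roughly in Python. Everything here is hypothetical: the class, the method names, and the retry limit stand in for the resource and group properties you actually set through Cluster Administrator:

```python
class Group:
    """Illustrative resource group with a prioritized preferred-owner list."""
    def __init__(self, owner, preferred_owners, restartable=False):
        self.owner = owner
        self.preferred_owners = preferred_owners
        self._restartable = restartable

    def restart(self):
        # True if the application comes back up on the current node.
        return self._restartable


def handle_failure(group, node_states, restart_attempts=3):
    """First try restarting on the current node; if the restarts are
    exhausted, move the group to the next preferred node that is Up."""
    for _ in range(restart_attempts):
        if group.restart():
            return group.owner           # recovered in place, no failover
    for node in group.preferred_owners:  # prioritized order
        if node != group.owner and node_states.get(node) == "Up":
            group.owner = node           # group fails over to this node
            return node
    return None                          # no eligible host


g = Group("NODE1", ["NODE1", "NODE2"])
print(handle_failure(g, {"NODE1": "Up", "NODE2": "Up"}))  # prints NODE2
```

Failback follows the same mechanics in reverse: when NODE1 returns to the Up state, the group can be taken offline on NODE2 and moved back to the higher-priority owner.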
When a cluster node comes back online, the Cluster Service can fail back the group(s) formerly hosted by that node. The Cluster Service uses the same process for failback as it does for failover, taking down the resources on the alternate node, moving the group back to the primary node, and restarting it there.
Deciding on a cluster model
Before you dive into cluster configuration, you need to examine your applications and services and decide how to structure your cluster. You have a couple of models from which to choose depending on the makeup of your applications and services and which ones need to be protected by failover.
Although you could install MSCS on a single computer and use it to organize resources through virtual servers, this method provides no redundancy and does not address resource availability. If the server goes down, its applications and other resources go down with it. Although a single-node cluster is possible, it hardly seems worth the trouble of setting up the Cluster Service simply as an administrative tool, so let’s assume you’ll have at least two servers in your cluster.
In the first model, which I’ll call a distributed cluster, you divide each resource or application among the nodes in the cluster, with each node handling a portion of the workload. Each node is therefore primary for its share of the workload and is configured to fail over to another, secondary node. The benefit of this model is that in addition to high availability, you also achieve static load balancing of the application or resource. For example, assume you decide to use this model for Exchange Server. You install a cluster of two nodes, with each server hosting half of the total users. Each node is designated for failover to the other, so in the event of a failover, one node takes on responsibility for all users.
In the second, or hot spare model, a single node performs all server functions, hosting all clustered applications and data. The secondary node sits idle unless a failover occurs, in which case the secondary node takes on the full load from the primary node. This model offers the best performance in the event of a failure because the secondary node is not tasked with handling its original workload in addition to the load from the failed server, assuming the server nodes are comparable in terms of capacity. The downside is that you have a server—probably an expensive one—idle 95 percent of the time. For mission-critical operations, however, this is the best solution.
You can also create a hybrid model in which certain applications or resources are configured to fail over to a secondary node while others are not. In this model, you configure only critical resources to fail over to secondary nodes; noncritical applications and resources do not fail over. The benefits of this model include a reduced load on the secondary node, quicker failover, and the ability to make use of what would otherwise be an idle server.
The model you choose depends on the applications and resources you need to make available to clients, as well as which ones need to be protected by failover, which ones can be offline for a period if need be, the capabilities of your server hardware, and several other factors. Before you start setting up your cluster, examine these issues and decide exactly what you want the cluster to do for each application and resource in the event of an unplanned outage or scheduled upgrade. With that road map in hand, you can begin setting up applications on the server nodes and establishing the cluster failover policies for each group of resources.