High availability (HA) is the buzzword of the day, but what does it mean? Often applied to Web services, it refers to the capability of a network service to “be there when you need it”—regardless of hardware or software problems.
HA may sound suspiciously similar to an old, more familiar friend: fault tolerance. And the two have something else in common: Both can be achieved by implementing intentional, carefully planned redundancy in a system. Fault-tolerant disk solutions, such as RAID, use multiple disks that are managed together and can continue to function after the failure of one disk. Likewise, the most popular means of providing HA is clustering—the use of multiple servers, managed as an entity that will continue to function if one server goes down.
In this Daily Feature, we will take a look at what server clustering is, how to determine whether you need it, and what it can do for your network.
What is server clustering?
A server cluster, also called a server farm, consists of two or more computers that function, are managed, and appear to the network users as a single entity. Each of the computers that belong to the cluster is called a node.
What a server cluster does
Clustering makes mission-critical applications available when needed. These include:
- Web services
- Databases, accounting programs, and other essential business applications
- File and print services
- E-mail and other communication services
Servers can be clustered to provide only fault tolerance, or the cluster can be configured for high performance by spreading the processing load across multiple machines. In the second model, often called active/active, all the nodes are online simultaneously. In the first, often called active/passive and the one we will focus on here, only one node, called the primary node, is online. The secondary node “sits and waits” until the primary node fails, at which point it takes over and continues to provide the resources. The whole process is transparent to the users. Let’s look at the details of this process.
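The active/passive model can be sketched in a few lines. This is purely illustrative pseudocode in Python, not any vendor's actual cluster service; the class and node names are assumptions.

```python
# Minimal sketch of an active/passive (primary/secondary) cluster.
# Class and node names are illustrative, not from any real product.

class Node:
    def __init__(self, name):
        self.name = name
        self.healthy = True

class ActivePassiveCluster:
    def __init__(self, primary, secondary):
        self.primary = primary
        self.secondary = secondary

    def serving_node(self):
        # The secondary "sits and waits" until the primary fails.
        return self.primary if self.primary.healthy else self.secondary

cluster = ActivePassiveCluster(Node("node-a"), Node("node-b"))
assert cluster.serving_node().name == "node-a"  # primary serves while healthy
cluster.primary.healthy = False                 # simulate a primary failure
assert cluster.serving_node().name == "node-b"  # secondary takes over
```

Clients keep addressing the cluster as a single entity; only the node answering behind the scenes changes.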
How clustering works
To implement clustering, you must have at least two servers that are connected to the same network and running clustering software (discussed later in this article). Depending on the method used, you may also need a shared disk or disk array, to which both servers are connected.
There are three methods of making the information on disk (application programs, user data) available to all the servers in the cluster. The earliest method was called shared disks, which required special cables and switches. Its biggest drawback was that applications had to be modified to run in this configuration.
Another method uses disk mirroring. Each of the servers has a separate physical disk, and software copies all data written to the disk on one server to the disks of the other nodes. Mirrored disks provide great redundancy but are somewhat difficult to manage. The biggest problem is the replication window: until a write has been copied to the other disks, the nodes do not all have identical data, and a failure during that window leaves the copies inconsistent.
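The mirroring approach, and the replication window it creates, can be sketched as follows. This is a hypothetical illustration; the class name and in-memory "disks" stand in for real block devices.

```python
# Illustrative sketch of software disk mirroring: every write to the
# primary disk is copied to each mirror. The names are hypothetical.

class MirroredDisks:
    def __init__(self, node_count):
        self.disks = [dict() for _ in range(node_count)]

    def write(self, key, value):
        # Write to the primary disk first...
        self.disks[0][key] = value
        # ...then replicate to the other nodes' disks. A failure between
        # these two steps is the inconsistency window described above.
        for mirror in self.disks[1:]:
            mirror[key] = value

mirror_set = MirroredDisks(node_count=3)
mirror_set.write("orders.db", b"record-1")
# After replication completes, every node holds identical data.
assert all(d["orders.db"] == b"record-1" for d in mirror_set.disks)
```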
Microsoft and a number of other cluster services vendors use a method called shared nothing. Every node has its own disks, but if a failure occurs, the clustering software can “transfer ownership” of a disk. It has the same advantages as shared disks, but does not require special applications.
Your beating heart
All this is well and good, but for a clustering solution to offer HA, the software must automatically detect the failure of the primary node and transfer its responsibilities to a secondary node, without human intervention.
A popular means of detecting the “death” of a cluster node is by checking its “heartbeat.” The heartbeat is a high-speed link between the cluster members, over which they exchange status information and monitor one another’s activity. Each node typically has a NIC that is used for this connection (called “heartbeat traffic”) in addition to the NIC with which it connects to the network.
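The core of heartbeat monitoring is simple: each node records when it last heard from its peers and declares a peer dead once a timeout passes with no heartbeat. Here is a minimal sketch of that logic; the class name and timeout value are illustrative assumptions, not any product's implementation.

```python
# Sketch of heartbeat-based failure detection. Each node tracks the time
# it last heard from each peer over the heartbeat link and declares a
# peer dead after a configurable timeout. All names are illustrative.

class HeartbeatMonitor:
    def __init__(self, timeout=3.0):
        self.timeout = timeout
        self.last_seen = {}  # peer name -> time of last heartbeat

    def receive_heartbeat(self, peer, now):
        self.last_seen[peer] = now

    def dead_peers(self, now):
        return [p for p, t in self.last_seen.items()
                if now - t > self.timeout]

monitor = HeartbeatMonitor(timeout=3.0)
monitor.receive_heartbeat("node-b", now=100.0)
assert monitor.dead_peers(now=102.0) == []          # within the timeout
assert monitor.dead_peers(now=104.5) == ["node-b"]  # heartbeat missed
```

In a real cluster the heartbeats travel over the dedicated NIC-to-NIC link, and a missed heartbeat triggers the failover process described below.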
Do we have a quorum?
Another important issue, especially when there are more than two nodes in a cluster, is which node “owns” a resource and which one will take over providing that resource if the owning node fails.
One way to handle this, as implemented by Microsoft’s cluster services, is the concept of a quorum resource (a shared physical disk). Only one node can own this resource at a given time, but any node in the cluster has access to it. The disk used as the quorum resource must support hardware-based lockout. This disk is used to store the cluster log and database for cluster management.
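The essential property of a quorum resource is mutual exclusion: one owner at a time, with ownership able to move when the owner fails. A rough sketch of that behavior, with hypothetical names standing in for the hardware-based lockout:

```python
# Sketch of quorum-resource ownership: only one node may own the quorum
# disk at a time, mimicking hardware-based lockout. Purely illustrative.

class QuorumResource:
    def __init__(self):
        self.owner = None

    def try_acquire(self, node):
        # Succeeds only if unowned; the current owner retains the resource.
        if self.owner is None or self.owner == node:
            self.owner = node
            return True
        return False

    def release(self, node):
        if self.owner == node:
            self.owner = None

quorum = QuorumResource()
assert quorum.try_acquire("node-a")      # first node wins ownership
assert not quorum.try_acquire("node-b")  # locked out while node-a owns it
quorum.release("node-a")                 # e.g., node-a fails or steps down
assert quorum.try_acquire("node-b")      # ownership can now transfer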
Failover and failback
It is important to understand the terminology associated with clustering before you attempt to evaluate your clustering options or implement a clustering solution.
Failover occurs when a cluster resource fails. If a single application fails, the cluster service tries to restart it, first on the same node and then, if that doesn’t work, on another node. If the primary node goes offline, the cluster software fails over all of its applications. This happens automatically when the cluster software detects the failure of a node or application, but administrators can also initiate failover manually.
Failback refers to the ability of the cluster software, after a failover, to return resources to the node that originally hosted them once that node comes back online.
Clustering software allows administrators to set policies governing whether failback is allowed.
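The failover/failback sequence just described can be sketched as a small state machine. The class, the `allow_failback` policy flag, and the node names are all illustrative assumptions, not any vendor's API.

```python
# Sketch of failover and policy-controlled failback for one resource.
# All names (ClusterService, allow_failback) are hypothetical.

class ClusterService:
    def __init__(self, nodes, allow_failback=True):
        self.nodes = nodes                  # node name -> healthy?
        self.allow_failback = allow_failback
        self.hosting_node = next(iter(nodes))
        self.home_node = self.hosting_node  # original owner of the resource

    def handle_node_failure(self, failed):
        self.nodes[failed] = False
        if self.hosting_node == failed:
            # Fail over: move the resource to the first surviving node.
            self.hosting_node = next(n for n, up in self.nodes.items() if up)

    def handle_node_recovery(self, recovered):
        self.nodes[recovered] = True
        if self.allow_failback and recovered == self.home_node:
            self.hosting_node = recovered   # fail back to the original host

svc = ClusterService({"node-a": True, "node-b": True})
svc.handle_node_failure("node-a")
assert svc.hosting_node == "node-b"   # failover to the surviving node
svc.handle_node_recovery("node-a")
assert svc.hosting_node == "node-a"   # failback (policy permits it)
```

Setting `allow_failback=False` models the administrator policy mentioned above: the resource stays where it landed even after the original node recovers.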
Clustering software
There are a number of vendors providing clustering software. Some operating system vendors include clustering support in their high-end network operating systems, and third-party vendors offer add-on clustering solutions. The following are only a few of the many server-clustering implementations available.
Microsoft cluster services
Microsoft released its clustering solution, code-named “Wolfpack,” as Microsoft Cluster Server (MSCS) with Windows NT 4.0 Enterprise Edition. Microsoft also provided an API for the cluster service, which allowed third-party applications to fail over or restart on another cluster node when the primary node failed.
Windows 2000 Advanced Server and Datacenter Server include an updated and improved version of MSCS, called Windows 2000 Cluster Services. Advanced Server supports two-node failover, and Datacenter Server supports four-node failover. IP load balancing and COM-based applications are supported, as well.
More on MS
For more information about Microsoft’s cluster services, see the Microsoft Web site.
Novell cluster services
Novell NetWare 5.x supports clustering, using the NDS-enabled Novell Cluster Services. Up to 32 nodes can be deployed in a server cluster, and the product ensures distributed failover, even if multiple nodes fail. This keeps any individual node from becoming overloaded.
Clients are transparently reconnected to a remaining server when the node to which they are connected fails, and users’ mapped drives are even preserved. Novell uses what it calls “split-brain detection” to ensure that more than one surviving node will not attempt to mount the same disk volume of a node that has failed. The Cluster Interconnect Protocol (CIP) is used to verify node status.
More on Novell
For more information about Novell’s Cluster Services, see the Novell Web site.
NSI Software’s GeoCluster
GeoCluster is a third-party product that offers some advantages over other clustering implementations. GeoCluster is integrated with MSCS and provides a model in which each node has its own copy of all data, with real-time replication that occurs on a continuous basis over a LAN, VLAN, or SAN connection. Servers do not have to be located in close geographic proximity to one another.
For more information about GeoCluster, see the NSI Software Web site.
Cluster services for Linux and UNIX
There are many clustering products available to provide HA networking for Linux and UNIX servers, as well. These include:
- Piranha (Red Hat High Availability Server Project)
- TurboCluster for Linux
- SCO UnixWare 7 Nonstop Clusters
Is server clustering for you?
Many of the popular clustering solutions are expensive, and the requirement for redundant hardware adds to the cost. However, if you have network applications that absolutely, positively must be available, downtime can have a huge impact on your bottom line.
Evaluate the estimated loss in revenue and productivity of having your mission-critical applications and data offline and compare it to the cost of implementing a clustering solution. You may find that just a few hours or days without those services could equal or exceed the cost of creating an HA solution for your network. Not to mention the value of your peace of mind: priceless.