Disaster recovery (DR) systems focus on transferring your data in real time to another physical location so that the loss of a data center does not mean the loss of corporate data assets. High availability (HA) systems take this idea and implement it as a way of preserving server uptime during more common hardware and software failures.

The most common example of HA systems in the world of Windows network operating systems is the cluster, where two (or more) servers can stand in for each other as needed while maintaining the identity of one “virtual server” that can be passed back and forth between them. Similar systems exist in the UNIX space; however, I’ll focus here on the Windows world, since Windows-based file/print and other servers are generally far more likely to fail on their own, even when the underlying data-center infrastructure remains intact.

Working with MSCS
Although not the only solution available for creating an HA environment on Windows, the cluster is perhaps the longest-running methodology. Originally introduced in Windows NT Server 4.0, Enterprise Edition, Microsoft Cluster Service (MSCS) has evolved over the years into the versions now found in Windows 2000 Advanced Server and Datacenter Server.

Essentially, two servers (up to four in the Datacenter product) are attached to a common disk system. This can be anything from a simple SCSI array (on the low end) to a multiple-disk SAN solution (on the high end). Commonly, these links to the disk subsystem must be SCSI or Fibre Channel connections, which imposes maximum distance limitations between the two servers, or “nodes.”

In addition, you must ensure that all applications installed on the cluster are cluster-aware—that is, that they have been specifically designed to be managed by MSCS and will function normally when moved from one node to the other during an application failure.

There are products available today that can alleviate some of the more noticeable drawbacks with MSCS for HA. Available tools allow you to use local storage or independent disks attached to each node, while maintaining all other features of the cluster. This eliminates both the single point of failure (the single shared disk resource) and—to a great extent—the distance limitations inherent in the MSCS design.

Again, applications must be cluster-aware, even when using these extensions to the Microsoft solution. Another potential drawback is that both nodes must be on the same logical IP subnet, so if you want to stretch the cluster beyond the boundaries of the LAN, you will need to bridge or do some other form of IP translation.

Other clustering options
For those cases where application failover is not the primary concern, many tools exist that allow you to create cluster-like functionality without the expense and expertise requirements of an MSCS cluster. The most common example of this scenario is a file/print server. Essentially, any failure severe enough to take out file/print services would also keep the server from communicating at all, so application-level monitoring is not necessary. Server-level monitoring is therefore sufficient to protect the data and ensure high availability to client connections.

In this case, an underlying replication engine keeps the data files on the two servers synchronized (this is often part of third-party clustering software), while another module monitors the health of the primary server. If the primary server becomes unreachable, the secondary server can assume its IP address, NetBIOS name, and other vital identity information, so clients can immediately reconnect to their data without reassigning server names in client-side apps. They essentially won’t notice any change.
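The monitor-and-promote loop described above can be sketched in a few lines. This is a hypothetical illustration, not any vendor’s actual product: the health check and the promotion step are stand-ins for whatever the clustering software really does (ICMP heartbeats, seizing the IP address and NetBIOS name, and so on).

```python
class FailoverMonitor:
    """Hypothetical sketch of server-level monitoring with failover.

    check_primary and promote_secondary are caller-supplied stand-ins
    for the real heartbeat and identity-takeover mechanisms.
    """

    def __init__(self, check_primary, promote_secondary, max_misses=3):
        self.check_primary = check_primary        # callable: True if primary answers
        self.promote_secondary = promote_secondary  # callable: assume primary's identity
        self.max_misses = max_misses              # consecutive misses before failover
        self.failed_over = False

    def poll_once(self, misses):
        """Run one heartbeat check; return the updated consecutive-miss count."""
        if self.check_primary():
            return 0                              # primary healthy: reset the counter
        misses += 1
        if misses >= self.max_misses and not self.failed_over:
            self.promote_secondary()              # take over IP/NetBIOS identity
            self.failed_over = True
        return misses


# Simulated run: primary answers twice, then goes silent.
events = []
checks = iter([True, True, False, False, False])
monitor = FailoverMonitor(lambda: next(checks),
                          lambda: events.append("promoted"))
misses = 0
for _ in range(5):
    misses = monitor.poll_once(misses)
print(events)  # the secondary is promoted after three consecutive misses
```

Real products wrap this loop with timers, split-brain protection, and manual-failback controls, but the core decision is the same: N consecutive missed heartbeats trigger the identity takeover.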

With some exceptions, you can use these systems to extend this cluster-like functionality to multiple IP subnets, as long as you’re either failing over only NetBIOS names or are willing to install static routes in your routers. Either way, you can create HA systems that can be stretched to more than one physical location without losing protection.
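If you take the static-route approach, a persistent host route pointed at the failover subnet’s gateway is all that is required on each affected router or host. The addresses below are purely hypothetical, but the `route -p add` syntax is the standard Windows command:

```shell
REM Hypothetical example: the clustered "virtual" address 10.1.1.50
REM normally lives on subnet A; after failover to a node on subnet B,
REM add a persistent (-p) host route through subnet B's gateway.
route -p add 10.1.1.50 mask 255.255.255.255 192.168.2.1
```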

Of course, nearly all HA solutions are server- or host-based, which you have to think about when working with application-based servers. Many systems can allow one server to stand in for more than one primary machine, but there are severe limitations in cases of single-engine applications like Microsoft SQL Server and Exchange Server. (For information on clustering Exchange, see my article “Deploying GeoCluster to mitigate Exchange disasters.”)

The decision of how to create an HA plan should, of course, not be approached lightly. However, once a well-devised plan is put into place, it will significantly reduce downtime and increase the protection level of your digital assets.

When combined with a DR solution, HA can often create close to the mythical five nines (99.999%) uptime that most corporations are currently seeking. This scenario combines HA systems that pick up immediately if hardware or software fails and DR systems that can quickly restore vital data and IT functionality if a data center fails or succumbs to a physical disaster.
