Creating a server cluster is a great way to provide scalability and fault tolerance to server-based applications. Even so, the expense and complexity involved in creating a cluster mean that clusters warrant a lot of careful planning. In this article series, I will guide you through this planning process.
Before I begin
Before I get started, I want to mention that Windows Server 2003 supports at least three different types of clusters:
- Network Load Balancing Clusters
- Computational Clusters
- Server Clusters
When I refer to clusters in this article series, I will be talking about Server Clusters, not Network Load Balancing or computational clusters.
In case you are wondering, Network Load Balancing clusters are usually used to provide scalability or fault tolerance to Web-based applications. Therefore, if you are thinking of clustering a Web server, you will probably want to create a Network Load Balancing cluster rather than using the planning techniques that I will be discussing in this article series.
Computational clusters are a special type of cluster in which multiple servers can work together to reduce the amount of time that complex calculations take to complete. These types of clusters are best suited to scientific or analytic applications, although I can't help but wonder how a computational cluster could be put to work for playing video games.
Server clusters, such as the ones that I will be discussing in this article series, are a general-purpose type of cluster. They are best suited to hosting servers that are running database applications.
An introduction to cluster planning
Clusters are fairly common, so you might be wondering why they merit so much planning. The reason lies in the very nature of server clusters. As I mentioned earlier, server clusters are best suited to servers that are hosting database applications. When a server is hosting a database application, transactions are typically made against the database very frequently. These transactions tend to be problematic for a cluster.
Imagine, for example, that you have three servers that are all hosting the same database application. In a situation like this, you never know which server a user is going to connect to. If the user were to add or modify a database record, the update would be applied to the copy of the database running on the server that the user connected to. The other two servers hosting the same application would remain blissfully unaware of the update. As you can see, even if each server were to start out with an identical copy of the database, it would not take long for the three servers to each have very different versions of the data.
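To make that divergence concrete, here is a minimal Python sketch. The node names, the record, and the update pattern are all hypothetical, purely for illustration:

```python
# Hypothetical illustration: three servers each start with an identical
# copy of the same record, but each update only lands on whichever
# server the user happens to connect to.
import random

servers = {name: {"balance": 100} for name in ("node1", "node2", "node3")}

for update in range(5):
    # Each user request lands on an arbitrary node...
    node = random.choice(list(servers))
    # ...and the write is applied only to that node's local copy.
    servers[node]["balance"] += 10

print(servers)
# Possible output: {'node1': {'balance': 120}, 'node2': {'balance': 110},
#                   'node3': {'balance': 120}} -- the copies have diverged.
```

Even though every node began with the same data, a handful of routine updates is all it takes for the three copies to disagree.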
As you can see, continuity of data is a major concern for clustered database application servers. Clustering a database application server simply will not work unless there is a way of guaranteeing that each server in the cluster always has access to the same data set.
When Microsoft created server clusters, there were a couple of different ways that they could have taken care of the data continuity issue. One option would have been to immediately replicate any changes to a database to all of the other copies of the database. Ultimately though, Microsoft chose not to use this approach because it has several problems.
First, latency has the potential to affect the database's integrity. For example, what would happen if someone were to update a database record, and then another user were to make a different change to the same record, but on a different server, before the first change could be replicated?
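As a rough sketch of that lost-update problem (again, the record and values are made up for illustration):

```python
# Hypothetical lost-update scenario: two users modify the same record on
# different nodes before replication runs, and the replication pass that
# happens last silently overwrites the other change.
node_a = {"phone": "555-0100"}
node_b = {"phone": "555-0100"}

node_a["phone"] = "555-0199"   # user 1 updates the record on node A
node_b["phone"] = "555-0142"   # user 2 updates the same record on node B

# Naive "last writer wins" replication: node B's copy clobbers node A's
# change, and nobody is ever told that a conflict occurred.
node_a.update(node_b)
print(node_a)   # {'phone': '555-0142'} -- user 1's change is simply gone
```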
Another issue is bandwidth. In an environment in which changes are being made to a data set on a frequent basis, replicating those changes between multiple servers can consume a tremendous amount of bandwidth.
Because of these and other issues, Microsoft chose not to use the replication method for server clusters (although other Microsoft products use this method to maintain consistency between multiple copies of a non-clustered database). Instead, Microsoft chose to solve the data continuity problem by having all cluster nodes share a single copy of the database. Typically, this means that each node in the cluster has a direct connection to a centralized storage system that is shared by all of the cluster nodes.
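By way of contrast, here is the same toy scenario under the shared-storage model. Because there is only one copy of the data, every node sees every prior update:

```python
# Illustrative contrast: with shared storage there is only ONE copy of
# the data, so whichever node handles a request sees all earlier writes.
shared_store = {"balance": 100}

for node in ("node1", "node2", "node3"):
    shared_store["balance"] += 10   # every node writes to the same store

print(shared_store)   # {'balance': 130} -- no divergence to reconcile
```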
It is this shared storage that makes designing a cluster both complex and expensive. One of the reasons why so much planning is in order is that clusters are designed to be fault tolerant. It is difficult for a cluster to be truly fault tolerant, though, if all of the nodes share a common storage system.
Of course, you can mitigate the effects of a hard disk failure by implementing a RAID array (which is also a good idea from a performance standpoint). However, just implementing a RAID array does not guarantee true fault tolerance. For example, suppose that the RAID array's power supply were to fail. This failure would effectively undermine the cluster's fault tolerant capabilities. Even if the storage system contained fully redundant hardware and was on a backup generator, there would still be things that could happen that would cause the storage system to fail.
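It helps to run the availability arithmetic to see why. In a rough sketch like the one below, components that can each take the storage system down sit in series, so their availabilities multiply, and the weakest non-redundant part caps the whole system. All of the figures are invented for illustration:

```python
# Hypothetical availability figures -- components that can each take the
# storage system down sit "in series", so their availabilities multiply.
raid_array   = 0.9999   # disks protected by RAID
power_supply = 0.999    # single, non-redundant power supply
san_switch   = 0.999    # single switch between the nodes and the storage

system = raid_array * power_supply * san_switch
print(f"Combined availability:   {system:.4%}")                     # ~99.79%
print(f"Expected downtime/year: {(1 - system) * 8760:.1f} hours")   # ~18 hours
```

No matter how good the RAID array is, the single power supply and the single switch drag the storage system, and therefore the whole cluster, down to their level.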
One example of such a situation is that the facility containing the storage system could be destroyed by a hurricane, a fire, or some other disaster. A less dramatic, but equally disruptive, situation involves database corruption. If the database were to suddenly become corrupt for some reason, the cluster would come to a grinding halt until you restored the database from a backup.
The point is that no matter how much work you put into planning a cluster, there are always going to be situations beyond your control that could theoretically cause the cluster to fail. Therefore, one of the most important considerations in planning a cluster should be whether the cost of downtime warrants the cost of the cluster.
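One way to frame that question is with simple break-even arithmetic. The sketch below uses entirely hypothetical figures; the point is the method, not the numbers:

```python
# Hypothetical break-even calculation: does the cluster pay for itself?
downtime_cost_per_hour  = 5_000    # lost revenue/productivity per hour down
expected_downtime_hours = 20       # per year, without a cluster
residual_downtime_hours = 2        # per year, with a cluster

annual_exposure_without = downtime_cost_per_hour * expected_downtime_hours
annual_exposure_with    = downtime_cost_per_hour * residual_downtime_hours
annual_savings = annual_exposure_without - annual_exposure_with   # $90,000

cluster_cost = 250_000             # hardware, software, SAN, licensing
print(f"Payback period: {cluster_cost / annual_savings:.1f} years")   # ~2.8
```

If the payback period stretches well beyond the useful life of the hardware, the cluster is hard to justify on downtime savings alone.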
I have been in IT long enough to know that people often insist that no amount of downtime is acceptable, even though that statement is sometimes unrealistic. That being the case, what I'm about to tell you probably won't be very popular with most of you who are reading this. Even so, I firmly believe that it is very important to look at the cost of implementing a cluster and the cost of downtime from a business standpoint, not just from the standpoint of some idiot manager who tells you that downtime is completely unacceptable.
Facility considerations
To see why this is the case, let's go back to my example in which a hurricane destroys the facility containing the cluster storage system. For enough money, you could create a backup storage system in another city (preferably away from the coast). You could also devise a method of keeping that backup storage system somewhat current, and automatically rerouting the cluster to use the backup storage system in the event that the primary storage system fails. Obviously, this type of failover system would cost a huge amount of money.
Suppose, however, that the database application that the cluster is hosting is not something that's accessible over the Internet. Instead, it hosts a business-critical database that is only used internally by employees of your company. If all of your employees work in the same city where the primary storage system resides, then having a backup storage system in another city is probably pointless. If the city were to be devastated by a hurricane, what do you think the chances are of the main office even having electricity? Even if you were to bring in generators and some of the computers were still functional, what do you think the odds are that employees would actually come into the office? If you have never been through a hurricane, I can tell you that the chances of the employees coming to work are pretty much zero, because of mandatory evacuations.
My point is that it is possible to spend a fortune on underlying infrastructure that would allow a cluster to continue to function during times of crisis, but doing so is not always a wise investment. It does little good to spend good money ensuring that a cluster never fails if no one is even able to benefit from the cluster's availability in times of crisis.
Other cost considerations
Having said that, let's look at the cost issue from a different perspective. Suppose for a moment that a server at your company's headquarters hosts a database application that is considered to be mission critical for the company. Because the database is so important, the company decides that it would be wise to cluster the server in an effort to increase its performance and to provide a degree of fault tolerance. Let's also assume that users from other offices in other cities also use the database and that these users place a considerable strain on your WAN link by accessing the database application from a remote site.
If that were the case, then it might seem logical to create a geographically dispersed cluster. By doing so, you could allow users to access the application from servers located within their own facility, thus improving the end user experience and relieving the congestion on your WAN link.
Setting aside fault tolerance issues, the biggest problem with this particular deployment scenario would be the cost. First, you would have the expense of purchasing server hardware and software for each cluster node. The cost of server hardware and software would pale in comparison to the connectivity cost, though.
Even in a geographically dispersed cluster, all of the cluster nodes must maintain reliable, high-speed connectivity to the central data store. This means that you would have to construct a Storage Area Network (SAN) that would be used to connect the various sites to each other. In addition, you would still need to keep your existing WAN links so that users at remote sites would be able to access other types of data from servers located at the corporate headquarters.
What all of this boils down to is that you're going to have to make some big decisions weighing the cost of the cluster against its benefits. On one hand, creating this type of cluster would accomplish the goal of relieving the congestion on your WAN link and improving the end user experience. It would not, however, provide true fault tolerance to users in the remote locations.
You could use redundant hardware to improve the fault tolerance level of cluster nodes in remote locations. That way, if a cluster node were to fail then another server in the remote location would continue to function, and the users would probably be none the wiser that a failure had occurred.
The problem is that if the link between a remote site and the corporate headquarters were to fail, then the cluster itself might as well have failed (at least from the perspective of users at the remote location). To provide true fault tolerance, you would need redundant SAN connectivity, which would increase the cost of the project exponentially.
Keep an eye on the big picture
With this in mind, let's take another look at our original goals. Originally, our goals were to relieve some of the congestion on the WAN link and to improve the end user experience. A geographically dispersed cluster could definitely achieve these goals if constructed properly. At the same time, though, these goals could also be achieved at a much lower cost by investing in a higher-speed WAN link. If you wanted to implement a degree of fault tolerance for the users at the remote locations, you could even invest in redundant WAN links, which would still probably cost a lot less than creating a geographically dispersed cluster.
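To put rough numbers on that comparison, consider a sketch like the following. Every figure in it is hypothetical; real SAN and WAN pricing varies enormously, so you would substitute quotes from your own vendors:

```python
# Hypothetical five-year cost comparison of the two approaches,
# assuming three remote sites. All prices are invented for illustration.
years = 5

# Option 1: geographically dispersed cluster
node_hardware_software = 4 * 30_000      # extra cluster nodes and licenses
san_links_per_year     = 3 * 120_000     # inter-site SAN connectivity
dispersed_cluster = node_hardware_software + san_links_per_year * years

# Option 2: upgraded, redundant WAN links
wan_links_per_year = 3 * 2 * 24_000      # two high-speed links per site
redundant_wan = wan_links_per_year * years

print(f"Dispersed cluster: ${dispersed_cluster:,}")   # $1,920,000
print(f"Redundant WAN:     ${redundant_wan:,}")       # $720,000
```

Under these made-up assumptions, the redundant WAN links meet the original goals at a fraction of the cost, which is exactly the kind of result this sort of back-of-the-envelope comparison is meant to surface before you commit to a design.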
After reading this article, you might have gotten the impression that I am against constructing clusters. In reality though, nothing could be further from the truth. I just don't believe that a cluster is necessarily the best solution for every project. If you're contemplating building a cluster, then it is very important that you weigh the cluster's benefits against its cost. You may find that constructing a cluster is cost prohibitive, or that a less expensive solution could just as easily meet your goals. Of course, there are plenty of situations in which a cluster is the ideal solution to an IT problem. In Part 2 of this article series, I will continue the discussion of planning a server cluster.
