The goal of just about every enterprise-level data system on the market is the elusive "five nines": 99.999% perceived uptime. Because five nines limits your unscheduled downtime to roughly five minutes per year, most applications simply cannot provide anywhere near the level of protection and redundancy required to reach this lofty goal. Web sites, however, can conceivably be configured to provide a redundancy level approaching (if not meeting) 99.999% uptime, provided you’re willing to create and maintain the proper set of disaster recovery (DR) and high availability (HA) systems.
Defining the five nines
Before we go into how to achieve the holy grail of high availability, we should first define five nines of uptime. The measurement of five nines is for perceived uptime, meaning that if a client makes a request to a data system, the client is presented with an appropriate response. In this case, they get whatever Web page they were expecting to find. Since Web servers and Web services can be distributed, the easiest way to ensure perceived uptime is to have more than one server that handles the same data, with some load-balancing system in place to make sure that if a server fails, requests are shuttled to the next available balanced machine.
Five nines also pertains only to unscheduled downtime, so you can—and absolutely should—schedule regular maintenance windows to allow for upgrades, hot fixes, and software updates.
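To make the definition concrete, here is a minimal Python sketch of the arithmetic behind each availability level. The function name and the choice of an average 365.25-day year are my own assumptions for illustration, and the result covers unscheduled downtime only, as described above.

```python
# Assumption: an "average" year of 365.25 days, which accounts for leap years.
MINUTES_PER_YEAR = 365.25 * 24 * 60


def downtime_budget_minutes(availability_percent: float) -> float:
    """Return the unscheduled downtime allowed per year, in minutes."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be a percentage between 0 and 100")
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


# Compare two, three, four, and five nines side by side.
for level in (99.0, 99.9, 99.99, 99.999):
    print(f"{level}% uptime allows {downtime_budget_minutes(level):.2f} minutes/year")
```

At 99.999%, the budget works out to a little over five minutes per year, which is why scheduled maintenance windows have to be kept out of the calculation.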
Now that I’ve defined the five nines for Web sites, I can start discussing how to achieve the goal. As mentioned, step one is to load balance multiple servers responsible for the same data. Companies like Cisco and F5 offer hardware-based solutions for providing load balancing regardless of how big your Web server farm gets. Microsoft, with Windows 2000 and more so with Windows 2003, provides built-in load balancing for Internet Information Services (IIS) for small to mid-tier Web farms. The Datacenter Edition can even cover enterprise-level Web farms, but at a very high cost. If you’re using Linux or a variant of UNIX, Apache and other Web servers can handle more of their own load balancing than Windows can, but still with far less control than a hardware-based balancing solution would provide.
Load balancing will help to ensure that even if one or more servers go down on you, a server will still be available to handle incoming requests. There are, however, limits to the technology. The most important limitation is that state-aware processes, like CGI scripts and server-side includes, will lose their data when the server goes offline, so clients will have to restart the process they were working through when they reconnect to the next load-balanced server. If data for these servers is kept on storage devices that are replicating data between themselves, the transactions the user has already completed will be safe, but any transactions being worked on during a fail-over will probably be lost.
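The failover behavior described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how any particular hardware or software balancer works; the class name, the server names, and the `is_healthy` callback are all hypothetical, and a real balancer would run its own health checks against the servers.

```python
from itertools import cycle
from typing import Callable, Iterable


class RoundRobinBalancer:
    """Hand out servers in rotation, skipping any that fail a health check."""

    def __init__(self, servers: Iterable[str], is_healthy: Callable[[str], bool]):
        self._servers = list(servers)
        if not self._servers:
            raise ValueError("at least one server is required")
        self._is_healthy = is_healthy          # hypothetical health-check hook
        self._rotation = cycle(self._servers)  # endless round-robin iterator

    def next_server(self) -> str:
        # Try each server at most once per request before giving up.
        for _ in range(len(self._servers)):
            candidate = next(self._rotation)
            if self._is_healthy(candidate):
                return candidate
        raise RuntimeError("no healthy servers available")


# Simulate a three-server farm in which web2 has gone offline.
down = {"web2"}
balancer = RoundRobinBalancer(["web1", "web2", "web3"], lambda s: s not in down)
picks = [balancer.next_server() for _ in range(4)]
print(picks)
```

Note that the sketch only redirects new requests. It does nothing to preserve the state of a process that was running on the failed server, which is exactly the limitation discussed above.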
Clustering is another technology that can help to protect Web sites. Windows clustering can handle anywhere between two and eight nodes, depending on the version and level of the operating system you’re working with. Linux and UNIX clustering is limited only by your technical expertise and the hardware and software you’re working with. The larger the cluster, the higher the cost, so keep in mind that your 30-server UNIX-clustered Web farm will be very pricey, though extremely redundant. The benefits of most clustering technologies include shared disk resources and automated failover for resources that go offline unexpectedly. The drawbacks are, ironically, the same two features: the shared disk can become a single point of failure, and resources can fail over automatically without sending up any alarm that something went wrong. There are many products specifically built to get around these drawbacks, as well as built-in features to rein in auto-failover and some of the other annoying issues that drive many enterprises away from clustering technologies.
However, since clustering doesn’t offer a gigantic improvement over load balancing when it comes to Web servers, many organizations choose to load balance instead of clustering at all. Which of these local high availability solutions you choose depends primarily on whether you have the hardware required to cluster, the level of expertise of the staff in charge of the systems, and your organization’s policies on HA and DR. Usually, some combination of both technologies is your best bet, with the most sensitive (or most disaster-prone, depending on your point of view) of your systems placed on load-balanced clusters for additional redundancy. Tools that create clusters without a shared disk array (available for Linux, UNIX and Windows) will further bulletproof the production systems against failure caused by any single thing going wrong.
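A rough way to see why redundancy pays off is the standard formula for components running in parallel: if each server is up a fraction a of the time, then n servers are all down at once with probability (1 - a)^n. The Python sketch below applies that formula. It rests on a strong assumption that server failures are independent, which a shared disk array, a shared power feed, or a shared data center would violate; the function name and the 99% figure are illustrative only.

```python
def combined_availability(single_server_availability: float, server_count: int) -> float:
    """Availability of a farm that is up whenever at least one server is up.

    Assumes failures are independent, which shared components would break.
    """
    if not 0.0 <= single_server_availability <= 1.0:
        raise ValueError("availability must be a fraction between 0 and 1")
    if server_count < 1:
        raise ValueError("server_count must be at least 1")
    all_down = (1.0 - single_server_availability) ** server_count
    return 1.0 - all_down


# Hypothetical servers that are each up 99% of the time.
for count in (1, 2, 3):
    print(f"{count} server(s): {combined_availability(0.99, count):.6%}")
```

Under the independence assumption, three 99% servers already clear the five-nines mark on paper. In practice the shared components are what pull the real number down, which is the argument for removing them where you can.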
So what happens if a lot of things go wrong at once? For example, let’s say the CEO is celebrating the opening of the new data center by lighting a cigar, which sets off the fire suppression systems. Whether you have water suppression or some form of fire-control gas system, the data center is coming offline in a major hurry. We no longer live in a world where an enterprise-class Web farm can afford the luxury of a single data center. The time of Remote Availability (RA) is here.
RA is the concept of HA stretched to different data centers—usually beyond the line-of-sight horizon of the production facility, so that natural or man-made disasters of some significant scale can still be survived. RA poses many challenges, not the least of which is simply failing over the Web farm to another location when needed. Web farms are, by their very nature, highly dependent on things like DNS records and IP addresses. While load balancing systems in both locations can minimize the number of IP addresses to keep track of, you will still have to get traffic flowing to the DR site in order for clients to connect to the failover Web servers—and you have to do it very quickly in order to preserve your uptime. There are a few ways to do this, both internally and in conjunction with your Internet Service Providers (ISPs).
Internally, you can make sure you have the ability to reroute DNS requests whenever necessary. Keep your Time to Live (TTL) parameters very short, so that clients are forced to check in with DNS on a regular basis (since some defaults are 72 hours, your uptime will be shot if you don’t modify your TTL attributes). The right value will depend on your organization and on how often and why people access your site, but 20 minutes will probably strike a good balance between bandwidth use and recovery time. Once the TTL is set, you will need to make arrangements to switch the DNS entries of the Web domain over to the new site on short notice. Since most enterprise operations do DNS routing in-house, this tends to be a policy change issue, but keep in mind you may need to talk to ISPs or other departments if they control your DNS records. In all cases, remember that the DNS system is highly distributed, and in spite of your best efforts, there will be a small number of clients who will have some trouble making the move to the new IP range automatically. Be ready to offer advice for flushing DNS caches when you fail over.
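To see how the TTL feeds into recovery time, consider the unluckiest client: one whose resolver cached the old record just before you changed it. The Python sketch below adds up the pieces for that worst case. The detection and switch times are hypothetical inputs you would measure for your own operation, and the model assumes resolvers honor the TTL, which not all caches do.

```python
def worst_case_dns_failover_minutes(
    ttl_minutes: float,
    detection_minutes: float,
    switch_minutes: float,
) -> float:
    """Longest time a client could keep hitting the failed site.

    detection_minutes: time to notice the outage (assumed, site-specific).
    switch_minutes: time to update the DNS records (assumed, site-specific).
    ttl_minutes: a record cached just before the update lives this long.
    """
    for value in (ttl_minutes, detection_minutes, switch_minutes):
        if value < 0:
            raise ValueError("durations cannot be negative")
    return detection_minutes + switch_minutes + ttl_minutes


# Example: the 20-minute TTL suggested above, with made-up operational times.
print(worst_case_dns_failover_minutes(ttl_minutes=20, detection_minutes=2, switch_minutes=3))
```

Most clients will fare better than this figure, since their cached records will be partway through their lifetime when the change is made, but the worst case is the one to plan your TTL around.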
If your ISP can work with you directly, setting up a VPN between the two data centers will eliminate the need to perform DNS changes in most cases. Set up the load balancing systems in the DR data center with the same IP addresses as those in the primary data center, but keep the DR site virtual IPs dormant until failover is required. Your ISP can reroute connections from one data center to the other in the event of a disaster (you may need to contract this separately), which means you need only bring up the virtual IP addresses in the alternate site, and you’re off and running again. While I'm on the topic of ISP interaction, now may be a great time to remind everyone that redundant pipes to the Internet from different ISPs are considered mandatory for any shop considering high availability. Shop around for the best deals, and remember that you can usually make do with a lower bandwidth pipe in an emergency.
Of course, all these solutions require that you have an identical, or nearly identical, setup in the DR center, and that you’ve been replicating data over to the alternate site. There are a large number of tools, both host-based and hardware-based, that can replicate in real time between data centers for Windows, UNIX and Linux. Using these tools will keep the data up-to-the-minute in your DR data center, and may even assist with failover by helping the DR machines assume IP addresses and server names and start up services and daemons.
Remember that replication requires bandwidth, and the amount you need is largely dependent on how much data changes each day and what type of replication you use. On the whole, host-based systems that can replicate at the byte-level use far less bandwidth than hardware-based systems. The drawback is that nearly all host-based systems are asynchronous, meaning that there’s a chance some data will be lost in-flight if a disaster occurs unexpectedly (and disasters don't usually occur on a schedule). However, since nearly all Web sites can easily recover from this minimal loss of data, host-based solutions are more than adequate for all but the most sensitive financial-type Web site applications and their data.
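As a back-of-the-envelope aid for sizing the replication link, the Python sketch below converts a daily change rate into an average bandwidth figure. Everything here is an assumption for illustration: the function name, the use of decimal gigabytes, the 20% protocol overhead factor, and the idea that changes are spread evenly over the window. Real traffic is bursty, so treat the result as a floor rather than a target.

```python
def required_bandwidth_mbps(
    changed_gb_per_day: float,
    replication_window_hours: float = 24.0,
    overhead_factor: float = 1.2,  # assumed 20% protocol overhead
) -> float:
    """Average link speed (megabits/second) needed to ship one day's changes."""
    if changed_gb_per_day < 0:
        raise ValueError("changed_gb_per_day cannot be negative")
    if replication_window_hours <= 0:
        raise ValueError("replication_window_hours must be positive")
    bits_to_send = changed_gb_per_day * 1000**3 * 8 * overhead_factor
    window_seconds = replication_window_hours * 3600
    return bits_to_send / window_seconds / 1_000_000


# Hypothetical site that changes 50 GB of data per day.
print(f"{required_bandwidth_mbps(50):.2f} Mbps averaged over 24 hours")
print(f"{required_bandwidth_mbps(50, replication_window_hours=8):.2f} Mbps if squeezed into 8 hours")
```

The `changed_gb_per_day` input is where the replication type matters: a byte-level host-based tool ships only the bytes that changed, so its daily figure will usually be much smaller than that of a system that ships whole blocks or volumes.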
Many tools can also support many-to-one failover for applications (where the app will support it) and can therefore help keep costs down overall. Several of these replication solutions also offer the ability to create cluster-like setups without using the clustering technologies provided by the operating systems themselves. This means that if corporate policy prohibits clustering for some reason, you can still get both local and remote availability via these software solutions.
Five nines of perceived uptime is not an impossibility, if the budget and brainpower required to put together the most effective solution are available. Working within the structure of your enterprise policies and procedures, you can create a solution that gives you the protection you need without either breaking the rules or breaking the bank.