High availability is, ultimately, the holy grail of the cloud. It embodies the idea of anywhere, anytime access to services, tools and data, and it enables visions of a future in which companies have no physical offices, or in which global companies run completely integrated and unified IT systems. Availability is also tied to reliability: a service that is nominally on 24×7 but constantly goes offline is useless. For a service to have true high availability, it needs not only to be always on, but also to have several “nines” (99.999[…]) of reliability.

It has long been the case that building systems with this kind of reliability and availability meant large costs for companies. A failover cluster of servers in a data center is not enough: you also need multiple redundant energy sources for the data center, and even replication between multiple geographical locations in case of disasters. With the exception of very large, multinational companies, almost no one could afford such a setup.

With the advent of infrastructure-as-a-service and platform-as-a-service providers, however, the costs of building such a service have decreased dramatically. It is now possible for most cloud-based service providers, especially for software-based services, to offer very aggressive service level agreements (SLAs). Before getting there, however, it is necessary to understand what such an agreement actually promises.

Understanding high availability

If you want a high-availability service, as a buyer or a seller, the first step is to understand exactly what that means. Let’s take a 99.99% SLA, for instance. In practice, this means that in any given month (assuming a 30-day month), the service can be offline for only about 4 minutes and 19 seconds, or about 52 minutes per year. If we look at most cloud service providers today, how many actually deliver on this promise?
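The downtime budget behind an SLA figure is simple arithmetic, and it can help to see it for several targets side by side. The following is a minimal sketch; the function name and the list of targets are illustrative choices, not part of any provider's tooling.

```python
def allowed_downtime_minutes(availability_percent: float, period_days: float) -> float:
    """Return the maximum downtime, in minutes, that an availability target permits."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - availability_percent / 100)


# Illustrative targets; a 30-day month and a 365-day year are assumed.
for target in (99.9, 99.95, 99.99, 99.999):
    per_month = allowed_downtime_minutes(target, 30)
    per_year = allowed_downtime_minutes(target, 365)
    print(f"{target}%: {per_month:.2f} min/month, {per_year:.2f} min/year")
```

Each extra “nine” cuts the budget by a factor of ten, which is why the step from 99.9% to 99.99% is far more demanding than it looks on paper.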

There are a few questions to be considered here. As a buyer of the service, do I really need this service level? Am I willing to pay the extra cost that will be associated with this? Are the guarantees being offered enough to cover my expenses in case of failure? The last one is perhaps the most important and most difficult one, because several factors have to be taken into consideration. If your company is going to rely on cloud services to function, what happens when they go offline?

Earlier this year, several high-profile tech companies ran into trouble when Amazon’s EC2 service suffered an outage. The outage, according to Amazon itself, lasted for almost 11 hours. This is much longer than what their 99.95% SLA allows, which works out to roughly four and a half hours over a full year. Amazon offered a 10-day credit to all affected customers, but does this credit cover the true cost of the outage? As more and more services and applications go to the cloud, this question becomes increasingly important.
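To put the outage in perspective, it can be compared against the downtime the SLA tolerates, and the credit can be compared against what the downtime cost. The sketch below assumes the 99.95% figure is measured over a 365-day year and that a 10-day credit is worth ten days of the customer's usual fees. The customer figures (loss per hour, daily fee) are entirely hypothetical and only there to show the shape of the calculation.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: the SLA is measured over a 365-day year


def downtime_budget_hours(sla_percent: float, period_hours: float = HOURS_PER_YEAR) -> float:
    """Hours of downtime the SLA tolerates over the period."""
    return period_hours * (1 - sla_percent / 100)


def achieved_availability(outage_hours: float, period_hours: float = HOURS_PER_YEAR) -> float:
    """Availability in percent if this outage is the only downtime in the period."""
    return 100 * (1 - outage_hours / period_hours)


def uncovered_loss(outage_hours: float, loss_per_hour: float,
                   daily_fee: float, credit_days: float) -> float:
    """Outage cost left after the credit (negative if the credit exceeds the cost)."""
    return outage_hours * loss_per_hour - credit_days * daily_fee


budget = downtime_budget_hours(99.95)
print(f"Budget: {budget:.2f} h/year, outage: 11 h ({11 / budget:.1f}x the budget)")
print(f"Availability for the year: {achieved_availability(11):.3f}%")

# Hypothetical customer: loses 2,000 per hour offline and pays 300 per day for hosting.
print(f"Uncovered loss: {uncovered_loss(11, 2000, 300, 10):,.0f}")
```

With these made-up numbers the credit covers only a small part of the loss. The real answer depends on how much revenue each customer actually has riding on the service.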

Standing on the shoulders of others

As cloud services and applications become more complex and more reliant on the underlying cloud platform, it becomes harder and harder to quickly identify and solve problems. Problems can arise not only from an individual service, but also from the interaction of multiple components and automated systems over distributed networks and data centers, resulting in issues that take a long time to resolve. Regardless of the quality of a service provider, or of its underlying hardware or platform, the chance of failure increases with complexity.

Understanding this complexity and the reliance on the platform is fundamental when defining the required availability level for a service. If you are building a service that needs 99.99% availability, you cannot simply rely on Amazon’s EC2, since it only offers 99.95%: a service hosted on a single platform cannot be more available than that platform. To reach the target, you would need a different host for the service, or even multiple hosts.
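One way to see why the platform sets a ceiling: when a service needs every layer beneath it to be up at the same time, the availabilities multiply. The sketch below assumes failures are independent, which is a simplification, and the 99.99% figure for the application layer is a hypothetical value chosen for illustration.

```python
from math import prod


def serial_availability(*availabilities: float) -> float:
    """Availability of a chain in which every component must be up at once.

    Assumes independent failures; each value is a fraction such as 0.9995.
    """
    return prod(availabilities)


platform = 0.9995     # the 99.95% platform SLA discussed in the text
application = 0.9999  # hypothetical: the provider's own software, taken alone

combined = serial_availability(platform, application)
print(f"Combined availability: {combined:.4%}")
```

Under these assumptions the combined figure is always lower than the weakest layer, so even flawless software on a 99.95% platform cannot honestly promise 99.99%.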

The same goes for buyers: if your vendor offers 99.99% availability, look at how well they have maintained this service level in the past, and look at the availability of the underlying platform. If you know that it will not be enough, make it clear. I’ve had clients who demanded that I have replicated servers in geographically distinct locations in case of natural disasters. It might sound crazy, but it may make sense, and if it does, I must be ready to do this.

This is where the existence of multiple providers in the cloud comes in handy. Services can be hosted and replicated across multiple providers, in multiple locations, to greatly reduce the chances of failure. Even downtime related to maintenance can be reduced by spreading a service over multiple providers: the chance of planned maintenance windows overlapping between providers is very small.
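The gain from replication can be estimated with the complementary calculation: a replicated service is down only when every replica is down at the same moment. This is a rough sketch that assumes provider failures are independent and that failover is instant and flawless. Neither holds perfectly in practice, so the results are best read as an upper bound.

```python
from math import prod


def redundant_availability(*availabilities: float) -> float:
    """Availability when the service is up as long as at least one replica is up.

    Assumes independent failures and instant, flawless failover.
    """
    return 1 - prod(1 - a for a in availabilities)


# Illustrative: one, two and three providers, each offering 99.95%.
for replicas in (1, 2, 3):
    combined = redundant_availability(*[0.9995] * replicas)
    print(f"{replicas} provider(s): {combined:.7%}")
```

Even under generous assumptions, the point stands: two independent providers at 99.95% each can, in principle, exceed a 99.99% target that neither could meet alone.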

Putting it all together

As we’ve seen, several important factors need to be considered when discussing availability in the cloud. Several large cloud providers, such as Rackspace and Amazon, offer very aggressive service levels (100% uptime for Rackspace, 99.95% for Amazon) that are almost impossible to maintain, so we must ask ourselves: what happens when the services fail? Are the credits we are entitled to enough to cover the true costs of a failure? Is my service provider really capable of delivering the promised availability level? What is its track record?

As cloud services mature, these questions become as important as price or other factors in choosing the right service provider. Answering them and understanding what is behind them becomes crucial so that everyone can trust and use cloud services without restrictions.