Of all the elements that we need to consider when evaluating a cloud service provider, the service level agreement is probably the most important one. While many other factors are relevant to that choice, it's the SLA that establishes the baseline of the relationship between the consumer and the provider. The SLA is the contract that defines the limits of the responsibility your service provider is willing to assume for any service you contract, whether it involves your data, servers, applications, or anything else.
The SLA also defines what kind of compensation your service provider is willing to offer if something goes wrong and they are unable to deliver on a promise they made, usually one related to the availability of the service. In this sense it also acts as an insurance policy that can be invoked in case of trouble.
Since it serves many functions, reading and understanding the SLA that a cloud provider is offering is fundamental to working with them properly. A cloud application provider, for instance, may promise 99.95% availability of your data, citing multiple copies in separate locations to ensure this level. Some clients may take this to mean that their data is being backed up, and that they can ask to have it restored to any point in time, when this is, in fact, impossible.
All SLAs follow a similar basic structure, with most of the same components. In this series of posts, we'll explore the main components of the service level agreements for cloud providers, and highlight the main issues and points of interest of each one. We'll start with guarantees.
Guarantees and Definitions
The first of these components is the service guarantees. This is where the service providers try to clearly define the services that are covered under the terms of the agreement, and then state what level of availability they will offer for those services.
For infrastructure-as-a-service providers, the SLA will usually cover network availability, data center infrastructure, and the time they'll take to bring an instance back online in case of a problem. It will also usually cover issues with the underlying platform for the cloud servers, such as the hypervisor software and storage units. Some providers, such as Amazon and GoGrid, lump all these items into a single "server uptime" or "instance uptime" metric; others will monitor and measure each item separately, and offer different guarantees for each one.
Infrastructure providers, under their standard service plans, will never offer any guarantees related to software running inside their virtual machines. This means that if a VM stops working due to a software failure (such as an operating system problem), it does not count as downtime under the SLA.
For platform-as-a-service providers, the items covered by the SLA are focused on the functionality being delivered. For a cloud database, the SLA will talk about availability of the database in terms of the period of time it was available for the customer; for process execution, it will cover the external connectivity of the process executor; and for storage, it will cover the correct processing of storage commands, such as saving or retrieving data.
These availability measures are somewhat similar to the ones on the infrastructure layer, but more focused on the functionality, and they try to make the underlying infrastructure invisible to the end user. Platform providers will focus on the availability of the services they offer, especially on your ability to access them and their ability to reply properly to requests.
Finally, for software-as-a-service providers, the SLA will usually cover application and/or data availability. A 99% availability at this layer means that users will be able to access the application at least 99% of the time; the same goes for data availability. The availability numbers for this layer are probably the most important ones, because they will impact the end users the most. While at other layers you may be able to increase the overall availability of a system by relying on multiple providers, it may be impossible to do this here.
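As a rough illustration of why multiple providers can raise overall availability at the lower layers, here is a minimal sketch. It assumes that the providers fail independently of each other and that your system stays up as long as any one of them is up; both are simplifying assumptions that real deployments rarely satisfy completely. The function name is hypothetical.

```python
def combined_availability(availabilities_pct):
    """Availability (in percent) of a system that is up whenever at least
    one of its redundant providers is up.

    Assumption: provider failures are statistically independent.
    """
    all_down = 1.0
    for pct in availabilities_pct:
        # Probability that this provider is down at a given moment
        all_down *= 1 - pct / 100
    return 100 * (1 - all_down)


# Two hypothetical providers, each promising 99% availability
print(f"{combined_availability([99.0, 99.0]):.4f}%")
```

Under these assumptions, two 99% providers combine to roughly 99.99%. With a software-as-a-service application there is usually no second provider to fail over to, so the number in the SLA is the number your users get.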
One interesting thing about availability is that, regardless of whether it is an infrastructure, platform, or software provider, the promised level will usually hover somewhere between 99% and 99.99%. While there are some providers who will promise 100% availability, this is a hollow promise. Not only is it impossible to keep in the long run, but it will also be qualified by limitations on the way availability is monitored or by the definition of downtime.
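The difference between these percentages is easier to judge once it is converted into permitted downtime. The sketch below does that conversion, assuming a 30-day month and a 365-day year; actual SLAs define their own measurement periods, so treat the constants and the function name as illustrative.

```python
HOURS_PER_MONTH = 30 * 24   # assumption: 30-day month
HOURS_PER_YEAR = 365 * 24   # assumption: non-leap year


def allowed_downtime_hours(availability_pct, period_hours):
    """Maximum downtime, in hours, that still meets the guarantee."""
    return period_hours * (1 - availability_pct / 100)


for pct in (99.0, 99.9, 99.95, 99.99):
    monthly = allowed_downtime_hours(pct, HOURS_PER_MONTH)
    yearly = allowed_downtime_hours(pct, HOURS_PER_YEAR)
    print(f"{pct}%: {monthly:.2f} h/month, {yearly / 24:.2f} days/year")
```

Each additional "nine" cuts the permitted downtime by a factor of ten, which is why the measurement period and the exact percentage both matter when comparing providers.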
If the availability is measured on a monthly basis, a 99% guarantee means that a service may be offline for 7.2 hours in a given 30-day month. This is practically a full business day. If it is measured on a yearly basis, it could be offline for 3.65 days.
Understanding the services covered by the SLA and the availability level being offered is fundamental to preparing operationally for any trouble that might happen with your service provider. In the next post of this series, we'll discuss the compensation that providers offer when they don't manage to keep their promises, which is something that sooner or later ends up happening.
After working for a database company for 8 years, Thoran Rodrigues took the opportunity to open a cloud services company. For two years his company has been providing services for several of the largest e-commerce companies in Brazil, and over this time he has had the opportunity to work on large-scale projects ranging from data retrieval to high-availability critical services.