Complexity is a potential cloud killer: to deliver on the promises of scalability and high availability, the architecture of cloud-based systems becomes ever more complex. Services must be made redundant and self-monitoring, data must be automatically replicated to multiple locations, and workloads must be balanced between multiple servers. As more and more critical applications are moved to the cloud, the risk of failure – and the cost associated with any such failure – increases as well.
As with traditional IT systems, the best way to minimize the risks is to put in place monitoring and alert systems. Unlike in traditional systems, however, the increased complexity of the cloud means that it can be impossible for anyone, whether a single person or a whole team, to respond to a problem with the necessary speed. As services interact more and more, identifying points of failure becomes harder, and human intervention becomes dangerous.
Hiring larger teams of people to manage cloud systems would mean abandoning the cost reductions associated with moving applications to the cloud. Reducing complexity, while a worthwhile goal, may not be feasible depending on the system and the architecture.
The only viable solution, then, is to rely on the automation of management tasks. With the evolution of cloud infrastructure, platforms and software, several specialized cloud server management solutions are appearing, and they can make our lives much easier.
Monitoring and alerts, cloud-style
The best tools in this space are the cloud-based automated monitoring tools. These tools allow anyone to monitor multiple resources across several different servers and to set alarms based on any of the monitored metrics. The best examples of this kind are Amazon’s CloudWatch and CloudKick (which has recently been acquired by Rackspace). Both solutions allow for real-time monitoring of resources and have out-of-the-box support for several metrics. Both also have rich visualization environments, and both are extensible: CloudKick supports “plug-ins”, which are custom-built monitoring scripts that its agent executes, and CloudWatch has an API through which any application can post a metric.
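To make the idea of posting a metric through the CloudWatch API concrete, here is a minimal sketch. It assumes the present-day boto3 SDK, whose CloudWatch client exposes a put_metric_data call. The helper names (build_metric_datum, publish_metric) and the namespace and metric names in the usage note are hypothetical, chosen only for illustration.

```python
from datetime import datetime, timezone


def build_metric_datum(name, value, unit="Count", dimensions=None):
    """Build one entry for the MetricData list of a PutMetricData request."""
    datum = {
        "MetricName": name,
        "Value": float(value),
        "Unit": unit,
        "Timestamp": datetime.now(timezone.utc),
    }
    if dimensions:
        # CloudWatch expects dimensions as a list of Name/Value pairs.
        datum["Dimensions"] = [
            {"Name": key, "Value": val} for key, val in sorted(dimensions.items())
        ]
    return datum


def publish_metric(client, namespace, name, value, unit="Count", dimensions=None):
    """Send a single custom metric through a CloudWatch client.

    `client` is assumed to be a boto3 CloudWatch client, for example
    boto3.client("cloudwatch"); any object with a compatible
    put_metric_data method works, which keeps this sketch testable offline.
    """
    datum = build_metric_datum(name, value, unit, dimensions)
    client.put_metric_data(Namespace=namespace, MetricData=[datum])
    return datum
```

With AWS credentials configured, a call such as publish_metric(boto3.client("cloudwatch"), "MyApp", "QueueDepth", 12, dimensions={"Host": "web-1"}) would record one data point that alarms can then be set on.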
The main difference between the two offerings is their scope. CloudWatch is focused on monitoring Amazon servers and services: while the API makes it possible to monitor any server, some custom development is required. CloudKick, on the other hand, can support most providers, since most monitoring is done by their agent, and this agent is available for many different operating systems. It can be easily installed to begin monitoring resources in any cloud server, regardless of the provider.
Another option is the cloud server management services offered by most large providers. While these aren’t really automation tools, they can make management of cloud servers much easier. Unfortunately, these services can be very expensive when compared to an unmanaged server. And since true high availability may require multiple providers, we would end up with different people managing different parts of our infrastructure, often with different service levels, which is extremely unattractive and hard to sell.
From monitoring to self-healing and beyond
Once the monitoring solutions are in place, it becomes possible to automate many tasks that would fall to system administrators. There are many available cloud-based providers of alert systems that can be integrated with the monitoring solutions mentioned, and these can send alerts by e-mail, phone or SMS to anywhere in the world. But alerts are only the first step.
Through the APIs offered by providers such as Rackspace and Amazon, it is now possible to perform automated tasks to ensure the reliability of a system. If an excessive use of CPU or RAM is detected, for instance, it is possible to automatically scale up the servers where the system is running, so that it doesn’t become unavailable, while at the same time signaling the support team that something has gone wrong. In extreme cases, it is possible to take a problematic server off-line and reassign its IP address to a new server dynamically, without any human intervention.
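The scale-up scenario can be sketched as a small decision routine. This is an illustration only: the thresholds, the Sample type and the scale_up and notify callbacks are all hypothetical, and in a real setup the callbacks would wrap the provider's API and an alerting service.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One monitoring reading, both values as percentages (0-100)."""
    cpu_percent: float
    ram_percent: float


def needs_scale_up(samples, cpu_limit=85.0, ram_limit=90.0, min_breaches=3):
    """Return True when the last `min_breaches` samples all exceed a limit.

    Requiring several consecutive breaches avoids reacting to a short spike.
    """
    recent = samples[-min_breaches:]
    if len(recent) < min_breaches:
        return False
    return all(
        s.cpu_percent > cpu_limit or s.ram_percent > ram_limit for s in recent
    )


def react(samples, scale_up, notify):
    """Scale up and alert the support team when load stays too high."""
    if not needs_scale_up(samples):
        return "ok"
    scale_up()  # hypothetical callback wrapping the provider's resize API
    notify("Sustained high CPU/RAM: server was scaled up automatically")
    return "scaled"
```

Passing the actions in as callbacks keeps the decision logic independent of any particular provider, so the same check can drive Rackspace, Amazon or a test stub.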
A double-edged sword
As with most technology-based solutions, automation can be a double-edged sword. In failure situations, automated recovery processes can interact with catastrophic results. In fact, this is exactly what happened in Amazon’s outage of April 21st, 2011. A configuration error resulted in an availability error, which triggered an automated response that ultimately resulted in massive unavailability. At the same time, the sheer number of elements that must be managed in a setup like that makes human-based administration infeasible.
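One way to picture the trade-off is a cap on how many automated actions may run in a given time window, with everything beyond the cap handed to a person. The sketch below is purely illustrative and does not describe how Amazon or any other provider handles this; the class name, the limits and the callbacks are hypothetical.

```python
import time
from collections import deque


class RemediationBudget:
    """Allow at most `max_actions` automated actions per `window_seconds`."""

    def __init__(self, max_actions=3, window_seconds=600.0, clock=time.monotonic):
        self.max_actions = max_actions
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the behaviour can be tested
        self._stamps = deque()

    def try_acquire(self):
        """Return True if another automated action may run now."""
        now = self.clock()
        # Forget actions that have aged out of the window.
        while self._stamps and now - self._stamps[0] >= self.window_seconds:
            self._stamps.popleft()
        if len(self._stamps) >= self.max_actions:
            return False
        self._stamps.append(now)
        return True


def remediate(budget, action, escalate):
    """Run `action` if the budget allows it; otherwise hand over to a human."""
    if budget.try_acquire():
        action()
        return "automated"
    escalate("Automation budget exhausted; manual review needed")
    return "escalated"
```

The point of the cap is that a burst of failures stops triggering ever more automated responses and reaches the support team instead.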
While it is possible to rely on traditional monitoring systems to keep track of cloud applications and services, they are not built for this purpose. Most aren’t even marketed with this purpose in mind, and they may lack key features or carry very large licensing costs. Automation and intelligent use of APIs are essential, so once you are in the cloud, go deeper and use the specialized cloud solutions. As the saying goes, if you’re going to get wet, you might as well go swimming.