I’m a self-proclaimed cloud enthusiast. I make heavy use of the cloud in my business today, both to fulfill internal needs and to better serve my customers. And, for the most part, I’m very happy with how it’s working out for me. The cloud has allowed me to reach a level of availability and service that would be impossible to sustain if I were going the traditional software route, and, in my case, has actually resulted in significant cost reductions.
Being an enthusiast, however, doesn’t mean that I’m blind to the troubles that exist on the cloud today. For me, the greatest danger lies in the optimistic service level claims made by most providers. For all the claims about 100% uptime, credit-based guarantees, and premium support floating around on vendors’ websites, the true reliability metrics still fall short of these promises.
Only this past month, I’ve had incidents with a couple of providers that resulted in significant server downtime and credits being added to my accounts due to the SLA not being fulfilled. But while I was affected by these issues, my customers were, for the most part, completely unaware of what went on. Their systems kept running along just fine, without any downtime. As with any other situation, by hoping for the best but preparing for the worst, you can make sure that any problems that happen - and they will happen - are kept under control. So, how to proceed when servers go down and customers start complaining?
If trust is the currency of the web, transparency is what ensures your coffers are going to stay full. When trouble happens, make sure you let everyone know what is going on. Give constant updates during a crisis, and make sure that everyone stays on the same page. It’s always tempting to minimize problems, or to hide the true causes behind them, especially if they relate to some mistake made by your team. The truth always comes out, however, and when it does, all you’ll have are angry customers deserting you.
The way you manage communications with customers when problems start happening can make all the difference in the world to your reputation, not only with the customers themselves, but also with your providers. During last year’s major Amazon EC2 outage, for instance, several high-profile companies saw their services go down. Instead of trying to hide the problems or blaming it all on Amazon, they acknowledged the issues and at the same time praised Amazon for doing the same with them and for being an excellent business partner.
Having a direct channel to customers is fundamental, and social networks can be a big help here. Most top providers have Twitter or Facebook accounts where they post the status of their services. This creates a place where anyone can go to see the latest status information, reducing the chance of uncertainty spreading. Furthermore, active communication once any problem is detected is fundamental. Reach out to your customers over e-mail or any other means of communication necessary, and ensure that they stay informed of everything that is going on. Another very useful tool here is the status panel, which, once again, all top providers offer.
The best way to ensure maximum transparency, however, is to make sure you learn of any problems before your users do. For this, automated monitoring tools are a prerequisite. If you are relying on cloud servers and haven’t set up automated monitoring on them, stop reading this and go do it now. News of service failures spreads quickly, especially on the web. While you can get away with users detecting problems before you once or twice, if this situation repeats itself over and over again, your reputation will suffer. If you are the first to warn everyone, however, people will be much more receptive, and your reputation may actually improve.
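To make the idea of automated monitoring concrete, here is a minimal sketch of a health check that raises an alert after several consecutive failed probes. It is an illustration only, not a replacement for a real monitoring service: the names `http_status` and `UptimeMonitor` and the threshold of three failures are my own assumptions, and in practice you would run the check from a machine outside the provider you are monitoring.

```python
import urllib.error
import urllib.request
from typing import Callable, Optional


def http_status(url: str, timeout: float = 5.0) -> Optional[int]:
    """Return the HTTP status code for url, or None if it cannot be reached."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status (4xx or 5xx).
        return err.code
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, timeout and similar problems.
        return None


class UptimeMonitor:
    """Hypothetical helper that tracks consecutive failed probes of one URL."""

    def __init__(
        self,
        url: str,
        probe: Callable[[str], Optional[int]] = http_status,
        failures_before_alert: int = 3,  # assumed threshold, tune to taste
    ) -> None:
        self.url = url
        self.probe = probe
        self.failures_before_alert = failures_before_alert
        self.consecutive_failures = 0

    def check(self) -> bool:
        """Probe once. Return True only when the alert threshold is first hit."""
        status = self.probe(self.url)
        healthy = status is not None and 200 <= status < 400
        if healthy:
            self.consecutive_failures = 0
            return False
        self.consecutive_failures += 1
        return self.consecutive_failures == self.failures_before_alert
```

Calling `check()` from a scheduler every minute or so, and sending yourself a message whenever it returns `True`, is enough to hear about an outage before most customers do. Requiring several failures in a row keeps one dropped request from waking you up at night.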
The most fundamental aspect in handling any crisis is preparedness. If you have an action plan ready before a problem occurs, you can apply it and adjust it on the fly as necessary to handle any situation that might happen. There are several dimensions to being ready for trouble on the cloud. First, the basics: make sure you have automated backup routines up and running. While this seems obvious, it isn’t. Many people rely on their hosting provider to do the backups, not knowing that sometimes the provider will not do any kind of backup because of the disk size or instance type. Make sure that everything is going as expected, to avoid headaches later.
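One way to verify that backups are really happening, rather than assuming they are, is to check the age of the newest backup file on a schedule. The sketch below assumes your backups land as files in a single directory; the function names and the 24-hour limit are hypothetical choices for illustration.

```python
import time
from pathlib import Path
from typing import Optional


def newest_backup_age_hours(backup_dir: str, now: Optional[float] = None) -> Optional[float]:
    """Return the age in hours of the most recent file, or None if there is none."""
    files = [p for p in Path(backup_dir).iterdir() if p.is_file()]
    if not files:
        return None
    newest_mtime = max(p.stat().st_mtime for p in files)
    current = time.time() if now is None else now
    return (current - newest_mtime) / 3600.0


def backup_is_fresh(
    backup_dir: str,
    max_age_hours: float = 24.0,  # assumed limit for a daily backup routine
    now: Optional[float] = None,
) -> bool:
    """True if at least one backup exists and the newest one is recent enough."""
    age = newest_backup_age_hours(backup_dir, now)
    return age is not None and age <= max_age_hours
```

A check like this only proves that a file was written recently. It says nothing about whether the backup can be restored, which is why testing restores still matters.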
Second, you should have a thoroughly tested disaster plan in place. What happens when your main server goes down? How long would it take you to get back up and running from zero? Do you have a server image saved that can be quickly brought on-line in case of hardware failure? It might be worthwhile, for instance, to have a saved image even if you operate on a physical server so that, in case it crashes, you can bring a cloud machine on-line to reduce downtime.
Simply having a procedure in place is not enough. The best thing is to try out your disaster plan every once in a while. Several companies today are already employing automated tools that throw wrenches into their systems in order to try and bring everything down. This kind of drill is fundamental not only to better detect possible points of failure, but also to make sure that everyone knows what to do when problems happen.
Finally, design your systems to be as robust as possible. The cloud gives us access to a much wider pool of computing resources; make sure you use them. It is better to have an application that runs on two load-balanced servers, where one can take over if the other were to go down, than to have everything running on a single, more powerful server. By relying on wider distribution and failover mechanisms, it is possible to reach a higher availability rate than any single cloud provider would be able to give you.
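The arithmetic behind that last claim is simple enough to sketch. The code below assumes an idealized model in which servers fail independently of each other and the service is up as long as at least one of them is; the function names are my own. Real deployments fall short of these numbers, because the load balancer can fail too and outages within a single provider are often correlated.

```python
def combined_availability(per_server: float, servers: int) -> float:
    """Availability of a group that is up when at least one server is up.

    Assumes independent failures, which is an idealization.
    """
    if not 0.0 <= per_server <= 1.0:
        raise ValueError("per_server must be between 0 and 1")
    if servers < 1:
        raise ValueError("servers must be at least 1")
    # The group is down only when every server is down at the same time.
    return 1.0 - (1.0 - per_server) ** servers


def downtime_hours_per_year(availability: float) -> float:
    """Expected yearly downtime in hours, using a 365-day year."""
    return (1.0 - availability) * 365 * 24
```

Under this model, a single server that is up 99% of the time is down for about 87.6 hours a year, while two such servers behind a load balancer reach about 99.99%, or under one hour a year. Spreading the two servers across different providers or regions is what brings reality closer to the independence assumption.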
The cloud is going through an “overhype” phase. Every IT vendor is coming out with a “cloud something or other”, and selling the cloud as the solution to all problems for end-users. Everyone claims that you’ll never have to worry about backups, server downtime, or any kind of low-level IT issues ever again. Companies, especially small and medium businesses, come to the cloud with unrealistic expectations, assuming that they will never have to worry about IT ever again.
Having unrealistic expectations and not understanding that all the traditional IT worries still need to be considered are the biggest mistakes any newcomer can make. The cloud is like any other IT environment: it has its benefits, its drawbacks, and is as prone to failure as anything else. I am, and always will be, a great cloud enthusiast. I’ll keep using cloud servers and software in my business. But my enthusiasm is tempered by a dose of worry about potential problems. The best advice for anyone making the jump into cloud computing is to ignore the hype and always keep potential issues in mind.
If you have any cloud horror stories, or tips on how to be better prepared for troubles, please share in the comments.