When cloud outages occur, a failover mechanism is usually activated to switch from poorly performing servers to healthy ones. When the failover works, cloud services continue to operate without interruption.
However, the failover mechanism doesn’t always work flawlessly. For instance, it doesn’t always tell you ahead of time whether there are enough resources for it to work. When computing resources get exhausted during the failover, the end result is a cloud outage (e.g., the Amazon outage).
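A pre-flight capacity check is one way to learn ahead of time whether a failover has enough room to succeed. The following is a minimal sketch, not any provider's actual mechanism: the function name `failover_has_capacity`, the server names, the resource units (think vCPUs), and the 20 percent headroom are all assumptions made for illustration.

```python
from typing import Mapping


def failover_has_capacity(
    free_by_healthy_server: Mapping[str, int],
    load_to_move: int,
    headroom: float = 0.2,
) -> bool:
    """Return True if the healthy servers can absorb the load with headroom.

    Units are arbitrary but must match (for example, vCPUs). Summing free
    capacity ignores fragmentation across servers, so treat a True result
    as necessary rather than sufficient.
    """
    total_free = sum(free_by_healthy_server.values())
    required = load_to_move * (1 + headroom)
    return total_free >= required


# Hypothetical numbers: 16 free units across two healthy servers.
free = {"server-b": 10, "server-c": 6}
print(failover_has_capacity(free, load_to_move=10))  # True  (needs 12)
print(failover_has_capacity(free, load_to_move=14))  # False (needs 16.8)
```

Running a check like this before the failover starts turns resource exhaustion into an early warning instead of an outage.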
During the outage, the Platform as a Service (PaaS) developers and Software as a Service (SaaS) users howl about not being able to meet work deadlines. In a desperate attempt to get the cloud services up and running, they keep calling the Infrastructure as a Service (IaaS) provider’s Help Desk. Eventually, the IaaS provider fixes the problem.
What to watch for when creating a failover algorithm
If you experience a cloud outage and are dissatisfied with the IaaS provider’s failover algorithm, you might decide to build a custom failover algorithm. You need to test the algorithm in different scenarios to make sure it works properly. Once all tests return positive results, you should get the IaaS provider’s permission to activate it in the production environment for the next round of cloud outages if the provider’s failover algorithm fails.
When you create a failover algorithm, these are three important pitfalls to avoid.
1: Leap year date
February 29, 2012 was the most recent leap day. Someone forgot to check whether the security certificate issuing server in Microsoft Azure could recognize that date. Within the first few minutes of that date, a virtual machine failed to start. It was a daunting task for the administrator to find and fix the problem.
The next leap day is February 29, 2016, so you have plenty of time to avoid the same mishap. You should test a few leap year recognition algorithms on the PaaS; this will help you ensure a security certificate will recognize the leap day.
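One common way leap-day bugs arise is computing a date such as a certificate expiry by simply adding one to the year, which produces the nonexistent February 29 in a non-leap year. The sketch below shows a leap-safe alternative using Python's standard library; the helper name `add_years` and the choice to clamp to February 28 are assumptions for illustration, not a description of how any particular certificate server works.

```python
import calendar
from datetime import date


def add_years(start: date, years: int) -> date:
    """Return the same calendar day `years` later, clamping Feb 29 to Feb 28."""
    target_year = start.year + years
    try:
        return start.replace(year=target_year)
    except ValueError:
        # Only Feb 29 can fail here: the target year is not a leap year.
        return start.replace(year=target_year, day=28)


issued = date(2012, 2, 29)
print(calendar.isleap(2012), calendar.isleap(2013))  # True False
print(add_years(issued, 1))  # 2013-02-28
print(add_years(issued, 4))  # 2016-02-29
```

Whatever date logic you use, include February 29 and the day after it in your test cases.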
2: Unstable numerical algorithm
You discover too late that a numerical algorithm you created is unstable. The algorithm loops endlessly, consuming computing resources. As the available resources shrink, the cloud service keeps slowing down. When there are no resources left, the cloud service stops operating.
Here is a simplistic scenario to help you better understand how a numerical algorithm could become unstable.
To compute the square root of two, you start an algorithm with an initial approximation of 1.4. You set a very small tolerance that tells the algorithm when it is close enough to stop. When successive approximations differ by less than this tolerance, the algorithm returns the approximate answer of 1.41421 (as expected). At this point the algorithm stops running; it is stable, and it releases its resources for other computing tasks.
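The article does not say which algorithm it has in mind, so the sketch below assumes Newton's (Babylonian) iteration, a standard method for square roots. The function name `sqrt_newton` and the default tolerance are illustrative choices.

```python
def sqrt_newton(value: float, guess: float, tolerance: float = 1e-10) -> float:
    """Approximate sqrt(value) with Newton's (Babylonian) iteration.

    Assumes value > 0 and guess > 0. The loop has no iteration cap, which
    is acceptable only because this update is known to converge for
    positive inputs.
    """
    x = guess
    while True:
        x_next = 0.5 * (x + value / x)  # average x with value / x
        if abs(x_next - x) < tolerance:  # successive estimates agree: stop
            return x_next
        x = x_next


print(round(sqrt_newton(2.0, 1.4), 5))  # 1.41421
```

Starting from 1.4, the iteration settles within a handful of steps and then returns, which is the stable behavior described above.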
You create slightly different logic in your new algorithm. You start with an initial approximation of 1.42 instead of 1.4. You discover the result doesn’t converge to the desired value; instead, it diverges widely from the approximate answer that you obtained with the first numerical algorithm.
The result keeps growing with each iteration. This algorithm continues in endless loops, eating up resources. It stops only when there are no resources left, which is what makes it unstable.
To avoid this pitfall, do your homework to determine if the algorithm can converge to a desired value.
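Alongside analyzing convergence up front, a defensive option is to cap the number of iterations so that a diverging algorithm fails fast instead of draining resources. The sketch below is one hedged way to do that. The names `iterate_with_guard`, `DidNotConverge`, `newton_step`, and `naive_step` are hypothetical, and `naive_step` is a deliberately bad update rule chosen for illustration, not the algorithm described in the article.

```python
import math
from typing import Callable


class DidNotConverge(RuntimeError):
    """Raised when an iteration fails to settle within its budget."""


def iterate_with_guard(
    step: Callable[[float], float],
    guess: float,
    tolerance: float = 1e-10,
    max_iterations: int = 50,
) -> float:
    """Apply `step` repeatedly; return on convergence, otherwise give up."""
    x = guess
    for _ in range(max_iterations):
        x_next = step(x)
        if not math.isfinite(x_next):
            raise DidNotConverge("iteration overflowed; the update rule diverges")
        if abs(x_next - x) < tolerance:
            return x_next
        x = x_next
    raise DidNotConverge(f"no convergence after {max_iterations} iterations")


def newton_step(x: float) -> float:
    return 0.5 * (x + 2.0 / x)  # converges to sqrt(2)


def naive_step(x: float) -> float:
    return x + (x * x - 2.0)  # sqrt(2) is a fixed point, but a repelling one


print(round(iterate_with_guard(newton_step, 1.42), 5))  # 1.41421
try:
    iterate_with_guard(naive_step, 1.42)
except DidNotConverge as exc:
    print("stopped:", exc)
```

Both update rules have the square root of two as a fixed point, yet only the first one is attracted to it; the guard bounds the resources the second can consume.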
3: Hypervisor failure
Every PaaS (open or closed) sits atop the virtual machines that underlie the IaaS. All virtual machines are created and run by a hypervisor. The number of virtual machines that a physical server can host is determined by that server’s capacity.
When the hypervisor fails, all virtual machines go down. One reason for the failure is that the IaaS infrastructure specialist fails to determine how many virtual machines a physical server can host. Without verifying the server’s actual capacity, the specialist attempts to add virtual machines beyond the resource limits of the physical server. If the limit is two virtual machines and the specialist adds a third, all virtual machines hosted by the physical server will stop running.
To avoid this pitfall, you need to figure out how many new virtual machines a physical server can host. Compare your findings with those of the IaaS provider or IaaS infrastructure specialist. Make sure you back up all virtual machines as a routine matter.
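A rough capacity estimate can be sketched as follows. This is a simplified model under stated assumptions: all virtual machines are the same size, only CPU cores and memory are counted, nothing is overcommitted, and a fixed amount is reserved for the hypervisor. The names `Resources`, `max_vms`, and `can_add_vm` and all the numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resources:
    cpu_cores: int
    memory_gb: int


def max_vms(host: Resources, per_vm: Resources, reserved: Resources) -> int:
    """How many identical VMs fit after reserving room for the hypervisor."""
    usable_cpu = host.cpu_cores - reserved.cpu_cores
    usable_mem = host.memory_gb - reserved.memory_gb
    # The scarcest resource sets the limit; never report a negative count.
    return max(0, min(usable_cpu // per_vm.cpu_cores,
                      usable_mem // per_vm.memory_gb))


def can_add_vm(running: int, host: Resources, per_vm: Resources,
               reserved: Resources) -> bool:
    """True only if one more VM still fits within the computed limit."""
    return running < max_vms(host, per_vm, reserved)


host = Resources(cpu_cores=16, memory_gb=64)
reserved = Resources(cpu_cores=2, memory_gb=8)   # assumed hypervisor overhead
per_vm = Resources(cpu_cores=4, memory_gb=24)

print(max_vms(host, per_vm, reserved))        # 2 (memory is the bottleneck)
print(can_add_vm(2, host, per_vm, reserved))  # False: a third VM would not fit
```

Real hypervisors also account for storage, network, and overcommit policies, so use an estimate like this as a starting point for the comparison with the provider's figures.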
Don’t rely solely on the IaaS provider’s failover algorithms. You can create your own failover algorithms, but remember to get the IaaS provider’s permission before you run them.