One of the biggest issues in any organization is when an application goes down. Downtime can lead to a bad user experience and, ultimately, cost the business money.
At a breakout session during the 2017 Google Cloud Next conference, Google’s director of customer reliability engineering, Luke Stone, explained some of the top reasons for downtime and how developers and IT can fight them.
Here are 10 common causes of application downtime and how to avoid them.
1. Overload
Overload, very simply, is “when demand exceeds capacity,” Stone said. One way to respond is with load shedding: work to shed the excess load before it crushes the application. When at capacity, return errors before doing any work. Calculate the capacity of your service, determine how many requests the application server can handle, and then serve an error to every request past that number.
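A minimal sketch of load shedding, assuming capacity is expressed as a maximum number of in-flight requests. The `LoadShedder` class, the `handle` helper, and the choice of a 503 status are illustrative assumptions, not something Stone prescribed:

```python
import threading


class LoadShedder:
    """Reject work once in-flight requests reach a fixed capacity."""

    def __init__(self, capacity):
        self.capacity = capacity  # assumed to come from load testing
        self.in_flight = 0
        self._lock = threading.Lock()

    def try_acquire(self):
        # Decide up front, before doing any work for the request.
        with self._lock:
            if self.in_flight >= self.capacity:
                return False
            self.in_flight += 1
            return True

    def release(self):
        with self._lock:
            self.in_flight -= 1


def handle(shedder, work):
    """Return (status, body); 503 means the request was shed."""
    if not shedder.try_acquire():
        return 503, "overloaded, try again later"
    try:
        return 200, work()
    finally:
        shedder.release()
```

The capacity check happens before `work()` runs, so a rejected request costs almost nothing to serve.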
2. Noisy neighbor
Users can bring spam, or another workload (backups, etc.) can become a noisy neighbor. In this instance, you can’t shed the load evenly, because doing so would unfairly affect normal users. To deal with noisy neighbors, Stone recommended putting in limits, like request limits for each user to cancel out spam bots and bad actors. You can also try to limit by IP address, but that could affect good users. So, make the limits configurable so you can easily help people who were wrongly affected.
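One way to picture per-user limits with a configurable override is a fixed-window counter. This is a sketch under assumptions: the `PerUserLimiter` name, the 60-second window, and the in-memory dictionary (which never evicts old windows) are all hypothetical simplifications.

```python
import time
from collections import defaultdict


class PerUserLimiter:
    """Fixed-window request counter keyed by user ID."""

    def __init__(self, default_limit, window_seconds=60, clock=time.monotonic):
        self.default_limit = default_limit
        self.window_seconds = window_seconds
        self.clock = clock
        self.overrides = {}              # user_id -> custom limit
        self._counts = defaultdict(int)  # (user_id, window index) -> requests seen

    def set_limit(self, user_id, limit):
        # Configurable escape hatch for users who were wrongly throttled.
        self.overrides[user_id] = limit

    def allow(self, user_id):
        window = int(self.clock() // self.window_seconds)
        key = (user_id, window)
        limit = self.overrides.get(user_id, self.default_limit)
        if self._counts[key] >= limit:
            return False
        self._counts[key] += 1
        return True
```

Because each user has a separate counter, a spam bot that exhausts its own limit does not reduce what other users are allowed to send.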
3. Retry spikes
When you start rejecting users, they’ll start retrying because they don’t know why they were rejected. Clients can’t tell the difference between a single failure and a broken service. These retry spikes can lead to a cascading failure, Stone said. To fix this, implement aggressive backoff and try to guess when the service will be overloaded so you can plan for that.
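A client-side backoff could look roughly like the following. It is only a sketch: the exponential schedule, the random jitter, the default values, and the use of `ConnectionError` as the retryable error are assumptions added for illustration, since Stone only called for "aggressive backoff."

```python
import random
import time


def backoff_delays(max_retries, base=0.5, cap=30.0, rng=random.random):
    """Yield one randomized wait per retry, bounded by min(cap, base * 2**n)."""
    for attempt in range(max_retries):
        ceiling = min(cap, base * (2 ** attempt))
        yield rng() * ceiling


def call_with_retries(request, max_retries=4, sleep=time.sleep, **backoff_kwargs):
    """Call request() until it succeeds or the retries run out."""
    delays = backoff_delays(max_retries, **backoff_kwargs)
    while True:
        try:
            return request()
        except ConnectionError:  # assumed retryable error type
            delay = next(delays, None)
            if delay is None:
                raise  # out of retries: surface the failure
            sleep(delay)
```

Randomizing each wait spreads retries out over time, so clients that were rejected together do not all come back at the same instant.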
4. Bad dependency
Here, a dependency suddenly becomes very slow and requests pile up. This problem happens when your application’s input and output aren’t communicating correctly. Your client must become a “defensive driver” so it doesn’t overload your backend systems. You can also approach this problem with dynamic load shedding, Stone said. Going forward, the organization should simulate a disaster scenario with sudden, massive load tests to figure out where the pain points will be.
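One common way to make a client a "defensive driver" is a circuit breaker, which stops calling a failing dependency for a cool-down period. Stone did not name this pattern in the session, so treat the sketch below as one possible interpretation. The class name, thresholds, and the expectation that the `dependency` callable enforces its own timeout and raises on failure are all assumptions.

```python
import time


class CircuitBreaker:
    """Stop calling a failing dependency for a cool-down period."""

    def __init__(self, failure_threshold=3, reset_after=10.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def call(self, dependency, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return fallback()  # fail fast without touching the dependency
            self.opened_at = None  # cool-down over: allow one trial call
        try:
            result = dependency()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            return fallback()
        self.failures = 0
        return result
```

While the breaker is open, requests get the fallback immediately instead of piling up behind the slow dependency.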
5. Scaling boundaries
When an organization wants to serve more requests, it can often run into issues as it tries to scale. This can cause a bottleneck between the application you’re building and a service (like an API) that it relies on. To solve the problem, Google relies on what Stone called “sharding,” where a consistent workload is broken up into little chunks that can be done separately. Stone recommended sharding early and using more shards to increase capacity.
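As a rough illustration of sharding, the sketch below assigns each key to a shard with a stable hash. The function names and the choice of SHA-256 with a modulo are assumptions for the example, not a description of Google's implementation.

```python
import hashlib


def shard_for(key, num_shards):
    """Map a key to a shard index with a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def partition(keys, num_shards):
    """Group keys into per-shard work lists that can be processed separately."""
    shards = [[] for _ in range(num_shards)]
    for key in keys:
        shards[shard_for(key, num_shards)].append(key)
    return shards
```

A caveat of this simple scheme is that changing `num_shards` moves most keys to a different shard, which is one reason to pick a generous shard count early.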
6. Uneven sharding
This occurs when one shard becomes busier than the others, due to popularity or hot spotting, Stone said. To fix this you’ll need to reshard, which can be done by splitting shards that have gotten too big, or by using a shard map to determine which shards will become the most popular. If possible, leave the non-problem shards alone when you reshard, Stone recommended.
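Splitting one hot shard while leaving the others alone can be sketched with a range-based shard map. The `ShardMap` class, the slot count, and the shard names are hypothetical, and the sketch only updates the map; it does not move any data.

```python
import bisect


class ShardMap:
    """Range-based shard map over hash slots [0, slots)."""

    def __init__(self, slots=1024):
        self.slots = slots
        self.starts = [0]          # sorted start slot of each range
        self.owners = ["shard-0"]  # owners[i] serves starts[i] up to the next start

    def owner(self, slot):
        return self.owners[bisect.bisect_right(self.starts, slot) - 1]

    def split(self, shard, new_shard):
        """Give the upper half of one shard's range to new_shard."""
        i = self.owners.index(shard)
        end = self.starts[i + 1] if i + 1 < len(self.starts) else self.slots
        mid = (self.starts[i] + end) // 2
        if mid == self.starts[i]:
            raise ValueError("range is too small to split")
        # Only the split range changes; every other range keeps its owner.
        self.starts.insert(i + 1, mid)
        self.owners.insert(i + 1, new_shard)
```

For example, splitting `shard-0` on a 100-slot map leaves it with slots 0 to 49 and gives slots 50 to 99 to the new shard.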
7. Pets
This is derived from the “cattle, not pets” phrase. The death of a pet is sad and hard to deal with, while the death of cattle is the cost of doing business. A “pet” is a workload or system that has too much human involvement, or is regarded as too important. Pets need to be acknowledged and documented, so people other than the “handler” feel encouraged to work on them.
8. Bad deployment
This is the No. 1 cause of outages in every organization that Stone has worked in, he said. The best technique for handling outages caused by deploying bad code is having a good way to recover when it happens. A tried and true rollback process will help, Stone said, but you must know when a deploy is happening. Progressive rollouts can also help by being slow enough to detect problems that emerge gradually, Stone said. A good ramp-up strategy is to go from 1%, to 5%, to 10%, to 50%, to 100% in stages.
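The staged ramp can be sketched as follows. The stage percentages come from the talk; the hash-based user bucketing, the function names, and the use of 0 as a rollback signal are assumptions for illustration.

```python
import hashlib

STAGES = [1, 5, 10, 50, 100]  # percent of traffic at each stage


def bucket(user_id):
    """Stable bucket in [0, 100) so a user stays in or out across requests."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def uses_new_version(user_id, stage_percent):
    return bucket(user_id) < stage_percent


def next_stage(current_percent, healthy):
    """Advance one stage when healthy; return 0 to signal a rollback."""
    if not healthy:
        return 0
    later = [s for s in STAGES if s > current_percent]
    return later[0] if later else 100
```

Because buckets are stable, users who received the new version at 10% still receive it at 50%, which keeps their experience consistent while the rollout widens.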
9. Monitoring gaps
Many organizations have monitoring gaps between what the user is experiencing and what the system is telling them. Certain metrics, like how many errors you’re serving or your average latency, can help better predict what the user is seeing. If you detect that users are unhappy, you can then move down to checking your CPU, RAM, connection errors, and other metrics to try to determine what’s wrong.
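A small sketch of user-facing monitoring is shown below: it computes the error rate and average latency from a batch of request records. The `Request` record, the rule that status codes of 500 and above count as errors, and both thresholds are illustrative assumptions.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Request:
    status: int
    latency_ms: float


def user_facing_health(requests, max_error_rate=0.01, max_avg_latency_ms=300.0):
    """Summarize what users see; the thresholds are assumed, not recommended values."""
    if not requests:
        return {"error_rate": 0.0, "avg_latency_ms": 0.0, "healthy": True}
    error_rate = sum(r.status >= 500 for r in requests) / len(requests)
    avg_latency = mean(r.latency_ms for r in requests)
    return {
        "error_rate": error_rate,
        "avg_latency_ms": avg_latency,
        "healthy": error_rate <= max_error_rate and avg_latency <= max_avg_latency_ms,
    }
```

When `healthy` turns false, that is the cue to drill down into CPU, RAM, and connection metrics.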
10. Failure domains
Stone described failure domains as a “chunk of your infrastructure that can fail altogether at the same time.” Examples would be a VM, a certain zone, or even a certain region. To mitigate this issue, think ahead of time and try to determine which of these areas could be your failure domains, and put your backups and data elsewhere. These domains should also be used for rollouts and disaster testing so that if a problem is caused by a downed zone, for example, you know it might not be an application issue.
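Keeping backups outside the primary's failure domain can be sketched as a simple placement rule. The `place_replicas` function and the zone naming scheme ("<region>-<zone letter>") are assumptions for the example. The rule only guarantees that copies avoid the primary's zone and prefer other regions; it does not stop two copies from landing in the same region as each other.

```python
def place_replicas(primary_zone, zones, copies=2):
    """Choose zones for backup copies, preferring other regions, then other zones."""

    def region(zone):
        # Assumed naming: "us-east1-b" belongs to region "us-east1".
        return zone.rsplit("-", 1)[0]

    candidates = [z for z in zones if z != primary_zone]
    # False sorts before True, so zones outside the primary's region come first.
    candidates.sort(key=lambda z: (region(z) == region(primary_zone), z))
    if len(candidates) < copies:
        raise ValueError("not enough separate failure domains")
    return candidates[:copies]
```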