You built it, you fix it: Developers move to frontlines of IT incident response

The 2015 State of On-Call survey says operations is no longer the main responder to tech problems. Here's why.

From an unresponsive application, to a critical outage, to disaster recovery, if IT drama strikes and you're on-call, it's your problem.

Depending on the setup, being on-call can mean a 24-hour shift that can drag out for days. Those in IT with on-call duties would probably tell you it's not the most fun part of the job. But, as they say, someone's got to do it.

Who exactly is on the hook for incident response, though, is changing. According to the State of On-Call report, recently released by incident management platform VictorOps, more organizations are shifting the role of rapid responder from operations to developers.

More than 600 respondents took part in the 2015 survey, coming from industries like software, Internet service, and media & entertainment. More than half (54%) of companies surveyed have on-call teams consisting of two to five people, and team members are usually on-call for one week at a time.

TechRepublic spoke with VictorOps evangelist Jason Hand via email about why developers are increasingly finding themselves on the frontlines of IT incident response, as well as what's behind other trends facing those on-call.

Operations is no longer the first to handle IT issues

Hand said there are a few factors driving this trend, like the understanding of SaaS design, as well as the rise of development and management philosophies like Agile, Lean, and DevOps.

"These new view ways encourage shortened feedback loops, extreme collaboration, a focus on continuous learning, and most importantly, the idea of co-creating value of the service," Hand said.

The end goal is to make systems more resilient to outages. Operations has to change its focus from preventing failure to responding to and repairing issues quickly. To do that effectively, Hand said, team members who were involved in designing and building a system need to be part of the response to and resolution of incidents involving that system.

Being on-call is improving somewhat

One respondent described being on-call like this: "It's about the most painful and stressful thing a job can ask you to do."

Still, of the more than 600 respondents, 80% indicated that problems with being on-call are getting better or are "sort of" getting better.

Hand said there are a few reasons for improvement, however minor:

  • Rich notifications and the ability to quickly work with one's team on resolution
  • The ability to immediately jump on a control call and have all relevant parties notified at the same time
  • Better documentation and improved monitoring to catch problems earlier
  • A comprehensive inventory of servers, applications and relationships between them
  • Internal knowledge base with known solutions to recurring incidents that have not been moved to problem management yet
  • Tuning alerting to quiet the noise
  • Self-healing monitoring

Alert fatigue is up

One significant problem for those on-call is alert fatigue resulting from constantly being paged for non-actionable alerts. In fact, 70% of respondents said it's their primary pain point associated with being on-call.

Other pain points from the survey include:

  • Unawareness of what is happening across all systems at the start of an incident
  • Enterprise leaders not having context around what's happening in a firefight
  • Team members not having the relevant information to resolve the problem

"On-call is only as good as whatever is generating the alerts," said another respondent.
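One way to picture what quieting non-actionable alerts can look like is the minimal Python sketch below. It is an illustration only, not something taken from the report or from any vendor's product: the `Alert` fields, the `alerts_to_page` function, and the five-minute window are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    # Hypothetical alert record; the field names are assumptions for this sketch.
    service: str
    check: str
    actionable: bool
    timestamp: float  # seconds since some reference point


def alerts_to_page(alerts, dedup_window=300.0):
    """Return only the alerts worth paging a person for.

    Drops alerts marked non-actionable, and drops repeats of the same
    (service, check) pair that arrive within `dedup_window` seconds of
    the last page sent for that pair.
    """
    last_paged = {}  # (service, check) -> timestamp of the last page sent
    paged = []
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        if not alert.actionable:
            continue  # nobody can act on it, so nobody gets paged
        key = (alert.service, alert.check)
        previous = last_paged.get(key)
        if previous is not None and alert.timestamp - previous < dedup_window:
            continue  # same problem was paged recently; stay quiet
        last_paged[key] = alert.timestamp
        paged.append(alert)
    return paged
```

In this sketch, four incoming alerts for the same service (two CPU alerts a minute apart, one non-actionable disk alert, and a third CPU alert several minutes later) would result in two pages rather than four.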

Homegrown solutions are out

In one of the more dramatic results from the report, the percentage of companies using homegrown incident response solutions dropped from 70% down to 25%. The ready availability of SaaS-based software solutions could, in part, account for the dip.

The cons of homegrown incident management solutions include:

  • Not being able to scale it easily
  • Not knowing how much you're spending in time and money
  • The inability to pull reports

ChatOps and collaboration are important

Up from 28% in 2014, 40% of respondents said they rely on ChatOps, also known as conversation-driven development, during an incident. Common tools used for ChatOps according to the results include Flowdock, HipChat, Hubot, and Litabot.

"The main influencers in the growth of ChatOps is the adoption of services such as HipChat, as well as the fact that most services used include API functionality that can be leveraged via these chat services," Hand said.
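To make the idea of leveraging service APIs from a chat room concrete, here is a minimal Python sketch of a chat command dispatcher. Everything in it is hypothetical: the `!ack` command, the function names, and the placeholder handler are assumptions for illustration and do not represent the API of HipChat, Hubot, or any other tool named in the report.

```python
def parse_command(message):
    """Split '!ack 42' into ('ack', ['42']); return None for ordinary chat."""
    if not message.startswith("!"):
        return None
    parts = message[1:].split()
    if not parts:
        return None
    return parts[0].lower(), parts[1:]


def ack_incident(args):
    # Placeholder: a real bot would call the incident tool's API here.
    if len(args) != 1:
        return "Usage: !ack <incident-id>"
    return f"Incident {args[0]} acknowledged"


# Hypothetical command table mapping chat commands to handler functions.
HANDLERS = {"ack": ack_incident}


def handle_message(message, handlers=HANDLERS):
    """Return the bot's reply to a chat message, or None if it is not a command."""
    parsed = parse_command(message)
    if parsed is None:
        return None
    name, args = parsed
    handler = handlers.get(name)
    if handler is None:
        return f"Unknown command: {name}"
    return handler(args)
```

Because the reply is posted back into the shared room, everyone in the channel sees who acknowledged the incident and when, which is where the transparency benefit comes from.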

Hand also listed the following benefits of ChatOps:

  • Increased transparency
  • Cross-functional knowledge sharing
  • Collaboration on problems
  • Increased speed of conversation and decision-making, because the conversations are synchronous

Post-mortems help you work smarter

When asked what they use to improve their process, on-call responders put post-mortems at the top of their lists. Half of respondents said they have a defined process for conducting them, and the other half did not. In addition, 66% said they only conduct post-mortems after significant outages.

Other approaches included documentation, reporting, and alert accuracy.

About

Brian Taylor is a contributing writer for TechRepublic. He covers the tech trends, solutions, risks, and research that IT leaders need to know about, from startups to the enterprise. Technology is creating a new world, and he loves to report on it.
