July 8, 2015 was a difficult day for IT staff at three high-profile companies, with the New York Stock Exchange (NYSE), United Airlines, and The Wall Street Journal (WSJ) all suffering overlapping outages.
The day started with reports that United had placed a ground stop on all US flights, eventually restoring flights about an hour later, with delays rippling through the air travel system for the rest of the day. Later that morning, the NYSE suspended trading due to “technical issues.” As related news alerts popped up on my desktop, I turned to my usual source of news, WSJ.com, only to find the site suffering an outage of its own, prompting brief concern that some sort of coordinated event might be underway.
Later that afternoon, all three companies were back in operation. United eventually got passengers to their final destinations, despite stories of manual check-ins and connectivity problems; the NYSE outage left the financial markets largely unaffected, since NYSE-listed securities can be traded on other exchanges; and WSJ.com reappeared. There are major lessons IT leaders can learn from these outages.
1: Never underestimate the value of communication
United, an airline already struggling with a difficult merger and a poor on-time record, failed to communicate quickly and effectively with customers after the ground stop was issued. As one would expect in 2015, customers sitting on planes and in airports immediately turned to the company’s website and social media accounts, only to find no mention whatsoever of the ground stop or its current status. Most of the news about the stop came from concerned travelers and reporters stuck on aircraft, relaying messages from pilots and airport personnel.
Even if you have no idea of the root cause of a major service disruption, at a minimum acknowledge its existence wherever your most important customers are likely to turn for news. In most cases, that’s your digital presence on the web and on social media.
In addition, as part of your disaster planning, practice communicating rapidly, clearly, and effectively. This may sound simple, but most communications programs at large companies are designed to carefully control the release of information, with multiple layers of approvals. During a high-profile disaster, make sure you can actually get news out to your customers and that your processes allow for flexibility. If an urgent communication gets stuck in legal review, your customers will assume the worst in the absence of an official statement.
2: Sweat the small stuff
In at least two of these outages, networking equipment was blamed. For many of us in a leadership role, networking and the “bottom of the stack” technologies that make communication possible are so reliable as to be readily ignored. The outages on July 8 should be a painful lesson that this is not the case, and that this equipment requires careful maintenance, redundancy, and skilled operators.
It’s always tempting to cut costs and staff in areas that are trouble-free; however, this can be foolish when millions of dollars are at stake.
3: Build the right response teams
Press reports about the NYSE outage indicated that three “war rooms” were immediately established: one for an executive team, one for a technical team, and one for a communications team. This can be an effective structure, although I recommend that some members of the executive team be placed in facilitation roles. If the technicians need access to emergency funds for a vendor, or if someone on the communications team needs to bypass a gatekeeper to issue an important message about the outage, they will need support from the very top. As an IT leader, you’re well positioned to fill this role.
Resist the urge to “play techie” and stare over the shoulder of some poor technician sweating over a console trying to debug the problem. Rather, ensure that person has what he or she needs in terms of organizational support.
4: Find, communicate, and fix the problem
When the smoke clears, be sure that any emergency changes or configuration tweaks are documented, or at a minimum noted down for later review and documentation, as all parties will likely be exhausted. Once everyone on the response team has a moment to catch his or her breath, start an investigation into what caused the problem. This is a great time to engage resources outside the team that installed, maintained, or fixed the troubled system or process, so that you get an impartial assessment of what went wrong.
With the root cause well understood, and any temporary fixes documented and readied for production, develop a plan to prevent future incidents involving this component. Resist the urge to blame a staff member, team, or vendor, unless they directly circumvented a policy or procedure and that circumvention led to the failure; in that case, it’s incumbent on the IT leader to ensure the same type of error cannot recur.
If there is no long-term fix to the problem and there is a systemic risk of recurrence, you now have a compelling example of the risks of maintaining the current systems and procedures; present a plan for the new tools and technologies that can prevent future outages.
Planning and practicing communication and disaster management ensures you’re prepared when an outage occurs, just as an effective post-mortem can prevent future incidents.
Note: TechRepublic and Tech Pro Research are CBS Interactive properties.