Earlier this week, Delta Air Lines experienced a massive IT outage that resulted in a global ground stop of all Delta flights and widespread delays and rebookings, with about 1,300 flights cancelled as of this writing. Costs are estimated in the millions of dollars, and at this point little is known about the source of the outage, other than a “power outage” at Delta’s Atlanta data center. The company initially implied the fault lay with the local utility, but recent information seems to indicate the failure was in Delta’s own power distribution infrastructure.

This follows a string of high-profile IT outages at major airlines, each resulting in significant disruption to airline operations, a seven- or eight-figure direct cost to the airline to recover, and associated “soft costs” in damage to the airline’s reputation with the travelling public. The latter is especially acute for Delta, which has prominently featured its on-time performance in its marketing materials and brand story.

IT spaghetti

It’s easy to blame this outage on IT operational issues and assume it is a routine failing of IT management or operations execution. I’ve been in my share of data centers where high-end power backup systems were diligently installed, but the “replace battery” light had been blinking for months, or redundant systems were never actually tested and were configured so improperly that they would fail in an actual outage. However, in the case of the airline outages, and in many other industries, IT systems are becoming too complex to reliably maintain, even with the best disaster recovery plans and testing. For a basic benchmark of the complexity companies are now dealing with, a recent outage at Southwest Airlines was solved with a complete system reboot, a process that took 12 hours to complete.

I’ve been in countless meetings where someone proudly displays what I call the “spaghetti chart,” meant to depict IT applications and their interrelations, which is usually a mess of hundreds or thousands of boxes connected with an indecipherable muddle of lines. If the visual chaos of these charts were not enough, within moments a spirited debate usually arises as to the accuracy of the chart, clearly demonstrating that most IT shops can’t even agree on their inventory of applications or their relationships. In all but the rare case, this is due to complexity wrought over decades of mergers and upgrades, rather than a lack of competence by IT staff.
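One way to make a spaghetti chart actionable is to treat it as a dependency graph and ask a concrete question: if this one system fails, what else stops working? Below is a minimal sketch of that idea in Python. The application names and dependencies are entirely hypothetical, invented for illustration; real inventories would come from a CMDB or discovery tooling.

```python
from collections import defaultdict

# Hypothetical application inventory: each app lists the apps it depends on.
deps = {
    "booking": ["reservations-db", "payment-gateway"],
    "check-in": ["reservations-db", "crew-scheduling"],
    "crew-scheduling": ["reservations-db"],
    "payment-gateway": [],
    "reservations-db": [],
}

# Invert the map: for each app, which apps depend on it (directly)?
dependents = defaultdict(set)
for app, upstreams in deps.items():
    for up in upstreams:
        dependents[up].add(app)

def blast_radius(app):
    """Return every application transitively impacted if `app` fails."""
    impacted, stack = set(), [app]
    while stack:
        node = stack.pop()
        for down in dependents[node]:
            if down not in impacted:
                impacted.add(down)
                stack.append(down)
    return impacted

# Rank systems by how much of the portfolio their failure takes down.
for app in sorted(deps, key=lambda a: -len(blast_radius(a))):
    print(f"{app}: impacts {sorted(blast_radius(app))}")
```

Even this toy version surfaces the point the chart obscures: a shared database sitting under three customer-facing systems is a far bigger single point of failure than the boxes-and-lines picture suggests.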

The cost of complexity

The origins of this complexity are relatively obvious. Even small or mid-sized organizations may have gone through a merger or acquisition or two in their history, and have likely bolted together disparate systems with years of interfaces and middleware. In fact, much of modern IT is centered on integrations, and with ever tighter timelines for delivery, it’s often easier to build an integration layer atop an aging system that no one understands than it is to evaluate whether the process it supports can be simplified or rebuilt to make it more robust.

Efforts to reduce complexity, like application portfolio rationalizations, are usually placed low on the priority list, since they’re expensive and, when executed perfectly, are completely transparent to end users, save for a large bill for the effort.

The historical problem with reducing complexity has been high cost with low perceived return. ROI calculations usually centered on reduced hardware and support costs from eliminating redundancy, which rarely covered the high cost of replatforming, rebuilding, or retiring a legacy application. However, incidents like Delta’s readily demonstrate the very real costs of IT complexity in lost revenue. A company need not be a complex, time-sensitive operation like an airline to experience this challenge; I once worked with a client where an overly complex invoicing process rendered an acquired division unable to invoice customers, creating a cash flow and revenue problem that ultimately affected the quarterly results and share price.

The beauty of simplicity

For too long, we’ve tolerated increasing complexity in IT. Entire sectors of the industry exist to integrate systems that should have been disposed of, and we celebrate the technician who can “make it happen” rather than considering whether “it” has become so complex that it represents a critical point of failure down the road. Complexity can even define our existence as technicians; after all, it feels great to do the “impossible” and win the accolades of your peers, rather than forcing a difficult discussion on whether a business process should be modified because it introduces too much risk into the supporting infrastructure.

Consider the cost of any complexity you introduce before building elaborate new integrations, or taking the fast-and-complicated route without fully understanding the risk. A few dollars saved up front will be rapidly erased when your business is unable to function. Use incidents like Delta’s to clearly convey the cost of risk, and study where your existing IT infrastructure has become so complex that no one really understands the impact of a potential failure, from a blip in power distribution to a single router failure. While your peers on the business side may not know an ESB from a UPS, they can readily understand the impact of your business being unable to perform its core functions.

Also see:
Massive Delta outage highlights need for quality data center power, backup plans
Plan like a startup: How IT can abandon constraints of legacy systems
Four painful IT lessons from the NYSE, United Airlines, and WSJ outages