How Incidents Become Disasters

Well, it's been a busy week. We learned a few things, stuck to the plan as much as we could, and went out with a bang. Fortunately no one got seriously injured in the process, though I had to drop my pride in a bucket again. And again. And again. They tell me it's good for me.

You see, we had several issues arise this week. The fact that the issues themselves arose doesn't come as a shock to anyone; life as an IT person tends to involve incidents of varying magnitudes. However, the concept in my last blog came into play heavily.

Incidents bring issues to our attention, which we then address first through workarounds and then through problem analysis and resolution. You find this method, described in different ways, in all kinds of customer service and operations approaches. What they don't talk about is the catalyzing effect of incidents on general systemic problems.

Let's follow the logic. Many incidents begin as random events. You load patches you've loaded a dozen times before and the server bombs out. You try to install software on a server image already in production on fifty servers and the installation generates a run-time error. A user calls in because a Windows server's print spooler decided to stop working.

The repercussions of the event, though, play out within the context created by history. The server which bombed out may be part of an ongoing dispute between two departments, a dispute raging over half a decade regarding the importance of a system which accidentally became both mission critical and without a business continuance plan.

Resolving the incident by working around the issue requires good technical skills. You need to be levelheaded, determined, and willing to take some risks. You need to know when to wait, when to gather information, and when to plunge forward knowing you might not have THE ANSWER but that you have an answer which might work. Mastering those skills takes quite enough time and patience from anyone.

Now add to that the understandable panic of people caught without a business continuance plan. So now the business has stopped, enterprise-wide. The blame gets laid at your feet, not to mention the heat you personally feel because, despite what others may think, IT folks generally want to do a good job. We don't like taking down systems, especially after our tests showed we wouldn't.

Now add on top of the previous two points that the system, due to political compromises, never performed well. Your leadership knew it was going to go down disastrously at some point. It went down twice a week in a modestly controlled fashion before four months of tuning brought it into an almost stable configuration. So now a fractured management team must face their worst nightmare: they gambled and lost. Big time. How they respond will tell you a lot about who they are and why they do what they do.

Just to put a pin in it, make it so the system drops on the evening before another system is about to re-pilot after being down for a month for repairs. The system that dropped is in the same group as the other system, so in the end the same managers become ultimately responsible.

It's not that all of our hard-won technical skills mean jack. It's just that, in the anatomy of a disaster, the actual technical incident only modestly affects the events as they unfold. We can resolve the incident, create workarounds for the issues exposed, and resolve the problems yet still lose the war. An incident may close in a matter of hours, yet it will reverberate for months as structural issues play themselves out.

So, what do we do? As a leader I try to keep the non-technical elements of the disaster off my team's plate. As a manager, accidental or otherwise, I try to keep the team focused on activities leading to greater operational stability. As an IT architect with training in risk mitigation and system design I feel deep and abiding pain every time I look at the way the organization chose to deploy some systems and support others. That last is just pride, though, and does not belong in the role I currently play.

Thus back to my comment about swallowing pride. I don't make decisions; I just try to keep people from getting destroyed by the choices they make. Sometimes my team can pull it off. Other times we've got too much to do and not enough time to follow through with the basic mantra of incident to issue to problem I laid out above. Without each part of the team playing their part we really cannot do it.

But no one wants to hear about the problems. It's all just “excuses”. The user community does need help and they need it now.

Ah, Monday's going to be great fun.
