Well, it's been a
busy week. We learned a few things, stuck to the plan as much as we
could, and went out with a bang. Fortunately no one got seriously
injured in the process, though I had to drop my pride in a bucket
again. And again. And again. They tell me it’s good for me.
You see, we had
several issues arise this week. The fact that the issues themselves
arose doesn’t come as a shock to anyone; life as an IT person tends
to involve incidents of varying magnitudes. However, the concept in
my last blog came into play heavily.
Incidents bring issues to our attention, which we then address first
through workarounds and then through problem analysis and resolution. You
find this method, described different ways, in all kinds of customer
service and operations approaches. What they don’t talk about is the
catalyzing effect of incidents on general systemic problems.
Let’s follow the
logic. Many incidents begin as random events. You load patches
you’ve loaded a dozen times before and the server bombs out. You try
to install software on a server image already in production on fifty
servers and the installation generates a run-time error. A user
calls in because a Windows server’s print spooler decided to stop
working. The consequences of the event, though, play out within the
context created by history.
The server which bombed out may be part of an ongoing dispute
between two departments, a dispute raging over half a decade
regarding the importance of a system which accidentally became both
mission critical and without a business continuance plan.
Resolving the incident by working around the issue requires good technical skills.
You need to be levelheaded, determined, and willing to take some
risks. You need to know when to wait, when to gather information,
and when to plunge forward knowing you might not have THE ANSWER but
that you have an answer which might work. Mastering those skills
takes quite enough time and patience from anyone.
Now add to that
the understandable panic of people caught without a business
continuance plan. So now the business has stopped, enterprise-wide.
The blame gets laid at your feet, not to mention the heat you
personally feel because, despite what others may think, IT folks
generally want to do a good job. We don’t like taking down systems,
especially after our test showed we wouldn’t.
Now add on top of
the previous two points that the system, due to political
compromises, never performed well. Your leadership knew it was going
to go down disastrously at some point. It went down twice a week in
a modestly controlled fashion before four months of tuning brought it
into an almost stable configuration. So now a fractured management
team must face their worst nightmare: they gambled and lost. Big
time. How they respond will tell you a lot about who they are and
why they do what they do.
Just to put a pin
in it, make it so the system drops on the evening before another
system is about to re-pilot after being down for a month for repairs.
The system that dropped is in the same group as the other system, so
in the end the same managers become ultimately responsible.
It’s not that all
of our hard-won technical skills mean jack. It’s just that, in the
anatomy of a disaster, the actual technical incident only modestly
affects the events as they unfold. We can resolve the incident,
create workarounds for the issues exposed, and resolve the problems
yet still lose the war. An incident may close in a matter of hours,
yet it will reverberate for months as structural issues play out.
So, what do we do?
As a leader I try to keep the non-technical elements of the disaster
off my team’s plate. As a manager, accidental or otherwise, I try to
keep the team focused on activities leading to greater operational
stability. As an IT architect with training in risk mitigation and
system design I feel deep and abiding pain every time I look at the
way the organization chose to deploy some systems and support others.
That last is just pride, though, and does not belong in the role I play.
Thus back to my
comment about swallowing pride. I don’t make decisions; I just try
to keep people from getting destroyed by the choices they make.
Sometimes my team can pull it off. Other times we’ve got too much to
do and not enough time to follow through with the basic mantra of
incident to issue to problem I laid out above. Without each part of
the team playing their part we really cannot do it.
But no one wants
to hear about the problems. It’s all just excuses. The user
community does need help and they need it now.
Ah, Monday’s going
to be great fun.