Untangling the web of social elements in technical troubleshooting

For most users, even in today's sophisticated workforce, IT infrastructure is a closed box in which unknown events occur that produce results. This can allow IT to be the scapegoat for any and all ills.

As IT professionals, we approach programming and system problems as understandable, solvable issues. For our clients, IT problems take place in a black box, enabling them to blame the system for anything. That can become an excuse for other behaviors that fundamentally come from the social or political spheres. After years of arrogantly dismissing my clients as foolish, a blatant example of this phenomenon finally opened my eyes.

On this particular project, I worked as a messaging subject matter expert for a company of around 600 employees. The project encompassed a messaging backbone upgrade and some server configuration modifications. The telecom team worked on a parallel project to upgrade several locations' phone and voice mail systems. After extensive testing and a few false starts, we finally rolled the backbone out over a weekend. When two weeks went by without any unusual problems, the IT manager and I shook hands. I departed with the sense of a job well done.

A month later, I arrived back on the client site early Monday morning. It seemed that the new mail system had failed to live up to expectations. Routed mail, used to approve various sales contracts, could take up to a week to arrive at the first person on the routing list. The programmers swore their code worked perfectly. The infrastructure team insisted the backbone passed messages in less than a second. The CIO wanted the problem resolved, so he had asked my company to send me back.

I spent my first few days running various use cases to try to narrow down the problem. Everything looked good. Mail routed to the first recipient less than two seconds after initiation. Signing did not create undue delays. On particularly long lists (100+ recipients), a signed mail might take as long as 10 seconds to send to the next recipient. Frankly, that delay counted as a low-priority usability issue; anything requiring authorization of 16 percent of the company probably should be handled some other way. A week of poring over logs showed no errors in either tracked routed mail or standard mail delivery.

At a complete loss as to where to go, I asked the manager to send out the support crew. Maybe feet on the floor could dig up a user error of some kind. One week later, nothing seemed out of bounds. Users routed mail like madmen with nary a hitch. Yet we still received persistent, nagging reports of problems to the point where the CEO was becoming annoyed.

During a lunch with one of my counterparts on the telecom team, the glimmer of an answer occurred to me. The phone system upgrade, for whatever reason, seemed plagued by logistical difficulties. Missed deadlines, misplaced orders, and badly configured circuits continued to dog the project. Could their bad karma somehow affect our system?

In fact, correlating the sites the telecom team worked on with places reporting the "routing problem" revealed a 99 percent match. Yet their project had nothing at all to do with our network connectivity. A quick bit of research confirmed that supposition, revealing no corresponding network outages.
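For readers who want to run the same kind of check, here is a minimal sketch of that correlation in Python. It is not the tooling used on the project. The function name and the site names are made up for illustration, and it assumes you already have one list of sites reporting the problem and another list of sites touched by the other project.

```python
def site_overlap(reporting_sites, project_sites):
    """Return the fraction of reporting sites that are also project sites."""
    reporting = set(reporting_sites)
    project = set(project_sites)
    if not reporting:
        # No reports at all means there is nothing to correlate.
        return 0.0
    return len(reporting & project) / len(reporting)


# Hypothetical site names, for illustration only.
reporting = ["Site A", "Site B", "Site C", "Site D"]
project = ["Site A", "Site B", "Site C", "Site D", "Site E"]

share = site_overlap(reporting, project)
print(f"{share:.0%} of reporting sites are also project sites")
```

A high overlap like this only says where to look next. It does not establish a technical cause, which is why the next step was to check for corresponding network outages.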

Life in the opaque box
With no real system problems apparent, I started to wonder if another explanation might suggest itself. We knew from our investigations that client contracts, orders, and other things were being delayed. So, if the system was not to blame, who or what was?

My thoughts turned to the user community. They were blaming me for their problems. Sitting on my borrowed desk was the Sidney Harris cartoon in which a scientist's blackboard derivation includes the step "then a miracle occurs." It occurred to me that we are the miracle in the equation; we are the part of the system that no one understands.

For most users, even in today's sophisticated workforce, IT infrastructure is a closed box in which events occur that produce results. They understand the workings of their own part of the system. They have a general understanding of the activities of other employees who relate to their jobs. The opaqueness of IT's activities allows us to be the scapegoat for any and all ills.

The low social standing of many IT departments exacerbates this situation. In most organizations, IT workers don't participate in the constant give and take of social exchange. They hide in their cubicles, working feverishly to maintain the system. This behavior isolates them from their coworkers. Since no one knows them, or understands what they do, everyone feels free to blame the IT department.

All of this is consistent with both classical sociological theory and systems design theory. The question is, what do we do about it? If the black box will always be blamed for problems, how do we as professionals approach the issue?

In the world of consulting (as opposed to operations, management, or support), we use three basic tools to "break the black box":
  1. Impartial arbitrator: As consultants, we are outside the organization. We have no political stake in the conflicts between the departments. In the above case, I pointed out to the CEO that the IT department (networking, servers, programming, and telecom) had nothing to do with the problem. My explanation convinced him to step back and assess how we needed to proceed.
  2. Implicit social status: Being consultants, we have an implicit level of social status based on the status of the person who authorized our presence. In the above case, I used the CEO's agreement as a leverage point to schedule meetings with regional managers. At those meetings, we talked about how the IT department could and did monitor a wide variety of activities, including mail submission, downloads, and installed software.
  3. Sacrifice play: Although we hate to admit it, sometimes consultants are hired to either fail or sacrifice themselves on an organization’s altar of political expediency. We sometimes have to take the fall for a problem, whether it exists or not. In the above case, I accepted the "blame" for the problem, claiming that I incorrectly configured a part of the primary system. I actually did make a handful of tuning changes to the servers, kicking internal mail delivery time down another half second, on average. We explained the situation to the CEO, who sent out a memo crucifying me and letting the company know that everything was resolved.

After taking these steps, the social aspect of the problem subsided. I'm sure that others arose over time, since we made no direct impact on the core issue. I would like to think that our course of action helped to open lines of communication within the organization, but experience tells me that communication cannot flow without constant effort on all sides.

