The many layers of reliability

Take another look at how you define and measure server reliability.

We have all fought the reliability battle. Our clients complain about how unstable the system is while we struggle to keep servers, routers, and switches humming along at a reasonable level of stability. Meanwhile, the clients bombard the help desk with calls, then turn around and blame the staff for every hiccup in the system. When senior executives finally step in, they take the clients' side, since they also experience whatever pain exists in the environment. As the blame starts to fall on various people at random, the question becomes: What is stability, and who measures it? I learned this lesson the hard way while working on a support procedure improvement project for a client.

My client, a midsized (3,000 or so nodes) group with locations in 20 states and four countries, called my company in to help with their constant "reliability" problems. They wanted us to assess their environment and give them specific technical and procedural solutions to "key problem areas." I went in as a junior member of a relatively small team.

After two weeks of assessment, we identified a few obvious problems. The server team had installed MS Exchange and MS SQL Server on the same disk array in the satellite offices. The network team demonstrated a bizarre tendency to ignore the foreign offices when scheduling core router outages. Three of the clients were "frequent flyers" when it came to locking themselves out of their security domain; their lockout frequency was two orders of magnitude higher than that of any other user. We recommended splitting the arrays to resolve the disk contention caused by the dual-purpose servers, keeping scheduled downtime out of the European order entry period, and providing additional training for the users who locked themselves out. The client thanked us profusely and then scheduled a six-month follow-up to verify their success.

We came back fully expecting to find the reliability problems resolved. Technically, they were. The client's IT team had, a bit reluctantly, implemented our suggestions. The uptime data from the equipment indicated our changes had the desired effect. Servers no longer crashed at predictable intervals. The links to Europe stayed up during their order entry periods. Account lockouts were down over 80 percent.

Unfortunately, the clients still regularly complained about network stability. The dramatic improvements in technical stability had not translated into a noticeable improvement in client satisfaction with the system. Why?

Measuring reliability: What do we measure?
While the IT team smugly looked on, we set about trying to ferret out the reasons behind the problem. The architect assigned to the project, the man who mentored me through my early years, knew something unusual lurked in the wings. We made a number of phone calls, sometimes posing as potential customers while watching the system to track the data stream.

Two weeks of work later, we dug out the following points of contention:
  • The clients felt their problems received attention only when they couldn't get their jobs done. Therefore, when they wanted immediate attention, they claimed the current issue prevented them from getting work done. Anything flagged as "preventing work" was listed as a reliability failure when the help desk database produced its report.
  • The IT team felt that any error not resulting in a system reboot did not count as a failure, since their bonus structure partially relied on having a zero-failure environment. A server that did not need a reboot to restore service never "failed," even when it provided no services to the client.
  • Clients generally did not distinguish between a network failure, a server failure, a service failure, or a security lockdown. They saw everything as a failure with the system. This made the help desk tickets unreliable sources for system stability tracking, despite the executives' reliance on them.
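The gap between the IT team's definition and the clients' experience can be made concrete with a small sketch. It assumes a hypothetical probe log that records, for each check, whether the host answered and whether the service clients depend on actually responded. The Probe class, its field names, and the sample data are invented for illustration and do not come from the client's environment.

```python
from dataclasses import dataclass

@dataclass
class Probe:
    minute: int        # minutes since the start of the observation window
    host_up: bool      # the machine answered; no reboot was needed
    service_ok: bool   # the service clients rely on actually responded

def host_uptime(probes):
    """Fraction of probes where the host was up (the reboot-centric view)."""
    return sum(p.host_up for p in probes) / len(probes)

def service_availability(probes):
    """Fraction of probes where the host was up AND the service responded."""
    return sum(p.host_up and p.service_ok for p in probes) / len(probes)

# Hypothetical window of ten probes: the host never goes down,
# but the service hangs during minutes 3, 4, and 5.
probes = [Probe(m, True, not (3 <= m < 6)) for m in range(10)]

print(f"host uptime:          {host_uptime(probes):.0%}")
print(f"service availability: {service_availability(probes):.0%}")
```

Under these assumed numbers the host reports perfect uptime while the service was unavailable for three of the ten probes, which is the kind of outage the reboot-only definition never counts.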

The executive sponsor thought we had completed our work with this basic analysis. My mentor disagreed. He prepared a report outlining how the company was creating continual problems for itself by measuring three disparate things and then trying to compare them:
  • Client perception of reliability, which involves service accessibility, training, usability, corporate culture, political fallout from various projects, and personality conflicts with local and central support.
  • Technical analysis of the equipment uptime without regard to its usability. Verifying system uptime is a good first step, but cannot be the alpha and omega of reliability.
  • Executive information system analysis with the assumption that all reporters have equal access to base-level information. This produced confusing data, which in turn led the executive staff to keep trying technical solutions to a procedural and communications problem.

In order to address these issues, and avoid a repeat call, our team suggested the IT staff and the executives take a more active role in their initial data analysis. Rather than relying on canned reports, we designed four basic survey instruments they could use to query the user community and correlate IT service data with known client issue patterns.

The last of these instruments, the one that correlated IT service data with client issue patterns, proved key to the corporate IT team's future success. Forced to correlate outage times with the text of client problem reports, the corporate team and the executives uncovered a host of potential issues. More importantly, the exercise pushed the corporate team to understand their clients' needs rather than force technical solutions down their throats.
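A correlation of this kind can be sketched in a few lines. The example below is a minimal illustration, not the instrument we delivered: the outage log, the ticket records, the correlate function, and the 30-minute grace period are all assumptions made up for the example.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M"

def ts(text):
    """Parse a timestamp in the (assumed) log format."""
    return datetime.strptime(text, FMT)

# Hypothetical outage log: (component, start, end)
outages = [
    ("core-router", ts("2024-03-04 09:00"), ts("2024-03-04 09:45")),
    ("mail-server", ts("2024-03-05 14:00"), ts("2024-03-05 14:20")),
]

# Hypothetical help desk tickets: (time opened, free-text description)
tickets = [
    (ts("2024-03-04 09:10"), "Cannot reach order entry system"),
    (ts("2024-03-05 14:35"), "Email is down again"),
    (ts("2024-03-06 11:00"), "Locked out, cannot get any work done"),
]

def correlate(tickets, outages, grace=timedelta(minutes=30)):
    """Split tickets into those opened during (or shortly after) a recorded
    outage and those that match no recorded outage at all."""
    explained, unexplained = [], []
    for opened, text in tickets:
        hits = [name for name, start, end in outages
                if start <= opened <= end + grace]
        if hits:
            explained.append((text, hits))
        else:
            unexplained.append(text)
    return explained, unexplained

explained, unexplained = correlate(tickets, outages)
for text, hits in explained:
    print(f"{text!r} overlaps outage of {', '.join(hits)}")
for text in unexplained:
    print(f"{text!r} matches no recorded outage")
```

The unexplained list is the interesting output. Tickets that line up with no technical outage point to training, procedure, or communication problems rather than equipment failures.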
