How far is too far to troubleshoot on servers?

When it comes to keeping critical workloads highly available, keen troubleshooting skills are necessary. But how much troubleshooting effort should be exerted? IT pro Rick Vanover discusses the pros and cons of restoring quickly versus tracking down the root cause.

Recently, I was in a discussion with a peer about troubleshooting. It came out that this person prefers to go ahead and restore a system from a backup, whereas I usually want to find the root cause of an issue. There may be many different ways to get to the root cause, but my main concern is preventing the issue from happening again.

On the other hand, getting the environment back up and running as quickly as possible is a worthy goal as well. Of course, there is no single answer for all situations, as each environment is different. The system in that discussion was a typical web application server that interacted with a central database, and the database was operating as expected. Further, the server was one of a pool of similar systems, each of which was operating fine as well. In that situation, the rebuild or restore option sounds fine, as there is minimal risk.

For me, the question remains: how much time is too much to spend on troubleshooting a server? We are all so busy, and the last thing we have time for is going far down a rat hole of logs, vendor support cases and other tasks that may yield little value just to arrive at a full explanation of root cause. Depending on the application, of course, full root cause information may be a requirement.

This reminds me of an interview question I received much earlier in my career. The question was common enough: “What’s your biggest weakness in the workplace?” My answer was that I spend too much time troubleshooting. Troubleshooting is a good thing; spending too much time on it is the bad part. The answer didn’t hurt me, as I got the job.

My main goal in troubleshooting, and in finding root cause specifically, is to ensure that the issue doesn’t happen again. A number of resolution paths may present themselves, some planned and some created impromptu. But the question remains: how much time do we spend in troubleshooting mode beyond resolution? Share your approach to troubleshooting below.