
How far is too far to troubleshoot on servers?

When it comes to ensuring that critical workloads are kept highly available, keen troubleshooting skills are necessary. But how much troubleshooting effort should be exerted? IT pro Rick Vanover weighs the pros and cons of quickly restoring from backup versus digging for the root cause.

Recently, I was in a discussion with a peer about troubleshooting. The upshot was that this person prefers to simply restore a system from a backup, whereas I usually want to find the root cause of an issue. There may be many different ways of getting to the root cause of an issue, but my main motivation is to prevent the issue from happening again.

On the other hand, getting the environment back up and running as quickly as possible is a worthy goal as well. Of course there is no single answer for all situations, as each environment is different. The example mentioned earlier was a typical web application server that interacted with a central database, which was operating as expected. Further, the web application ran in a pool of multiple similar systems, each of which was also operating fine. In that situation, the rebuild or restore option sounds fine, as there is really minimal risk.

For me, the question remains: how much time is too much to spend on troubleshooting a server? We are all so busy, and the last thing we really have time for is to go far down a rat hole of logs, vendor support cases and other tasks that may add little value toward a full explanation of the root cause. Depending on the application, full root-cause information may of course be a requirement.

This reminds me of an interview question I was asked much earlier in my career. The question was fair enough: "What's your biggest weakness in the workplace?" My answer was that I spend too much time troubleshooting. Troubleshooting is a good thing; too much time was the bad part. The answer didn't slow me down, as I got the job.

The main goal I approach troubleshooting with, and specifically root cause, is to ensure that the issue doesn't happen again. A number of resolution paths may present themselves, some planned and some created impromptu. But the question remains: how much time do we spend in troubleshooting mode beyond resolution? Share your approach to troubleshooting below.

About

Rick Vanover is a software strategy specialist for Veeam Software, based in Columbus, Ohio. Rick has years of IT experience and focuses on virtualization, Windows-based server administration, and system hardware.

35 comments
Nov7

Taking into consideration your problem's variables should give you a good advantage. Going the extra mile is not always the best policy, especially if time is of the essence. It seems to me that applying a bit of that so-scarce common sense should do the trick. Weighing the pros and cons of taking the time to nip the problem in the bud once and for all could put you in charge of the IT department, if the scenario had taken place in the communications room of a retirement home. On the other hand, a tiny delay may have translated into professional suicide had this glitch happened on May 6th, 2010 ("Flash Crash" day) on Wall Street's stock exchange floor and you were the man in charge. I'm confident that presented with the situation we'll all perform to the best of our abilities.

ccnp

This is a fundamental quality assurance problem. Avoidance goals ("make sure it does not fail xxx times per yy") are always harder to sell because people only see the importance of hitting them when you didn't! The company should have a long-term strategy and a long-term monitoring policy. "When the house is burning, it's no time to build a fire station, but you'd better be building one in your spare time to be better prepared the next time." ;) Long-term monitoring and a great deal of communication to objectively convince yourself and the stakeholders that this truly matters. This *should* help improve the effectiveness of the troubleshooting and reduce the load over time. If it does not, then you *know* you have a fundamental problem and you get a chance to ask for resources to work on a solution... Bottom line: ask yourself the question often, look over the long term, talk to your peers, track objective metrics, make structural changes and evaluate the impact of past decisions. Have fun with it!

nigabud

It's really relative. Spending too much time getting to the root cause makes the user wail. Not fixing the problem tells the MD and finance that you don't know how to do your job. The key for any administrator/technician is to learn how to manage the expectations of the user. Give them a time frame and feedback in a language that they understand.

b4real

But, I hope we don't get lazy and forget our troubleshooting skills over the years.

pgit

I mentioned in another post that, for me, it's probably a good thing job one doesn't include any but the most rudimentary troubleshooting (is it plugged in?) when the service is down. I hope everyone else enjoys the luxury I do: if a situation looks likely to repeat, I can try to convince owners/responsible parties that they should have me spend some time digging into the why of it. I can always ask. Sometimes they'll see the wisdom. But I can't offend anyone over my head, because the only folks upstairs are my wife and Jesus. ;)

hauskins

The issue for me is an environment that has particular hardware and software that meet a criterion for dependability. I am not pushing one hardware/software combination, but my experience (your mileage may vary) with SUN hardware (Oracle) and Solaris on SPARC and Intel/AMD has been very good. I will also toss in that CentOS as an OS on the SUN x86 platforms is stable and reliable. Solaris has the ability to restart failed processes (daemons) automatically, as well as a self-healing diagnostic system that attempts to circumvent problems and keep running while alerting the server admins that there is a problem. I have SUN servers that provide a number of services, and some of them run for over a year without problems (as long as I have UPS/emergency power). Granted, applications can make a difference in how long a server stays operational, especially when there is a heavy continuing load. The only time I troubleshoot is when a reboot doesn't fix the problem or a server has been hacked. I think the future will see less troubleshooting for common issues, since, as one poster points out, VM environments can quickly recover from server failures.

pgit

You should be able to set that up with any Linux distro, so long as you can get something like Nagios going, e.g. script a query to a backend db and, if no reply, run /etc/init.d/xyz restart, or what have you.
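[Editor's note: a minimal sketch of the kind of check-and-restart watchdog described above, written in Python rather than shell. The host, port and service name are placeholders, not details from this thread, and a real setup would hang this off Nagios or cron and alert on repeated restarts rather than loop forever.]

#!/usr/bin/env python3
# Hypothetical watchdog: probe a backend database port and restart the
# service if it stops answering. Host, port and service name are assumptions.
import socket
import subprocess
import time

DB_HOST, DB_PORT = "127.0.0.1", 3306   # backend to probe (assuming MySQL here)
SERVICE = "mysql"                       # init script / service name to bounce
CHECK_INTERVAL = 60                     # seconds between probes

def db_is_answering(host, port, timeout=5):
    """Return True if a TCP connection to the backend can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

while True:
    if not db_is_answering(DB_HOST, DB_PORT):
        # Equivalent to "/etc/init.d/mysql restart" on sysvinit-era distros;
        # a systemd host would use "systemctl restart mysql" instead.
        subprocess.run(["service", SERVICE, "restart"], check=False)
    time.sleep(CHECK_INTERVAL)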

b4real

Too bad Windows isn't as good at that task. Though I don't know how good Linux is at it either.

pgit

Yeah, at least with Windows they don't promise anything. I've had very good luck with just about every type of automation I've set up with Linux. It can be a bear getting it set up, but once in motion it's hard to stop. My biggest concern is keeping track of the integrity of backups (system backups in this case). At the risk of jinxing myself and sealing the deal, I might be facing a bad-backup disaster as we speak... This one could be painful.

IT

This is one area where VMs come into their own. You can clone the server to a template and restore the server from tape, or do a full rebuild. You can even do the rebuild while the failing server is still in production. Once the issue is resolved, move the faulty server into a test and development environment and troubleshoot to your heart's content.
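[Editor's note: a rough sketch of that rebuild-and-quarantine flow, assuming a vSphere environment and using pyVmomi (the vSphere Python SDK). The vCenter address, credentials, VM names, resource pool and "Test-Dev" folder below are all made-up placeholders; this is an outline of the idea, not a drop-in script.]

# Hypothetical sketch: deploy a replacement from a known-good template while the
# faulty VM keeps running, then move the faulty VM into a test/dev folder for
# later troubleshooting. All names below are placeholders.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()   # lab shortcut; use proper certs in production
si = SmartConnect(host="vcenter.example.com", user="admin", pwd="secret", sslContext=ctx)
content = si.RetrieveContent()

def find_by_name(vimtype, name):
    """Return the first inventory object of the given type with a matching name."""
    view = content.viewManager.CreateContainerView(content.rootFolder, [vimtype], True)
    return next(obj for obj in view.view if obj.name == name)

faulty = find_by_name(vim.VirtualMachine, "web-app-01")          # misbehaving VM
template = find_by_name(vim.VirtualMachine, "web-app-template")  # known-good template
pool = find_by_name(vim.ResourcePool, "Resources")               # root resource pool
test_folder = find_by_name(vim.Folder, "Test-Dev")               # quarantine folder

# 1. Stand up a replacement from the template while the old VM is still in production.
spec = vim.vm.CloneSpec(location=vim.vm.RelocateSpec(pool=pool), powerOn=True, template=False)
template.CloneVM_Task(folder=faulty.parent, name="web-app-01-rebuild", spec=spec)

# 2. Once the replacement is serving traffic, park the faulty VM in test/dev
#    and troubleshoot it there at leisure.
test_folder.MoveIntoFolder_Task([faulty])

Disconnect(si)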

b4real

Makes it tempting to simply redeploy a known working system.

tony

I always like to understand the cause of problems. If you don't know the cause, you don't know whether it was a one-off or will recur. Not only that, most real disasters are rarely a single problem, in my experience. It is usually a combination of several problems (at one customer, they ignored the RAID warnings, hard rebooted, and the RAID rebuilt from the older copy in the mirror).

pgit

Has anyone ever seen only encrypted portions of a file system get corrupted? I had this happen recently. Regardless of where they were in the file system, and what the file was for (password hashes, people's encrypted work...), it was all corrupted on this one system. The OS was running fine by the time I got to the machine; everything not encrypted hadn't been touched. This happened after a weekend of bad storms; the power was in "brownout" condition for much of the weekend, and with a couple of critical machines on battery and running fine I didn't get any notifications. (I'd like to be monitoring the 110V coming off the utility.) This building is an electrical disaster area; it has surges and voltage drops regularly. I assumed that the encrypted stuff was corrupted by something in the system that itself responded inappropriately when faced with dirty power. I've measured the mains in this place with a meter and have seen voltage drop to between 60-70 volts in some locations, and surge to 160 in others at the same time. (Grounding issue; I've begged the tenants to get the bldg owner over to fix it.) The computer in question is on surge protection but not battery, so it was half burning along on ~70-80 volts for who knows how long. It obviously didn't get low enough to turn the computer off completely; these machines are set to stay off after a power outage. Was it coincidence? Or has anyone seen where only encrypted data is affected by something/anything?

pgit

I just found out the building has a new owner. Is he in for a surprise... I've been authorized to get this taken care of by the tenant, although electric utilities aren't in my normal purview. They had "mentioned" this a few times, apparently. I assumed they hadn't asked to have it fixed because I also assumed an owner would be a bit concerned about losing the property or being sued for damages. Now I realize the fellow wasn't going to sink another dime into the place, which also explains the wet ceiling panels in one of the rooms. (Thankfully nowhere near the computers.) It's still exceedingly weird to me that only encrypted areas of the drive were corrupted, but I guess there's something to it vis-a-vis the power problems. Maybe there was an instant in which some hash was altered randomly in memory, in some area being used to maintain the translation. After that, any writes would result in data that later can't be decrypted with the key stored on the hard drive, the one that would be used the next time the system booted... or not. =P Pretty interesting, actually. I almost wish I could justify getting to the bottom of this one.

Charles Bundy

...coupled with more frequent writes to encrypted data, corrupts absolutely! Speaking of power, I'd have the utility check and make sure there aren't any loose neutrals on incoming transformers... Start at the pole and work your way in...

pgit

This building has been many things, from a warehouse to a grocery store. The whole thing is a nightmare: plumbing, electric, the poorly laid out half walls... We'd always assumed the problem is inside, but I'll insist they check their equipment on the pole. Three good-sized transformers, and they look WWII vintage. An example: for a long time there was a silvery cable, with that old wrapping like you find on 220 lines from the '50s, dangling from the end of a wall down to about eye level. One of the girls in this place decided to cut it off where it came out of the wall because they were going to start using this area. She discovered the hard way it was live. She was on a ladder cranking on it with a large pair of scissors and then, a shower of sparks. Now that the thing was twice the fire hazard it had been, the girl got off the ladder and on the phone to an electrician. I should get some pics of this place above the ceiling. It really is something to behold.

BarryRobbins76

I was using Comodo on my computer a couple of years back and the same exact thing happened. It corrupted all of the files encrypted with EFS. I would take a look at your AV solution and see if this is a known issue.

pgit

Interesting; it doesn't have Comodo AV (it runs AVG), but this box does have Comodo Backup. Curious. I wonder if there is any relation...

b4real

I usually find myself disabling it, then either adding a different product or setting up policy exclusions.

dca

I would suggest that the encrypted data just happens to be under the disk head(s) when the drive incurs a static hit (for whatever reason). The power supplies are doing way too much work, though - call in the electric company or your Power Super Star (everyone's got one of those guys) to put a meter on your mains to the computer room.

Charles Bundy

Yeow. If you are measuring AC voltage fluctuations like that I bet phase was all over the place. I'd say ALL writes to secondary storage are suspect, but in the unencrypted areas you won't be doing the write cycle as often as you would in the encrypted area. Plus in the unencrypted area you wouldn't notice bit rot at the file system level like you would in the encrypted area.

TimH.

I recently had an instance where there was a red light on one of our servers. Our maintenance company insisted on a photograph of the red light - honestly. They then decided that they did not know what the cause of the red light was and that they were going to replace the server as a result - despite the fact that the server was produced by them. Some people have no technical skills, and troubleshooting seems to be too much trouble.

b4real

Facility security once thought a red light looked out of place and sent it to me. It wasn't my server (outsource partner), but I thought it was pretty good that they did that.

Jesper_L

Whether you choose to get the system back running as fast as possible or decide to first find the root cause of the issue, I think it's important that you have all the various troubleshooting tools ready at hand (I use RDP and KVM over IP). Remote power control is also a must for me (these days I even do a power reboot from within the KVM session).

mgosby

Root cause analysis and restoration of productivity are two essential aspects of network management and require overlapping skills. In the environments that I manage, I've found it helpful to have responsibility for these job functions assigned to two different parties that communicate through a narrative issue/troubleshooting description, analysis of log files and technical meetings, as methods to best use time while serving both needs.

pgit

So true, 2 heads are better than one. Alas, for me this is a luxury I don't often enjoy.

Charles Bundy

Usually doing a root cause analysis (RCA) is part and parcel of the troubleshooting/solution process. We had a guy who would reimage computers at the drop of a hat to fix problems. That is the "the world is a nail, given that I have a hammer" view. It fixes the issue, but no learning occurs and no patterns are documented to proactively fix potential process/infrastructure issues. I guess I don't view the troubleshooting as necessarily being separate or taking extra time. If one goes cowboy, as in the above example, you have the inverse problem of not going far enough. At the end of the day you can never go "too far" so long as your end goal is to restore service back to the point of failure. When we lose sight of that and make the problem's root cause our end goal, we have gone too far. Time to resolve then becomes an overriding factor.

mbrumm

I do not troubleshoot a single incident. Restore the service ASAP; if a quick (10 minutes or less) post-resolution cause can be determined, that is fine. Problem (recurring incident) resolution, however, demands a detailed diagnostic. This is a basic ITIL concept.

support

The primary goal has to be to get the user back up and running. Once that is done then find out why it happened and what can be done to prevent it in the future. Most of the time a reboot takes care of the immediate issue. Then you can go through the logs and files and find out why it happened and try to resolve it.

b4real

On repeat offenses, a different approach to troubleshooting may be required.

Craig_B

If you are learning and coming into a deeper understanding of something so that you can prevent the issue in the future or resolve it more quickly, then spending the time to troubleshoot further is a good thing. Of course this is judged on a case-by-case basis and through experience. If the fix is to reboot the server (which gets it back up and running in a few minutes) but you don't have a clue what is wrong, then maybe you should invest some troubleshooting time to increase your knowledge and understanding. If you're going to completely replace the server next month, then maybe it's not worth the time to understand this particular issue. As I write this, another thought comes to mind: troubleshooting is an investment, and the key is to understand the return on investment to determine the worth of the process. [Side note: I had to troubleshoot entering this message since IE 9 failed to submit it, so I switched to FF 5.]

pgit

Was that a one-time glitch in sending, or does it fail consistently? (No doubt some default "security" setting if so.) I caught myself in the nick of time yesterday doing updates on a Vista machine; I always look at "suggested" updates and the "9" didn't register right off... I aborted the updates after a minute or so when the ramifications of an IE9 upgrade hit me. I have one user (that I know of and who was using IE) that got really cheesed when he installed IE9; he called me to get 8 back on it. When I got there I saw he had a Chrome icon on the desktop. I asked him if he liked that and he said he did. So I told him to either just use that or I'll have a seat and turn the meter on... problem solved. :)

mhunter392

I agree with Rick's basic approach. I too prefer to troubleshoot why a problem existed, to prevent a recurrence. However, when the server must be back up and operational as fast as possible, I will usually dump all logs (event viewer, application, database, etc.) to a special folder/thumb drive for later review, then get the server online ASAP.
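[Editor's note: a small sketch of that "grab the evidence first" habit, assuming a Windows server: export the main event logs to a timestamped folder with wevtutil before rebooting or restoring. The E:\ destination and the list of logs are illustrative assumptions; application and database logs would be copied the same way from their own locations, and exporting the Security log needs an elevated prompt.]

# Hypothetical evidence grab before restoring service: copy the Windows event
# logs out to removable media so they can be reviewed after the server is back.
import os
import subprocess
from datetime import datetime

DEST_ROOT = r"E:\incident-logs"               # thumb drive or network share (assumed)
LOGS = ["System", "Application", "Security"]  # event logs worth keeping

stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
dest = os.path.join(DEST_ROOT, stamp)
os.makedirs(dest)

for log in LOGS:
    # "wevtutil epl" exports an event log to an .evtx file for offline review.
    subprocess.run(["wevtutil", "epl", log, os.path.join(dest, log + ".evtx")], check=True)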

pgit

It drives me nuts when I don't know what's going on. Good thing "job one" is getting the services up ASAP; I have a tendency to blurt out disconcerting comments when troubleshooting, a lot of "uh-oh!"s, "what the hell?"s and such. I was helping another contractor once and he pointed this out to me. He said the customer looked at me with increasing worry as I muttered out my thoughts. So yeah, good thing the job is to get the server back online. I can "I don't believe this!?!" at my leisure back in the lab where no customer ever goes. :)

b4real

That getting root cause isn't worth it. I guess that is an opportunity for better "upward communication" on my part.