As a system administrator, I've spent over 25 years in technology and seen all the ups and downs of troubleshooting and resolving problems. In the heat of the moment, as stress and frustration pile up, desperation can cause IT staff to make bad choices that further compound the problems and delay resolution.
Whether trying to diagnose a single device or dealing with the urgency of a company-wide outage, there are solid best practices on what NOT to do. With that in mind, here are 10 things to avoid doing, so you can limit the pain and keep things running as smoothly as possible:
1. Working alone
There's nothing lonelier than troubleshooting a serious issue on your own — whether in the middle of the night or amidst a busy day when coworkers are engaged elsewhere. Many IT staffers have a "go it alone" mentality, but when serious problems are underway, that's the last thing you need.
Get support from others. My company has a policy of opening an audio conference bridge when major issues occur so we can all collaborate on the resolution. And believe me: at first I found it cumbersome and limiting to have to talk about the problem rather than fix it, but the insight and input from others more than made up for it.
Even if you're facing a minor problem you can't seem to get your head around, consulting a fellow staffer or getting onto an online forum to share ideas with others can work wonders for you.
2. Downplaying the impact
When something is broken it might be tempting to say, "No, no, it's fine — just a quick fix." That can be a huge mistake. Let it be known how serious the issue is. If someone else needs to run interference for you to let you focus on the problem, make it so.
The worst thing you can do is face a major issue without everyone knowing the scope and the extent of the problem. That will severely curtail your ability to properly fix it and move forward. Be honest and forthright, and if mistakes were made, keep them in plain sight. Don't try to hide the root cause if it was a human error, since it will just make the situation worse.
3. Blindly following Google results
A running joke is that IT people just Google error messages and apply whatever recommendations come up. Experience has proven this is a bad idea. The internet is full of well-meaning folks (or perhaps wannabe sysadmins) who have an answer for every possible issue. Whether the answers actually hold true is another matter. I've seen more than my share of people who thought they knew what they were doing and insisted most, if not all, problems were caused by DNS or antivirus issues. That's rarely the case.
Don't accept their advice at face value. Seek out qualified websites for support rather than user forums where anyone can pipe up with an answer, and above all, don't start making changes just because Google led you to a page with a hopeful-sounding tip. That might only make things worse. You could end up having to back out a bunch of useless changes you tried because someone's brother's cousin's dogcatcher swore it worked (on an OS from eight years ago).
4. Making an irrevocable change
When attempting to fix an issue, never make a change you can't back out. I found this out the hard way when working on a client's Windows 7 system: I added a registry key without backing up the registry. I later ended up spending six to eight hours remediating the problem I had caused, on top of fixing the original issue.
Always back up files, settings, configuration data or anything else you're changing — even if it means getting a screenshot of the original settings. Never leave yourself at the mercy of fate. You should always be able to undo what you've tried.
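The back-up-first habit can even be scripted. Here's a minimal sketch in Python (the function name and config path are hypothetical, not from any particular toolkit):

```python
import pathlib
import shutil
import time

def backup_before_change(path: str) -> str:
    """Copy a file to a timestamped .bak sibling before editing it,
    so the change can always be backed out later."""
    src = pathlib.Path(path)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    backup = src.with_name(f"{src.name}.{stamp}.bak")
    shutil.copy2(src, backup)  # copy2 preserves timestamps and permissions
    return str(backup)
```

On Windows, the same discipline applies to the registry: run `reg export` on the key you're about to touch before adding or deleting anything.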
5. Using the shotgun approach
When it comes to IT problems, I get the sense of panic many sysadmins feel. We want to get things up and running ASAP and save our jobs. But applying several would-be fixes at once to try to maximize the chances for resolution is actually detrimental to your endeavors.
Say you make three or four changes to a faulty system, then reboot it to see whether it works as expected. If it does, all well and good, but you'll never know exactly which change fixed the problem. If it doesn't, you've added more complexity to the problem, and it may take far longer to fix.
Make a single change at a time, then test it. If it helps, commit the change; if not, back it out before trying the next one.
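The discipline above can be sketched as a loop. This is a toy illustration, not a real remediation tool; the apply, undo, and is_healthy hooks are hypothetical stand-ins for whatever the actual fix and health check are:

```python
def try_fixes_one_at_a_time(candidates, apply, undo, is_healthy):
    """Apply candidate fixes singly: keep the first one that restores
    health, and back out any change that doesn't help."""
    for fix in candidates:
        apply(fix)
        if is_healthy():
            return fix   # this change fixed it; commit and stop here
        undo(fix)        # no improvement: back it out before the next try
    return None          # nothing on the list helped

# Toy usage: track "applied changes" in a set, where only one fix matters.
state = set()
winner = try_fixes_one_at_a_time(
    ["restart service", "clear cache", "rotate logs"],
    apply=state.add,
    undo=state.discard,
    is_healthy=lambda: "clear cache" in state,
)
# winner is "clear cache", and state holds only that single change
```

The point of the structure is that when the loop exits, the system carries at most one change, so you always know what fixed it.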
6. Overlooking the obvious
We're all guilty of this in the IT world. Something broke? Maybe it was a hack, malicious insider activity, or an exploited vulnerability, we might think. Or perhaps it was caused by a hardware failure.
Not so fast. I can't tell you how many system problems can be caused by one very simple factor: lack of disk space. It causes all sorts of insanely weird problems, from authentication issues to service failures.
The same applies to all manner of everyday IT issues: expired passwords (these can make service accounts really act up), access permissions, expired SSL certificates and the like. When functionality breaks, start with the basics of the overall picture rather than assuming some outside influence or complex issue caused the problem.
7. Not keeping a log
This is a significant problem. When you're trying to fix things, keeping a log of what you did will pay huge dividends. It gives you a trail to look back on later and keeps you from duplicating your efforts.
I've been there in the field, working at 4 AM trying to revive a dead database server, and a handwritten log of which databases I had worked on proved invaluable. I realize it's tedious and time-consuming to keep track of your activities when you want the issue fixed ASAP, but you're not doing yourself any favors by skipping the log. When stress, frustration and confusion pile up, you want that record of where you've been and what you've done, believe me.
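The log doesn't need to be fancy; a few lines of scripting will do. A minimal sketch (the file name and log entry are made-up examples):

```python
import datetime

def log_step(logfile, action):
    """Append a timestamped entry describing what was just tried,
    leaving a trail to review when stress and confusion pile up."""
    stamp = datetime.datetime.now().isoformat(timespec="seconds")
    with open(logfile, "a") as f:
        f.write(f"{stamp}  {action}\n")

# log_step("incident.log", "restarted SQL service on db01 - no change")
```

Appending rather than overwriting matters: the value of the log is the full sequence of what you tried and in what order.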
8. Not considering the ramifications
All too often, when we focus on resolving a technology issue, we fail to consider the full ramifications of the solution we want to implement. Understanding the overall picture is critical to success.
For instance, a while back I worked to resuscitate a failed Exchange server that had run out of disk space. I needed to move some of the mailbox databases to an external drive so the server could be brought back up.
The server in question had a USB 2.0 card which could transfer data at a maximum 480 megabits per second. I glumly accepted the fact that the hefty databases would take a while to transfer to the external drive.
Had I been thinking, I would have realized I could have gone out and bought a USB 3.0 card for the system, which could transfer data at 5 gigabits per second. Yes, it would have involved a trip to my local computer parts store as well as shutting down the Exchange server — but that system wasn't too happy to begin with. And I would have saved many hours by taking that approach rather than waiting for the existing slow transfer to complete. Be nimble and forward-thinking rather than accepting your existing limitations.
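The back-of-the-envelope math is worth doing up front. A rough sketch (the 500 GB figure and the 80% efficiency factor are assumptions; real-world throughput is often lower still):

```python
def transfer_hours(size_gb, link_mbps, efficiency=0.8):
    """Estimate copy time for `size_gb` gigabytes over a link rated
    at `link_mbps` megabits per second, derated for protocol overhead."""
    bits = size_gb * 8 * 1000**3                      # decimal GB -> bits
    seconds = bits / (link_mbps * 1_000_000 * efficiency)
    return seconds / 3600

# A hypothetical 500 GB of mailbox databases:
#   transfer_hours(500, 480)   -> roughly 2.9 hours on USB 2.0
#   transfer_hours(500, 5000)  -> well under half an hour on USB 3.0
```

Two minutes of arithmetic like this would have told me the store run was worth it before the copy ever started.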
This applies to many other situations as well, of course. Considering ramifications means seeing the end results of your actions across the board. Restoring the entire server will take 24 hours? Can you restore just the last day's worth of data, rather than ten years of information, so people can get back to work ASAP?
You can see where I'm going with this. Your solution should be applicable to the scope of the problem and the impact at hand. Feel free to be creative and flexible. Your users will appreciate it.
9. Not holding a post-mortem
All too often, those of us in IT just want the problem fixed so we can move on to the next issue. That mentality guarantees we'll face the same issues over and over again. Always hold a post-mortem after the fact to determine:
- What went wrong?
- What could we have done better?
- Is this issue likely to happen again?
- If so, what can we do to prevent it next time?
- Is additional training needed?
- How can we obtain this training?
- Are additional safeguards required?
- How can we ensure all responsible staff know about this issue?
I want to stress that finger-pointing and blame games are counterproductive. Approach this step as a team to give the business the best chance of success going forward.
10. Failing to document
Number nine is meaningless if this isn't all written up, and the documentation kept up to date as circumstances change. I can't stress enough that IT is more than putting out fires: it's about writing the documents necessary to make sure the same fires don't spring up, and ensuring all responsible staff are sufficiently trained to work on these problems next time.
People change positions or leave jobs, technology evolves, and user and business requirements change constantly. Knowing how to handle known issues that have cropped up in the past — or retiring from the documentation those that are no longer a threat — is the key to ensuring the IT department can address issues in a meaningful, proactive fashion.
Scott Matteson is a senior systems administrator and freelance technical writer who also performs consulting work for small organizations. He resides in the Greater Boston area with his wife and three children.