I agree with Rick's basic approach. I too prefer to troubleshoot why a problem existed, to prevent a recurrence. However, when the server must be back up and operational as fast as possible, I will usually dump all logs (event viewer, application, database, etc.) to a special folder/thumb drive for later review, then get the server online ASAP.
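For anyone who wants to script the "dump the logs first, review later" step, here is a minimal sketch. It assumes the logs of interest are ordinary files or directories on disk; the function name `snapshot_logs` and the example paths in the note below are made up for illustration, and Windows event logs would need to be exported to files by other means first.

```python
import shutil
import time
from pathlib import Path


def snapshot_logs(sources, dest_root):
    """Copy each existing log file or directory into a new timestamped folder.

    Returns the folder that was created. Missing sources are skipped so that
    one absent log never delays getting the server back online.
    """
    stamp = time.strftime("%Y%m%d-%H%M%S")
    dest = Path(dest_root) / f"incident-{stamp}"
    dest.mkdir(parents=True, exist_ok=True)
    for index, src in enumerate(map(Path, sources)):
        if not src.exists():
            continue
        # Prefix with the position in the list so two sources that share
        # a file name do not overwrite each other.
        target = dest / f"{index:02d}-{src.name}"
        if src.is_dir():
            shutil.copytree(src, target)
        else:
            shutil.copy2(src, target)
    return dest
```

Something like `snapshot_logs(["/var/log/syslog", "/var/log/myapp"], "/mnt/usb")` (hypothetical paths) would then be the first thing to run before the reboot, leaving a dated folder to dig through back in the lab.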
ditto
pgit 27th Jun 2011
It drives me nuts when I don't know what's going on. It's a good thing "job one" is getting the services up ASAP, because I have a tendency to blurt out disconcerting comments when troubleshooting: a lot of "uh-oh!"s, "what the hell?"s and such.

I was helping another contractor once and he pointed this out to me. He said the customer looked at me with increasing worry as I muttered out my thoughts.

So yeah, good thing the job is to get the server back online. I can "I don't believe this!?!" at my leisure back in the lab where no customer ever goes. :)
That getting root cause isn't worth it. I guess that is an opportunity for better "upward communication" on my part.
If you are learning and coming into a deeper understanding of something so that you can prevent the issue in the future or resolve it more quickly, then spending the time to troubleshoot further is a good thing. Of course, this has to be judged case by case and through experience.

If the fix is to reboot the server (which gets it back up and running in a few minutes) but you don't have a clue what is wrong, then maybe you should invest some troubleshooting time to increase your knowledge and understanding. If you're going to completely replace the server next month, then maybe it's not worth the time to understand this particular issue.

As I write this, another thought comes to mind: troubleshooting is an investment, and the key is to understand the return on that investment to determine whether the process is worth it.
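To put rough numbers on that return, a back-of-the-envelope sketch might look like the following. Everything here is an assumption for illustration: the function name, the inputs, and above all the simplification that finding the root cause prevents every future recurrence.

```python
def troubleshooting_payoff(hours_to_diagnose, expected_recurrences,
                           hours_lost_per_recurrence):
    """Hours saved (positive) or wasted (negative) by finding the root cause.

    Assumes that fixing the root cause eliminates all expected recurrences.
    """
    avoided = expected_recurrences * hours_lost_per_recurrence
    return avoided - hours_to_diagnose
```

With four hours of digging against six expected one-hour outages, the payoff is two hours in your favor. If the server is being replaced next month and you expect at most one more half-hour reboot, the same four hours come out at a loss of three and a half.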

[Side Note: I had to troubleshoot entering this message since IE 9 failed to submit it, so I switched to FF 5.]
IE9 question
pgit 27th Jun 2011
Was that a one time glitch in sending or does it fail consistently? (no doubt some default "security" setting if so)

I caught myself in the nick of time yesterday doing updates on a Vista machine. I always look at 'suggested' updates and the "9" didn't register right off... I aborted the updates after a minute or so when the ramifications of an IE9 upgrade hit me.

I have one user (that I know of and who was using IE) who got really cheesed when he installed IE9; he called me to get 8 back on it. When I got there I saw he had a Chrome icon on the desktop. I asked him if he liked that and he said he did. So I told him either just use that or I'll have a seat and turn the meter on... problem solved. :)
The primary goal has to be to get the user back up and running. Once that is done then find out why it happened and what can be done to prevent it in the future. Most of the time a reboot takes care of the immediate issue. Then you can go through the logs and files and find out why it happened and try to resolve it.
On repeat offenses, a different approach to troubleshooting may be required.
I do not troubleshoot a single incident. Restore the service ASAP; if a quick (10 minutes or less) post-resolution cause can be determined, that is fine. Problem resolution (for recurring incidents), on the other hand, demands a detailed diagnostic. This is a basic ITIL concept.
Usually doing a root cause analysis (RCA) is part and parcel of the troubleshooting/solution process. We had a guy who would reimage computers at the drop of a hat to fix problems. That is the "world is a nail given that I have a hammer" view. It fixes the issue, but no learning occurs and no patterns are documented to proactively fix potential process/infrastructure issues.

I guess I don't view the troubleshooting as necessarily being separate or taking extra time. If one goes cowboy as in the above example you have the inverse problem of 'not going far enough'.

At the end of the day you can never go 'too far' so long as your end goal is to restore service back to point-of-failure. When we lose sight of that and make the problem root cause our end goal we have gone too far. Time to resolve then becomes an overriding factor.
Separation of Duties as a Path
mgosby@... Updated - 27th Jun 2011
Root cause analysis and restoration of productivity are two essential aspects of network management and require overlapping skills. In the environments that I manage, I've found it helpful to assign responsibility for these job functions to two different parties. They communicate through the narrative issue/troubleshooting description, analysis of log files and technical meetings, which makes the best use of time while serving both needs.
2 heads theory
pgit 27th Jun 2011
So true, 2 heads are better than one. Alas, for me this is a luxury I don't often enjoy.
Whether you choose to get the system back running as fast as possible or whether you decide to first find the root cause of the issue, I think it's important that you have all the various tools (I use RDP and KVM IP) for troubleshooting ready at hand. Also remote power control is for me a must (today, I even do power reboot from within the KVM session).
Job too far
TimH. 27th Jun 2011
I recently had an instance where there was a red light on one of our servers. Our maintenance company insisted on a photograph of the red light - honestly. They then decided that they did not know what the cause of the red light was and that they were going to replace the server as a result - despite the fact that the server was produced by them. Some people have no technical skills and troubleshooting seems to be too much trouble.
The facility security thought it was out of place and sent it to me once. It wasn't my server (outsource partner), but I thought it was pretty good that they did that.
Has anyone ever seen only encrypted portions of a file system get corrupted? I had this happen recently. Regardless of where they were in the file system, and what the file was for, (password hashes, people's encrypted work...) it was all corrupted on this one system. The OS was running fine by the time I got to the machine, everything not encrypted hadn't been touched.

This happened after a weekend of bad storms, the power was in "brown out" condition for much of the weekend, and with a couple critical machines on battery and running fine I didn't get any notifications. (I'd like to be monitoring the 110v coming off the utility)

This building is an electrical disaster area, it has surges and voltage drops regularly. I assumed that the encrypted stuff was corrupted by something in the system that itself responded inappropriately when faced with dirty power. I've measured the mains in this place with a meter and have seen voltage drop to between 60-70 volts in some locations, and surge to 160 in others at the same time. (grounding issue, I've begged the tenants to get the bldg owner over to fix it)

The computer in question is on surge protection but not battery, so it was half burning along on ~70-80 volts for who knows how long. It obviously didn't get low enough to turn the computer off completely, these machines are set to stay off after a power outage.

Was it coincidence? Or has anyone seen a case where only encrypted data is affected by something/anything?
Aieee, death from above!
Charles Bundy Updated - 27th Jun 2011
Yeow. If you are measuring AC voltage fluctuations like that I bet phase was all over the place. I'd say ALL writes to secondary storage are suspect, but in the unencrypted areas you won't be doing the write cycle as often as you would in the encrypted area. Plus in the unencrypted area you wouldn't notice bit rot at the file system level like you would in the encrypted area.
I would suggest that encrypted data just happens to be under the disk head(s) when the drive incurs a static hit (for whatever reason).
The power supplies are doing way too much work though - call in the electric company or your Power Super Star (everyone's got one of those guys) to put a meter on your mains to the computer room.
thanks, guys
pgit 28th Jun 2011
I just found out the building has a new owner. Is he in for a surprise... I've been authorized to get this taken care of by the tenant, although electric utilities aren't in my normal purview.

They had "mentioned" this a few times, apparently. I assumed they hadn't asked to have it fixed because I also assumed an owner would be a bit concerned about losing the property or being sued for damages. Now I realize the fellow wasn't going to sink another dime in the place, which also explains the wet ceiling panels in one of the rooms. (thankfully nowhere near the computers)

It's still exceedingly weird to me that only encrypted areas of the drive were corrupted, but I guess there's something to it given the power problems. Maybe there was an instant in which some hash was altered randomly in memory, in some area being used to maintain the translation. After that, any writes would result in data that later can't be decrypted with the key stored on the hard drive, which would be used the next time the system booted... or not. =P

Pretty interesting, actually. I almost wish I could justify getting to the bottom of this one.
I was using Comodo on my computer a couple of years back and the same exact thing happened. It corrupted all of the files encrypted with EFS. I would take a look at your AV solution and see if this is a known issue.
I usually find myself disabling it, then either adding a different product - or setting up policy exclusions as well.
wow
pgit 29th Jun 2011
Interesting. It doesn't have Comodo AV (it runs AVG), but this box does have Comodo backup. Curious. I wonder if there is any relation...
coupled with more frequent writes to encrypted data corrupts absolutely!

Speaking of power I'd have the utility check and make sure there aren't any loose neutrals on incoming transformers... Start at the pole and work your way in...
This building has been many things, from a warehouse to a grocery store. The whole thing is a nightmare, plumbing, electric, the poorly laid out half walls... we'd always assumed the problem is inside, but I'll insist they check their equipment on the pole. 3 good sized transformers and they look WWII vintage.

An example: for a long time there was a silvery cable, with that old wrapping like you find on 220 lines from the 50's, dangling from the end of a wall down to about eye level. One of the girls in this place decided to cut it off where it came out of the wall because they were going to start using this area. She discovered the hard way it was live. She was on a ladder with a large scissor cranking on it and, bang, a shower of sparks.

Now that the thing was twice the fire hazard it had been, the girl got off the ladder and on the phone to an electrician.

I should get some pics of this place above the ceiling. It really is something to behold.
I always like to understand the cause of problems. If you don't know the cause, you don't know if it was a one off or is recurring.
Not only that, most real disasters are rarely a single problem, in my experience. It is usually a combination of several problems (at one customer, he ignored the RAID warnings and hard rebooted, and the RAID rebooted and rebuilt from the older copy in the mirror).
This is one area where VMs come into their own. You can clone the server to a template and restore the server from tape or do a full rebuild. You can even do the rebuild while the failing server is still in production.
Once the issue is resolved, move the faulty server into a test and development environment and troubleshoot to your heart's content.
Makes it tempting to simply redeploy a known working system.
The issue for me is an environment that has particular hardware and software that meet a criterion for dependability. I am not pushing one hardware/software combination, but my experience (your mileage may vary) with SUN hardware (Oracle) and Solaris on Sparc and Intel/AMD has been very good. I will also toss in CentOS, which as an OS on the SUN x86 platforms is stable and reliable.

Solaris has the ability to restart failed processes (daemons) automatically, and it has a self-healing diagnostic system that attempts to circumvent problems and keep running while alerting the server admins that there is a problem.

I have SUN servers that provide a number of services and some of them run for over a year without problems (as long as I have UPS/Emergency power). Granted that applications can make a difference in how long a server stays operational, especially when there is a heavy continuing load.

The only time I troubleshoot is when a reboot doesn't fix the problem or a server has been hacked.

I think the future will see less troubleshooting for common issues, since as one poster points out VM environments can quickly recover from server failures.
You should be able to set that up with any Linux distro, so long as you can get something like Nagios going, e.g. script a query to a backend db, and if there is no reply, run /etc/init.d/xyx.. restart, or what have you.
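A minimal sketch of that "probe, and restart if there is no reply" idea, for anyone who wants a starting point. It is not Nagios configuration, just a standalone script you could run from cron. Note the swap: it probes with a plain TCP connect rather than a real database query, which is a weaker check. The host, port and init script path in the note below are placeholders, not anything from this thread.

```python
import socket
import subprocess


def port_answers(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_and_restart(is_healthy, restart_cmd, run=subprocess.run):
    """Run restart_cmd only when the health probe fails.

    Returns True if a restart was attempted, False if the service was healthy.
    The run parameter exists so the restart can be faked when testing.
    """
    if is_healthy():
        return False
    # check=False: a failing restart should be reported by monitoring,
    # not crash the watchdog itself.
    run(restart_cmd, check=False)
    return True
```

A cron job could then call `check_and_restart(lambda: port_answers("127.0.0.1", 5432), ["/etc/init.d/myservice", "restart"])`, where the port and script name are hypothetical. A blind restart loop can also hide a recurring fault, so it is worth logging every restart it performs.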
Too bad Windows isn't as good at that task. Though, I don't know how good Linux is either at it.
lol
pgit 5th Jul 2011
Yeah, at least with Windows they don't promise anything.

I've had very good luck with just about every type of automation I've set up with Linux. It can be a bear getting it set up, but once in motion it's hard to stop. My biggest concern is keeping track of the integrity of backups. (system in this case)

At the risk of jinxing myself and sealing the deal, I might be facing a bad-backups disaster as we speak... This one could be painful.
It's really relative. Spending too much time getting to the root cause makes the user wail. Not fixing the problem tells the MD and finance that you don't know how to do your job. The key for any administrator/technician is to learn how to manage the expectations of the user. Give them a time frame and feedback in a language that they understand.
Agreed
b4real@... 28th Jun 2011
But, I hope we don't get lazy and forget our troubleshooting skills over the years.
seconded
pgit 29th Jun 2011
I mentioned in another post that for me, it's probably a good thing job one doesn't include any but the most rudimentary troubleshooting (is it plugged in?) if the service is down.

I hope everyone else enjoys the luxury I do; if a situation looks to me to be likely to repeat, I can try to convince owners/responsible parties that they should have me spend some time digging into the why of it.

I can always ask. Sometimes they'll see the wisdom. But I can't offend anyone over my head because the only folks upstairs are my wife and Jesus. ;)
This is a fundamental Quality Assurance problem. Avoidance goals ("make sure it does not fail xxx times per yy") are always harder to sell because people only see the importance of hitting them when you didn't!

The company should have a long-term strategy and a long-term monitoring policy.
"When the house is burning, it's no time to build a fire station", but you'd better be building one in your spare time to be better prepared the next time. ;)

Long-term monitoring and a great deal of communication let you objectively convince yourself and the stakeholders that this truly matters. This * should * help improve the effectiveness of the troubleshooting and reduce the load over time. If it does not, then you * know * you have a fundamental problem and you get a chance to ask for resources to work on a solution...

Bottom line:
Ask yourself the question often, look over the long-term, talk to your peers, track objective metrics, make structural changes and evaluate the impact of past decisions.
Have fun with it!
Taking your problem's variables into consideration should give you a good advantage. Going the extra mile is not always the best policy, especially if time is of the essence. It seems to me that applying a bit of that ever-scarce common sense should do the trick. Weighing the pros and cons of taking the time to nip the problem in the bud once and for all could put you in charge of the IT department, if the scenario had taken place in the communications room of a retirement home. On the other hand, a tiny delay may have translated into professional suicide had this glitch happened on May 6th, 2010 ("Flash Crash" day) on the floor of the Wall Street stock exchange with you as the man in charge. I'm confident that, presented with the situation, we'll all perform to the best of our abilities.