For the last two weeks I have been focusing every minute of every day on developing and deploying Microsoft Exchange. In addition, my personal life has been booked solid. After weeks of being busy, I could see an enjoyable light at the end of the tunnel: Wednesday evening, nothing to do but play Rogue Spear online. Enter Stage Left: Murphy’s Law.
The crisis occurs: Wednesday morning
Wednesday morning was running smoothly. I had cleared the day in hopes of focusing on developing a customized installation of Office 2000 and Internet Explorer 5.01. After clearing up some general network administration, I was launching an Office 2000 Resource Kit utility when I received a call reporting an error logging in to our single domain.
I’ve been working for my current employer for a mere two months. In that short time, I’ve realized that information technology is the center of the operations circle. I have found that the IT structure, policies, and procedures were assembled quickly because of a lack of human resources and time. The company has tripled in size in the past six months, and the IT department has barely kept up with the daily duties.
Needless to say, my company realizes that improvements must be made in the IT arena. Our Novell 5.0-based network utilizes NDS for NT, a handful of NT application servers, and a single Windows NT 4.0 domain model with a PDC and a couple of BDCs. Most of the hardware is in desperate need of an upgrade.
The investigation stage
At first, I thought the user was having an individual problem. However, after some testing, I realized that the entire user base was unable to log in to Novell and establish a trust in the NT environment. I let the Director of Information Technology know what was going on as I traveled to the server room, and we began to troubleshoot the problem. Instantly, we noticed a problem with logging in to the PDC. The IT director was attempting to log in to the server and received an error that he was not authorized to log in from this workstation.
About this time, we noticed the convenient beeping of the Novell server’s PC speaker. The director switched over to Novell and saw that the server had abended! Aggravated, we were forced to power down the server. I made the necessary announcements to the users who were still logged in to the network, and the servers were downed.
The problem discussion
My boss and I talked through what had taken place to try to understand it. We each had our own theory. I thought Novell had abended and, through NDS, had corrupted the SAM databases on the PDC and BDCs. He thought NT had become corrupted and caused Novell to abend. I didn’t want to get into a Network Operating System debate. I just wanted to fix this puppy and get home at 5:30 to play Rogue Spear!
The crucial mistake
My boss powered on the servers, then decided at the last moment to add more memory while we had Novell down. Although this seemed to be an excellent idea, we later realized this was a huge mistake that cost us hours of downtime. We popped in the RAM and powered up Novell. Being from the South, I have been known to make some weird analogies, but honestly, the server was “slower than death.” My boss assumed Novell had died.
Now our problems were growing. Instead of possibly having to rebuild two NT servers, we now faced recreating the whole network environment. Remember when I mentioned our somewhat weak IT procedures? In a nutshell, the network backup plan calls for a manual backup when we get a chance. As a result, our Novell backup was two weeks old, and unfortunately, no backup of NT had been made.
The simple fix
We brought down Novell again, and I suggested that we swap the new RAM for the old memory. I thought the new RAM might not like the hardware in the Novell server. Instead, we continued with the new memory and still had problems. My boss called a consulting company, and one of its lead Novell engineers was dispatched. After five hours of running volume repair, tweaking startup files, and utilizing DSREPAIR, the consultant agreed that we should, as a last-ditch effort, put the old RAM back in the Novell server. Five minutes later, the server was up and fully operational!
Novell was fixed, but this was a “side-tracked” problem. The NT environment needed to be tested. My boss quickly attempted to log in. However, he was unable to successfully get on the network. We had discussed our options for recovery. Our plan was to rebuild the NT servers and let Novell’s NDS for NT populate the SAMs with user account information.
As my boss organized himself for the new server installation, I decided to log in to the server. I was successful! Further investigation revealed that the user account my boss was using restricted login to one specific workstation. He remembered setting this “option” a few days earlier during some testing. After we mopped up and verified that the network was fully functional, my boss and I discussed the whole situation. The problem could have been resolved by a simple reboot. How often do the little things cause the biggest problems?
Here’s a short summary of the lessons I learned:
- Develop and test a backup plan. The first month I was with the company, I developed and presented a backup plan using a jukebox method and Backup Exec software. This unbudgeted project was put on hold. I am glad to say that we are now moving forward with a proper backup strategy. We were very lucky this crisis was easily remedied.
- Schedule hardware upgrades only when the environment is in a known good state. In our situation, we replaced good memory with bad. This error cost us valuable time.
- Communication is key. The Director of IT should have let the Network Administrator know he had altered our server’s user account. In turn, I should have insisted on reviewing the account sooner. Better communication would have helped reduce the downtime. We will soon implement a server change log in which every change is fully documented.
- Walk away from a problem to take a deep breath, then return to the troubleshooting basics. We were so intent on fixing the problem that we neglected to first identify the problem. In the future, I'll make it a point to postpone my attack on any new problem for at least two minutes and dedicate this pause to evaluating the situation. Taking a deep breath will help you focus and will reduce the confusion of jumping into a technical fire without an extinguisher.
- Never anticipate that you can have a smooth day, get off on time, and enjoy hours of play online with Rogue Spear. Murphy’s Law is inevitable. I must have jinxed myself!