The first time I ever broached the subject of network downtime with an employer, he was horrified: "What do you mean? I thought this was supposed to run 24 hours a day, every day!"
It took me longer than it should have to calm him down and explain that it was quite normal for the system to close down occasionally. I compared it to the regular servicing required by his car or central heating boiler. In retrospect, it would have been wise to raise the subject before it became a necessity.
Whether it’s for regularly scheduled maintenance or to deal with an unexpected snafu, the network has to be down at some point. The important thing is to make sure that company employees are aware of the possibility of downtime and that, in fact, they should occasionally expect it. Here are my suggestions on managing users’ and executive managers’ expectations for both regularly scheduled and unplanned downtime.
Regularly scheduled downtime
If it proves necessary to close the system for maintenance, you’ll want to discuss the timing of the shutdown with all departmental heads to try to reach some form of agreement. You usually won't be able to please everybody, but you’ll want to be sure that the impact is contained as much as possible.
It might be necessary to perform maintenance after normal working hours. For example, if your office runs 9 A.M. to 5 P.M., it may be best to schedule downtime for the early evening. That way, you can fix any problems and get things up and running again before the morning. If your offices operate 24/7, there will never be a good time to perform maintenance, so talk to your users and negotiate a time.
When you have decided on your maintenance time period, make sure you have everything planned meticulously so that the whole operation is as smooth as possible, and you are up and running again before your scheduled time slot has expired.
It can be costly and humiliating to shut the network down for some routine work only to find that, at the end of it, you are unable to restore the service because you forgot something. Therefore, your preparations should involve a test rollout of the planned upgrade. Use this test to assess the impact the upgrade will have and to ensure that it has no unforeseen side effects. It’s also important to document the effects of your test rollout. I have found it very useful to keep a server diary in which I note everything that has changed on the system.
Send multiple notifications
You’ll want everyone to have notice of the event well in advance, so send out e-mail reminders one week before the event and again one day before it. Then, five minutes before the shutdown, send a system message to all logged-on users telling them to save their work and log off. Keep a log of every message you send out, because someone will inevitably say they weren't told.
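The reminder schedule described above can be worked out mechanically from the shutdown time. The following is a minimal Python sketch, not a finished tool: the function names, the channel labels, and the "all-staff" recipient are illustrative assumptions, and actually sending the messages is left to whatever mail or messaging system you use.

```python
from datetime import datetime, timedelta

# Offsets follow the schedule above; the channel labels are illustrative.
REMINDER_OFFSETS = [
    (timedelta(weeks=1), "e-mail"),
    (timedelta(days=1), "e-mail"),
    (timedelta(minutes=5), "system message"),
]


def reminder_schedule(shutdown_at):
    """Return (send_time, channel) pairs, earliest first."""
    return [(shutdown_at - offset, channel) for offset, channel in REMINDER_OFFSETS]


def record_sent(log, send_time, channel, recipients):
    """Append one line per notice so there is a record of who was told, and when."""
    log.append(f"{send_time:%Y-%m-%d %H:%M} {channel} -> {', '.join(recipients)}")


if __name__ == "__main__":
    shutdown = datetime(2024, 3, 15, 18, 0)  # hypothetical 6 P.M. maintenance slot
    sent_log = []
    for when, channel in reminder_schedule(shutdown):
        record_sent(sent_log, when, channel, ["all-staff"])  # hypothetical recipient
    print("\n".join(sent_log))
```

Keeping the log as plain text lines makes it easy to paste into the server diary alongside the rest of the maintenance record.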
Avoid embarrassment by checking the obvious
The first server I ever administered was a Dell running NT4 Server. It was working well when I was asked to move it from the main office to a secure area. At first, the job seemed fairly simple.
There was no network point in the archive room where it was to live, so the first thing I had to do was arrange one. This entailed simply drilling a hole through a partition wall and running a cable through it. Having ensured that my new point was live by plugging a desktop system into it, I arranged with the rest of the company the best time to move the server.
As it turned out, the entire company was in a meeting discussing their next research project, so I had free run of the building and the network. They were having lunch sent in, so I figured I had plenty of time to allow for disasters. I unplugged the server screen and moved it to the new area.
I piled the keyboard, mouse, and UPS onto the server case, unlocked the wheels, and removed the power plug. The server was still active, running from the UPS, which was set to run the system for 20 minutes before closing down. In no time, everything was plugged back in, and I ran around the office to make sure that I could see server volumes on the desktop machines.
"Great," I thought, "a job well done."
When the rest of the workforce returned, however, it turned out that the messaging services had stalled, and nobody could send e-mail. I took a few more minutes to restart the services and kicked myself for not thinking to check them. With any luck, that particular scenario will not occur again, since I recorded it in my server diary and added it to the procedure for similar operations.
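One way to make sure the obvious gets checked is to turn the procedure into a checklist that is run after every move or upgrade. The sketch below is a hypothetical illustration in Python: `run_checklist` and the two sample checks are assumptions, and the placeholder lambdas stand in for real probes of the file shares and the mail service.

```python
def run_checklist(checks):
    """Run every check and return the names of those that failed.

    `checks` maps a description to a zero-argument callable that returns
    True on success. A check that raises counts as a failure instead of
    stopping the run, so one broken probe cannot hide the others.
    """
    failed = []
    for name, check in checks.items():
        try:
            ok = bool(check())
        except Exception:
            ok = False
        if not ok:
            failed.append(name)
    return failed


if __name__ == "__main__":
    # Placeholder checks; real ones would probe the share, the mail service, etc.
    checks = {
        "server volumes visible from a desktop": lambda: True,
        "messaging service accepting mail": lambda: False,
    }
    for name in run_checklist(checks):
        print("FAILED:", name)
```

Each new surprise, such as the stalled messaging services, becomes one more entry in the dictionary.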
Have a contingency plan
It's possible that your work may overrun the allotted time, and that you’ll need to retreat from your efforts to let users back on. Be sure that it’s possible to roll back to the original system state. But, before you abandon a job that’s nearly complete, try some on-the-fly negotiations with the user base. They may be happy to stay offline for another hour if it prevents another shutdown in the near future.
It's important to know what the options are and what the point of no return is. Thankfully, I've never gone past it. By employing a strict if-it-ain't-broke-don't-fix-it policy, I've managed to keep things reasonably functional. Any work I wasn’t sure about went onto the test machines for evaluation before I implemented it, and I also created an additional backup. In a pinch, it would have been possible to plug my test server into the live network to replace the main machine.
In any event, you should build a margin of safety into any scheduled downtime slot. If you come in ahead of schedule, your team will feel good about it, and the user base will also think you have done well. If your estimates are too "realistic," you will have to live up to them or risk losing the confidence of both your team and the users.
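As a rough illustration of building in that margin, the slot you request can be derived from your working estimate plus a fixed fraction. This is only a sketch: the function name and the 50 percent default are arbitrary assumptions, not a recommendation from experience.

```python
from datetime import timedelta


def padded_window(estimate, margin=0.5):
    """Return the slot to request: the working estimate plus a safety margin.

    `margin` is a fraction of the estimate; 0.5 is an arbitrary default
    chosen for illustration only.
    """
    if margin < 0:
        raise ValueError("margin must not be negative")
    return estimate + estimate * margin


if __name__ == "__main__":
    # A job estimated at two hours is announced as a three-hour slot.
    print(padded_window(timedelta(hours=2)))
```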
Unplanned downtime
Let's face it. No matter how careful you are, equipment will sometimes malfunction or break. Cables will be damaged, power supplies will fail, hard disks will shuffle off this mortal coil, and processors will burn out. The key is to make sure that any unexpected outages are dealt with in an expedient and professional manner.
The most important thing to do in this event is to communicate. Don't get so deep into mending the problem that you fail to tell the rest of the world what has happened.
After discovering that there’s a problem, your first step is to call the heads of all affected departments. It's wise to have a list of these people prepared in advance. Tell them that you're aware of the problem and are investigating.
Once the line of communication is established, you may choose to maintain it through the help desk. Get the help desk involved—and keep them informed—by giving them regular reports so that they can pass the information on to department heads. There are several good reasons for doing this.
First, customers need to know what is happening. If they don't hear anything, they don't know what is going on—for all they know, you and your team could be in the local pub enjoying yourselves. Let them know what has happened as soon as you can. Then, tell them what you are doing to fix it and provide a rough estimate of how long the fix will take.
Further, proactively continuing the flow of information will reduce the number of calls to the help desk and, when users do call, the help desk staff will have the latest information available to pass along. And, if you keep the affected departments apprised of your progress, they will be able to better plan their work to avoid using the network. They may be able to attend to other tasks if they know they will have an hour or two without the system.
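The three things customers need to hear (what has happened, what you are doing about it, and roughly how long it will take) fit naturally into a fixed template that the help desk can reuse. The Python sketch below is one hypothetical way to do that; the function name and the wording of the message are assumptions.

```python
def status_update(what_happened, action, eta_minutes=None):
    """Build a short outage notice: what happened, what is being done, rough estimate."""
    lines = [
        f"What happened: {what_happened}",
        f"What we are doing: {action}",
    ]
    if eta_minutes is None:
        # No estimate yet: say so rather than leaving users guessing.
        lines.append("Estimated fix time: not yet known; next update to follow.")
    else:
        lines.append(f"Estimated fix time: about {eta_minutes} minutes.")
    return "\n".join(lines)


if __name__ == "__main__":
    print(status_update("The file server is unreachable.",
                        "Replacing a failed power supply.", 45))
```

Sending the first notice without an estimate is still better than silence, since it tells users that the problem is known and being worked on.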
Communicate, communicate, communicate
No matter how you slice it, the way to deal with network downtime is through frequent communication. If possible, let people know in advance that there will be an outage. If you have no warning, keep users up to date on your efforts to fix the problem. Your reward will be fewer complaints and a thankful user base.
Running and administering a network is a constant learning process. By making sure that every incident is logged and acted upon, you can ensure that the next time you perform a task it will be easier and have less impact.