If your organization maintains its own business-critical Web site, documenting and responding to outages is essential. Using a standard form to report outages can serve several purposes that all lead to better network management.
TechRepublic documents its Web site outages using a template designed by mySimon Director of Technology Operations Ken Nishikawa. Senior Web Operations Engineer Kent Langley reports that since adoption of the template and implementation of methodologies relating to the reporting of network issues, TechRepublic has dramatically improved its handling of outages. Communication among the operations staff responsible for maintaining the site has improved as well.
TechRepublic’s experience demonstrates that this approach can help organizations by:
- Improving communication.
- Establishing a reporting structure.
- Documenting issues.
- Establishing accountability.
We have put together a network outage report template you can download and use in your organization.
When TechRepublic became a part of CNET in 2001, it suddenly found itself part of a globally distributed corporation. The complexity of its network increased greatly, along with the challenge of establishing and maintaining solid lines of communication among the organizational entities and individuals responsible for managing the merged network.
Langley said that staff members were sometimes confused about how and to whom they should report problems. Nishikawa stepped in by designing the outage form, which was intended to serve two purposes:
- Provide historical information documenting when incidents occurred
- Establish a standard means of communicating the issues
Both components played important roles in helping TechRepublic improve its management of network issues.
“The historical information ensures that we know to the minute what happened, and as a communication tool, the form also helps keep everyone on the same page,” Langley said.
But he also pointed out that the form itself wasn't a cure-all. The operations staff still had to put the reports into the hands of the appropriate individuals to take action on the issues.
“It didn’t happen overnight,” Langley said. “There was a lot of trial and error before we figured out how to use it effectively.”
Once the staff sorted out the reporting structure, and everyone began following consistent procedures, the form became a useful tool for identifying problems and getting a prompt response from those responsible for correcting them.
Components and process
The outage report is actually distilled from what's called the problem action report (PAR). This is the initial report documenting the details of the outage. The outage report form is an edited version of the PAR that is forwarded to appropriate individuals responsible for the area in which the outage occurred.
Langley said it is often necessary to edit the language of the PAR so that less technically savvy managers and executives can better understand what occurred.
“We end up changing or removing some of the ‘techspeak’ for nontechnical folks. We just want to make sure that what gets reported is clear to everyone involved.”
Langley said the following series of events typically makes up the lifecycle of a PAR:
When an outage occurs, someone in the operations department is paged via the automated monitoring and notification system. Langley said that operations staff members are currently on a three-day rotation for being on call.
In the case of a service outage, where the Web site is actually down, the paged staff member completes a PAR form. If one of the load-balanced servers goes down, Langley said, availability and performance might be affected, but it won't cause an outage. In those instances, a PAR isn't completed.
On the PAR, the individual responding indicates the time at which he or she received the page, the nature of the issue, and the person notified of the outage. This record offers a concrete timeline of notification and response.
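For teams that want to keep PARs in a script or database rather than on a paper form, the fields described above could be modeled along these lines. This is a minimal Python sketch, not the form Langley's team uses: the class name, field names, and sample values are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class ProblemActionReport:
    """Hypothetical PAR record; field names are illustrative only."""
    page_received_at: datetime  # when the on-call responder received the page
    issue: str                  # nature of the issue, stated objectively
    responder: str              # who answered the page
    notified: str               # person notified of the outage

    def timeline_entry(self) -> str:
        """Return a one-line summary for a notification-and-response timeline."""
        stamp = self.page_received_at.strftime("%Y-%m-%d %H:%M")
        return (f"{stamp} | {self.responder} paged | {self.issue} "
                f"| notified: {self.notified}")


# Made-up sample data, used only to show the record in action.
par = ProblemActionReport(
    page_received_at=datetime(2002, 3, 14, 2, 17),
    issue="Web site not responding",
    responder="on-call engineer",
    notified="operations manager",
)
print(par.timeline_entry())
```

Recording the page time as a real timestamp rather than free text is what makes the to-the-minute timeline possible later.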
Langley said this documentation is valuable for a number of reasons. First, it provides key data for reporting purposes.
“Weekly status reports, for example, include data from PARs. We also base our uptime versus downtime percentages on PAR information.”
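The article does not say how the team computes those percentages, so the following is only one plausible approach. This Python sketch assumes each PAR also records when service was restored, and that outages do not overlap and fall inside the reporting period; the function name and sample outages are made up.

```python
from datetime import datetime, timedelta


def uptime_percent(outages, period_start, period_end):
    """Return the percentage of the period during which the site was up.

    outages: list of (start, end) datetime pairs, assumed non-overlapping
    and contained within the reporting period.
    """
    total = (period_end - period_start).total_seconds()
    down = sum((end - start).total_seconds() for start, end in outages)
    return 100.0 * (total - down) / total


# Hypothetical one-week reporting period with two invented outages.
week_start = datetime(2002, 3, 11)
week_end = week_start + timedelta(days=7)
sample_outages = [
    (datetime(2002, 3, 12, 2, 17), datetime(2002, 3, 12, 2, 59)),   # 42 minutes
    (datetime(2002, 3, 14, 15, 0), datetime(2002, 3, 14, 15, 30)),  # 30 minutes
]
pct = uptime_percent(sample_outages, week_start, week_end)
print(f"uptime: {pct:.2f}%, downtime: {100 - pct:.2f}%")
```

In this invented week, 72 minutes of downtime out of 10,080 minutes works out to roughly 99.29 percent uptime.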
The PARs can also serve as an accountability tool by identifying who responded to an issue and when, as well as indicating where the problem occurred and who was notified.
“It ups accountability because it becomes a record of response time to issues,” Langley said.
And since the PARs indicate where the sources of problems actually exist, they help administrators pinpoint problem areas that affect the stability of the network as a whole and take action to correct the issues. Langley said that because the report points out specific areas, it gives those responsible a vested interest in maintaining their portion of the network.
Although Langley believes the report has done a great deal to improve the handling of site-related issues, he warned that incidents must be reported objectively.
“If it becomes an editorial on how you feel about Compaq servers, it’s not going to work.”
When used the right way, however, Langley said the report can be an effective tool that provides data for reports, records of responses to problems, and timelines of when issues occur. These records, coupled with the communication established among the teams responsible for maintaining the network, make the PAR form an invaluable asset.
To gain more control over outage reporting in your organization, download this template, based on the report Langley’s team uses.