
How one admin recovered from a router disaster

Managing routers and wide area network links can be among the most challenging and high-pressure duties in IT. In this installment of From the Trenches, we'll learn valuable lessons from the way one administrator approached a router disaster.

When routers and wide area network links go down, entire departments and branch offices can be cut off from mission-critical systems. That makes managing WAN circuits and equipment one of the most stressful and demanding duties in information technology. Here, we extract some important lessons from how one administrator dealt with a router disaster.

You can learn quite a bit by reading about the methods other network administrators use to resolve challenging technology issues. Our hope is that this column will provide you with valuable techniques for honing your problem-solving skills. If you have an experience that would be a good candidate for a future From the Trenches column, please e-mail us. All administrators and their companies remain anonymous in this column so that no sensitive network information is revealed.

In this article, we’re going to follow Susan, a systems administrator who manages routing and switching, WAN links, IP subnetting, and network equipment for a midsize commercial company. As we see how Susan approaches and mitigates the router disaster, look for these key points:
  • What steps does she take to verify the problem? What are the initial methods she uses to attempt to resolve the issue?
  • At what point does she contact the vendor about the problem? What kind of information does she prepare for the vendor before calling?
  • How does the way she’s forced to fix the problem on the live network compare with how she usually prefers to implement these kinds of changes?
  • Once the problematic variable has been found and resolved, what does she do to ensure that the issue does not affect the same router or other routers in the future?

The router problem surfaces
At 4:30 P.M. on a Tuesday, Susan received a call from an employee at a remote office who was having problems accessing information on the corporate network. From the corporate headquarters, Susan used Telnet to get into the Cisco VPN router that connects the corporate office to the remote office in question.

She tried to ping the remote office but got no response. From the Cisco IOS command line, she went into the interface for the router’s link to the remote office, turned the interface off, and then turned it back on. (She used the Cisco “shut” and “no shut” commands, the abbreviated forms of “shutdown” and “no shutdown.”) Immediately, some traffic started to go through, so she sent 100 pings to the remote office router to get a better measure of what was happening. However, she received responses to only 20 of the 100 pings. “It was very, very sporadic,” Susan said.
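
To put a number on “very sporadic,” the ping result can be turned into a loss rate. The short Python sketch below is purely illustrative: the function name is hypothetical, and the only figures taken from the story are the 100 pings sent and the 20 replies received.

```python
def packet_loss_percent(sent, received):
    """Return the percentage of pings that got no reply."""
    if sent <= 0:
        raise ValueError("sent must be positive")
    if not 0 <= received <= sent:
        raise ValueError("received must be between 0 and sent")
    return 100.0 * (sent - received) / sent


# Figures from Susan's test: 100 pings sent, 20 replies received.
loss = packet_loss_percent(100, 20)
print(f"{loss:.0f}% packet loss")  # prints "80% packet loss"
```

A loss rate that high on a link that still passes some traffic is consistent with Susan's description of an intermittent rather than a hard-down failure.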

She checked for error messages on the console and looked through the router logs but couldn’t find anything that indicated a problem. Since some packets were making it across the WAN link, she knew that the link itself must be okay. Therefore, she suspected that the problem was with the IPSec encryption between the two routers since they were running a VPN connection over the WAN link. She ran some debugging commands (“debug crypto ipsec” and “debug crypto isakmp”) but didn’t get any more information to help her resolve the issue. She even attempted to reload the router, but that did not fix the problem.

The router had experienced a similar problem the week before. “We had lost connectivity and it was like the encryption had just shut down, but there were no error messages,” Susan said. On that occasion, the router had completely locked up, and Susan was forced to power it off and bring it back up with a hard boot. Then, she said, “It reestablished everything and was a happy camper.”

Nevertheless, that earlier lockup made Susan suspicious, so she went to Cisco’s Web site and downloaded an upgraded version of the Cisco IOS for the router. As fate would have it, she had planned to load the new version of the IOS at 11:00 P.M. on the very day the router failed again and broke the connection.

Getting tech support from the vendor
Since a reliable VPN connection could not be reestablished between the corporate router and the remote router, Susan decided to call Cisco technical support. She suspected that there was a problem with the router hardware, the version of the IOS, or with the router’s configuration.

This was the second time the issue had occurred on the same two routers, so Susan felt confident that it needed to be addressed by Cisco, “because at that point, you know something’s wrong,” she said.

Before calling Cisco, Susan captured all of the information on the configuration, the hardware, and the buffers for the corporate VPN router. (She issued the “show tech-support” command.) After capturing this output to a text file that she could e-mail to a technician, she contacted Cisco and opened a case.

“Cisco called back and said, ‘We want you to upgrade to this IOS.’ Apparently, a couple days before, they put out a field notice that the IOS version I was running [on the corporate VPN router] was buggy and needed upgrading,” Susan recalled.

Cisco also recommended that Susan make a configuration change to the transform sets that manage encryption between the routers. The transform sets are deployed in a hub-and-spoke arrangement, with the corporate VPN router being the hub and all of the branch office VPN routers being spokes that connect to it.

Susan had created separate transform sets between the corporate hub router and each of the individual remote office routers. However, Cisco said that she needed to use separate transform sets only if different encryption types were running between the different routers. Since the routers were all utilizing the same encryption type, Cisco said that only one transform set was needed.
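
Cisco's rule of thumb (one transform set per distinct encryption type, not one per remote router) can be sketched in a few lines of Python. Everything specific here is an assumption made for illustration: the branch names, the esp-3des and esp-sha-hmac transforms, and the TS prefix are hypothetical, since the story does not say which encryption Susan's routers used. The generated lines follow the general shape of the IOS “crypto ipsec transform-set” command, but they should be checked against the documentation for your IOS release before use.

```python
from collections import defaultdict

# Hypothetical inventory: spoke router name -> transforms it uses.
# Names and transform choices are invented for this example.
SPOKES = {
    "branch-a": ("esp-3des", "esp-sha-hmac"),
    "branch-b": ("esp-3des", "esp-sha-hmac"),
    "branch-c": ("esp-3des", "esp-sha-hmac"),
}


def group_by_transforms(spokes):
    """Group spoke names by the transform combination they use."""
    groups = defaultdict(list)
    for name, transforms in spokes.items():
        groups[transforms].append(name)
    return dict(groups)


def transform_set_lines(groups, prefix="TS"):
    """Emit one IOS-style transform-set line per distinct combination."""
    lines = []
    for index, transforms in enumerate(sorted(groups), start=1):
        lines.append(
            f"crypto ipsec transform-set {prefix}{index} {' '.join(transforms)}"
        )
    return lines


groups = group_by_transforms(SPOKES)
for line in transform_set_lines(groups):
    print(line)
```

Because all three hypothetical spokes share the same transforms, the sketch produces a single transform-set line, which mirrors Cisco's advice that only one transform set was needed.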

Resolving the issue
Once Susan got the recommendations from Cisco, she set about trying to resolve the problem. She started with upgrading the version of the Cisco IOS. “I put the new IOS on, but it was still doing the same thing,” she said. When that didn’t resolve the issue, Susan knew her next step would be much more drastic.

To make the configuration change to the transform sets that Cisco support recommended, Susan would have to rebuild the configuration on the corporate hub router and all of the remote office routers. At that point, the one branch office was still having a network-down emergency so she had to make the change right away. Nevertheless, all of the other branch offices and the corporate office had users on the network who were working fast and furious on end-of-the-day deadlines. She would have to make the configuration changes while all of the routers were live.

It took Susan two hours to go into each router and rebuild the configurations with the changes to the transform set. Much to her relief, this resolved the issue for the branch office that had connectivity problems.

However, making this kind of configuration change on live routers was less than ideal. One mistake in syntax or one wrong command could have knocked a lot of users off the network during a busy time. “I was trying to do it with minimal effect and minimal downtime,” Susan said.

“Normally, I would build the new configuration offline and then load it on the router,” she added. “Ideally, you would have some kind of test environment where you can check the configuration and then propagate it when you know it works.”
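
One way to approximate the offline workflow Susan describes is to compare a candidate configuration against a saved copy of the running configuration before loading anything. The Python sketch below uses the standard difflib module for that comparison. The configuration fragments and transform-set names are hypothetical, and a text diff of this kind only shows what would change; it does not validate syntax or replace a real test environment.

```python
import difflib


def config_diff(running, candidate):
    """Return unified-diff lines between the running and candidate configs."""
    return list(
        difflib.unified_diff(
            running.splitlines(),
            candidate.splitlines(),
            fromfile="running-config",
            tofile="candidate-config",
            lineterm="",
        )
    )


# Hypothetical config fragments; names and transforms are illustrative only.
running = """\
crypto ipsec transform-set BRANCH-A esp-3des esp-sha-hmac
crypto ipsec transform-set BRANCH-B esp-3des esp-sha-hmac
"""
candidate = """\
crypto ipsec transform-set SHARED esp-3des esp-sha-hmac
"""

for line in config_diff(running, candidate):
    print(line)
```

Lines prefixed with a minus sign would be removed and lines prefixed with a plus sign would be added, which gives a reviewer a quick look at the scope of a change before it touches a live router.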

Once Susan resolved the issue, she didn’t simply close the book on it and move on. She considered what could have caused the VPN connection to fail to the point that the remote office lost access to the corporate office network. She tried to think of any recent changes that could have instigated the problem.

Since she knew that making the change to the transform set was what fixed the problem, she focused on that issue and recalled that the day before the problem occurred, she had added another transform set for an additional remote office. “That may have been the straw that broke the camel’s back,” she concluded.

Susan left the case open with Cisco and plans to check back with the company to see if there’s a maximum number of transform sets for the routers she works with. That way, if she needs to add more transform sets in the future, she can work to avoid the same problem.

Insights gained
We can learn several key lessons from Susan’s router recovery. First, she analyzed the problem fully before taking any actions to fix it. Then, when she determined that she needed to call the vendor, she gathered all the information she could on the issue and was well prepared to answer any questions that a technician might have. When she was forced to make configuration changes to live network equipment, she was very careful to minimize the effect on users and to avoid any syntax errors in her commands. Once she had the problem resolved, she did not simply move on from there. She looked back and tried to establish the root cause of the issue so that she could preempt any future problems.

Insight can also be gained from something that Susan could not do in this case but would have preferred to do. As she mentioned, the kind of configuration changes she had to make to her routers would have been best done in a test environment first. Then, the configuration could have been verified and optimized on testing equipment and loaded to the live routers with confidence and quality assurance. Clearly, this is the preferred method when you don’t have a network-down emergency.