When it comes to network design, redundant links provide a level of fault tolerance. If one link fails, it’s good to know that you’ve got a backup link that will carry the traffic. Unfortunately, these very same redundant links can cause difficulties on your network. Whenever there are two Layer 2 paths from a source to a destination, the possibility of a loop exists. In a bridged or switched LAN environment, a bridging loop occurs when multiple paths allow frames to circulate around your network continuously. A Layer 2 frame has no time-to-live field, so nothing stops a looping frame from going around again. This can cause a severe drain on available bandwidth and make your network links intermittently unreliable. It is this intermittent nature that makes the problem difficult to troubleshoot.
In an effort to address bridging loops, the Spanning Tree Protocol (STP) was developed many moons ago. The purpose of STP is to identify redundant Layer 2 paths and block them. Problem solved, right? Not necessarily. Although this protocol is activated by default in some vendors’ switching devices, each vendor has its own variations. And regardless of how stable and mature network vendors tell you their implementation is, problems can still happen. What happens if STP malfunctions? Each time there is a change in topology, STP must converge, or recalculate the tree. The time it takes to complete this process varies with the size of your network. For argument’s sake, we’ll assume it’s about 90 seconds. While STP is converging, no frames are forwarded. Essentially, your network comes to a halt. And if the underlying problem isn’t eliminated, your network can go through a constant cascading process where STP converges over and over again.
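To put a rough floor under that convergence time: classic IEEE 802.1D spanning tree defaults to a max age of 20 seconds and a forward delay of 15 seconds. The sketch below is only arithmetic on those defaults (the function name is made up for illustration). It shows how long a single blocked port takes to start forwarding; the 90-second figure above is a round number for a whole network, where changes can cascade.

```python
# Default IEEE 802.1D timer values, in seconds.
MAX_AGE = 20        # how long a port keeps the last root BPDU before discarding it
FORWARD_DELAY = 15  # time spent in each of the listening and learning states


def seconds_until_forwarding(max_age=MAX_AGE, forward_delay=FORWARD_DELAY,
                             wait_for_max_age=True):
    """Return how long a blocked port takes to start forwarding frames."""
    # The port passes through listening and then learning, one forward
    # delay each, before it forwards frames.
    transition = 2 * forward_delay
    # After an indirect failure the port first waits for the stored BPDU
    # to age out, which adds the max age timer.
    return transition + (max_age if wait_for_max_age else 0)


print(seconds_until_forwarding(wait_for_max_age=False))  # 30
print(seconds_until_forwarding())                        # 50
```

If your timers have been tuned away from the defaults, pass your own values in to see what to expect.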
Here, we’ll focus on what measures we need to take to get the network back into operation as quickly as possible. Keep in mind that some of these techniques may temporarily cause more disruption than the bridging loop itself. If possible, you may want to consider scheduling downtime during off hours to minimize impact on your users. After each step, you should be ready to reverse the last action and fall back to the previous configuration.
First of all, it’s a good idea to keep an up-to-date network diagram close at hand. At a minimum, the diagram should show which network devices are interconnected and on which ports. It’s also a good idea to label the root bridge on your diagram.
If STP does malfunction and a loop occurs in the network, you may be looking at an emergency situation. The immediate and obvious fix is to remove the problem at the source: identify the redundant links and physically disconnect them. Here, we’re assuming that the redundant links were an intentional part of your network design. Although you will now be operating without that redundancy, chances are that STP will converge and your network will stabilize.
This is all well and good in a small, well-documented network, but what can we do in a larger, more complex environment? What if the wiring closet is a nightmare of unlabeled spaghetti? And what if our wiring diagram was last updated in the previous century or, even worse, we have no diagram at all? Fortunately, the Cisco Discovery Protocol (CDP) can tell us what we need to know. The following command lists the Cisco devices that are directly connected to a given switch, along with the ports involved:
show cdp neighbor
By executing this command at each switch, we can quickly build an ad hoc diagram showing which links currently exist between which devices. If you’re lucky, the loop won’t be bad enough to cut off your access to the switches via a Telnet session across the network. If the network is inaccessible, you’ll have to physically access each switch through its console port. Once you’ve identified the redundant link, you can physically disconnect that cable. Or, as long as you can reach that particular switch, whether directly or across the network, you can shut down the port instead.
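Once you have noted the neighbors reported at each switch, spotting the redundant link is a graph problem: any link whose two ends are already connected by other links closes a loop. Here is a rough sketch in Python, assuming you have typed your notes in as (switch, port, neighbor, neighbor port) tuples. The function name, the tuple layout, and the switch names are all made up for illustration; this does not parse real CDP output.

```python
def find_redundant_links(links):
    """Return the links that close a loop, in the order they were listed.

    Each link is a tuple: (switch_a, port_a, switch_b, port_b).
    """
    parent = {}

    def find(switch):
        # Follow parent pointers to the representative of this switch's group.
        parent.setdefault(switch, switch)
        while parent[switch] != switch:
            parent[switch] = parent[parent[switch]]  # path halving
            switch = parent[switch]
        return switch

    seen = set()
    redundant = []
    for switch_a, port_a, switch_b, port_b in links:
        # CDP reports each link from both ends, so normalize and skip repeats.
        key = frozenset([(switch_a, port_a), (switch_b, port_b)])
        if key in seen:
            continue
        seen.add(key)
        root_a, root_b = find(switch_a), find(switch_b)
        if root_a == root_b:
            # Both ends are already connected by earlier links,
            # so this link closes a loop.
            redundant.append((switch_a, port_a, switch_b, port_b))
        else:
            parent[root_a] = root_b
    return redundant


# Hypothetical notes taken from "show cdp neighbor" on three switches.
links = [
    ("core1", "1/1", "dist1", "2/1"),
    ("dist1", "2/1", "core1", "1/1"),  # same link, seen from the other end
    ("core1", "1/2", "dist2", "2/1"),
    ("dist1", "2/2", "dist2", "2/2"),  # closes the core1-dist1-dist2 triangle
]
print(find_redundant_links(links))
```

Which link of a loop gets reported depends on the order of the list, so treat the result as a candidate to disconnect, not the one correct answer. CDP only reports Cisco devices, so a loop that runs through other equipment will not show up here.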
As you progress through the troubleshooting process, pause after each step to see whether the last action produced the anticipated result. On a switch with the set-based command interface, you can view the current status of STP with the following command:
show spantree
If spanning tree is active but still unstable, you will need to continue troubleshooting. At this point, if you are certain that you have removed any redundant paths, you might want to consider physically restarting the devices between which the redundant links were found. This is not a task to be taken lightly. If the switches in question are located at the core of the network, you might want to schedule an after-hours operation. However, if the STP problem and resulting bridging loop are causing such havoc that the network is essentially useless, you may have no other choice. You’ll want to weigh the severity of the problem and act accordingly.
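Whether spanning tree is "still unstable" is easier to judge from a few readings taken a minute or so apart than from a single look. The sketch below is a hypothetical helper in Python. It assumes you copy down the root bridge ID and, if your switch reports one, a topology change counter at each check; the function name and the sample values are illustrative only.

```python
def stp_looks_stable(readings):
    """Judge stability from readings taken a minute or so apart.

    Each reading is (root_bridge_id, topology_change_count), both copied
    by hand from the switch. At least two readings are needed.
    """
    if len(readings) < 2:
        return False
    roots = {root for root, _ in readings}
    counts = {count for _, count in readings}
    # Stable means the same root throughout and no new topology changes.
    return len(roots) == 1 and len(counts) == 1


# Hypothetical readings: the root never changes, but the counter keeps climbing.
flapping = [("00-10-7b-aa-00-01", 41), ("00-10-7b-aa-00-01", 44), ("00-10-7b-aa-00-01", 47)]
settled = [("00-10-7b-aa-00-01", 47), ("00-10-7b-aa-00-01", 47), ("00-10-7b-aa-00-01", 47)]
print(stp_looks_stable(flapping))  # False
print(stp_looks_stable(settled))   # True
```

A changing root or a climbing counter after you have removed the redundant links is the signal to keep troubleshooting.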
If you can’t identify the STP problem in short order and spanning tree has not yet stabilized, you may want to consider disabling STP until you can schedule some after-hours maintenance time. I realize that this may not be an option in some larger, more complex networks, but if STP continues to cascade outages through your network even after you have removed all redundant paths, it quickly becomes a viable option. Be certain that every redundant link really is gone first, because with STP disabled nothing will block a loop. To disable STP, perform the following on all switches involved:
set spantree disable all
If it works, the network will return to a production operating state at this point. If the problem does lie with STP, this gives you breathing room to research the problem further and formulate a plan to reimplement spanning tree and link redundancy on your network. In the next part of this series, we’ll delve a little deeper into techniques that will help us avoid, as well as resolve, STP problems and the network-crippling loops that can result.
The command examples given here assume Cisco Catalyst switches running the set-based command interface (CatOS). You’ll want to check your documentation for the commands specific to your switch model.