In the computing world, it’s imperative to build some level of fault tolerance into the systems we design. We often hear about redundant power supplies in servers, RAID arrays for fault-tolerant storage of data, and even clusters of servers to provide for disastrous contingencies. However, another type of fault tolerance that we don’t hear about nearly as often is network redundancy. That may be because of the cost of redundant network devices or a lack of perceived need. But what better place to build in fault tolerance than at the network level? The network is the conduit through which all of our data passes. Server redundancy may guarantee data integrity and continuity of service in the case of component failure, but this is all for naught if the network by which we access our servers fails. So, in terms of network fault tolerance, how and where do we implement it? In this Daily Feature, I’ll show you how to add network redundancy by configuring multiple routers/switches with virtual router addresses. To do this, I’m going to enlist the help of Cisco’s Hot Standby Router Protocol (HSRP), which virtualizes the router address.
Forms of network fault tolerance
Let’s look at redundancy in somewhat of a vacuum for a moment, narrowing the focus of the discussion to a network scenario with a single switch/router at the center of the LAN. It’s not a large network, so as often happens, the core, distribution, and access layers are consolidated into one device, for instance, a Cisco Catalyst switch with a layer 3 routing module installed. All the clients and servers currently connect to this switch, where they are divided into three VLANs. The VLANs are routed by the layer 3 module. Internal routing must take place for the VLANs, as the clients and servers are located on different VLANs. Therefore, a single point of failure for this network scenario is the layer 3 module in the switch. If it fails, I’m incommunicado. For that matter, I’m not any better off if the switch itself fails. In an effort to reduce downtime, I need to consider these risks.
To ensure layer 3 redundancy, I could install another routing device in the network. But that still affords no redundancy for the switch at layer 2. A better option is to add another switch/router just like the first. In this manner, I can provide maximum network redundancy so that neither a switch nor a router failure would cripple the network completely.
Now that I have the equipment in place, how do I ensure that failover works from one device to the other? First of all, I’ll want to define all VLANs on both switches and a layer 3 interface for each VLAN on the routing module of each switch. Then, the two devices must be connected. In this way, if either switch failed, only the clients directly connected to that switch would be adversely affected. If either router module failed, the other would still be available to facilitate network routing.
How this failover works
Now that I’ve laid the groundwork by installing redundant equipment, let’s take a look at how this failover process might work. To do so, let’s say that one of the router modules fails. Assuming the workstations (and servers) connected to that router/switch use that router’s address as their default gateway, what will happen when it fails? Naturally, none of those workstations will be able to access the server resources on another VLAN. Sessions will time out at the workstation, and error messages will pop up all over the LAN. But I have a redundant router module in place. Why didn’t it fail over?
In most cases, we can view network fault tolerance from either the workstation or the network level. That is to say, where does the fault-tolerant process take place? In this scenario, the workstation needs some way of switching default gateways to the one that is still active. And it’ll need to do so quickly or applications being accessed may time out. Some versions of Windows do provide for secondary default gateway settings, but how long will it take to switch to the secondary?
Another thing to consider is the routing protocol running on the router modules. For instance, if I were running the OSPF routing protocol on the routers, then one router would know very quickly when another router failed. But how do I convey this routing information to the workstation? One option is to actually run the routing protocol on the workstation itself so that it can participate in the routing process. Unfortunately, most network client software doesn’t offer or support routing protocols that would allow this level of failover functionality. So rather than place the burden of redundancy on many workstations (that would all have to be configured correctly), wouldn’t it be better to embed this type of intelligence into the network itself? The answer is yes. And Cisco provides just such a feature with HSRP.
HSRP works by virtualizing the router address. You may have several physical routers involved in the HSRP process, but only one at a time actively routes for the virtual address. Workstations use the virtual address as their default gateway, and the active router receives all traffic sent to it. If the active router fails, the standby router takes over the routing function for the virtual address. The advantage of HSRP over other methods of fault tolerance is that failover happens within seconds and requires no change on the workstations. HSRP uses standby groups to achieve this. All routers in a particular standby group are potential candidates for taking over the active routing role. Each router in a standby group exchanges hello messages with the other routers in its group to determine when a failure occurs. When the hellos from the active router stop, HSRP switches to the router with the next-highest priority in the group for routing functions for the virtual address. HSRP is configured per interface rather than per router, so in a VLAN or multiple-subnet environment, we can create an HSRP standby group for each local subnet.
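To make the selection logic concrete, here is a minimal Python sketch of how one standby group might pick its active router. It is a simplified model, not Cisco's implementation: the Router class and the elect_active function are names I made up for illustration. It assumes the highest priority wins, that a tie is broken by the highest interface address, and that a recovered router only takes the role back if preemption is enabled.

```python
from dataclasses import dataclass
from ipaddress import IPv4Address
from typing import Optional, Sequence


@dataclass
class Router:
    """Hypothetical stand-in for one router interface in a standby group."""
    name: str
    ip: IPv4Address        # real interface address on the subnet
    priority: int = 100    # 100 is the HSRP default priority
    preempt: bool = False
    up: bool = True


def elect_active(group: Sequence[Router],
                 current: Optional[Router] = None) -> Optional[Router]:
    """Return the router that should be active for the group (simplified)."""
    candidates = [r for r in group if r.up]
    if not candidates:
        return None
    # Highest priority wins; the highest interface address breaks a tie.
    best = max(candidates, key=lambda r: (r.priority, r.ip))
    # A healthy active router keeps its role unless the better
    # candidate is allowed to preempt it.
    if (current is not None and current.up
            and current is not best and not best.preempt):
        return current
    return best


r1 = Router("router1", IPv4Address("172.16.1.2"), priority=125, preempt=True)
r2 = Router("router2", IPv4Address("172.16.1.3"), priority=100, preempt=True)
group10 = [r1, r2]

active = elect_active(group10)
print(active.name)                      # router1

r1.up = False                           # router 1 fails
active = elect_active(group10, active)
print(active.name)                      # router2

r1.up = True                            # router 1 recovers and preempts
active = elect_active(group10, active)
print(active.name)                      # router1
```

The workstations never see any of this. They keep sending traffic to the virtual address, and whichever router is currently active answers for it.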
The following is a partial configuration based on the aforementioned switch/router scenario. It illustrates a VLAN-based implementation on Cisco Catalyst series (CatOS) switches with internal layer 3 routing modules that run version 12.x of IOS.
! Router module 1
! Interface names (Vlan10, Vlan20, Vlan30) are assumed for illustration
interface Vlan10
 ip address 172.16.1.2 255.255.255.0
 standby 10 ip 172.16.1.1
 standby 10 priority 125 preempt
interface Vlan20
 ip address 172.16.2.2 255.255.255.0
 standby 20 ip 172.16.2.1
 standby 20 priority 125 preempt
interface Vlan30
 ip address 172.16.3.2 255.255.255.0
 standby 30 ip 172.16.3.1
 standby 30 priority 125 preempt
! Router module 2
! Interface names (Vlan10, Vlan20, Vlan30) are assumed for illustration
interface Vlan10
 ip address 172.16.1.3 255.255.255.0
 standby 10 ip 172.16.1.1
 standby 10 priority 100 preempt
interface Vlan20
 ip address 172.16.2.3 255.255.255.0
 standby 20 ip 172.16.2.1
 standby 20 priority 100 preempt
interface Vlan30
 ip address 172.16.3.3 255.255.255.0
 standby 30 ip 172.16.3.1
 standby 30 priority 100 preempt
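Here, each VLAN has been configured with its own standby group. I assigned the VLAN interface an address and then designated the HSRP virtual address for that particular VLAN. Then, I set the priority, which controls the failover order for the standby group; the preempt keyword lets the higher-priority router reclaim the active role once it recovers. Notice how the addressing and numbering somewhat coincide between VLANs, interfaces, and standby groups. This is not required but makes the configuration and network design easier to follow. In this configuration, router 1 is the active router for all VLANs, as its priority is higher. If it fails, HSRP will cause routing to fail over to router 2 within a few seconds (about 10 with the default timers, which can be shortened with the standby timers command). The workstations need no changes as long as their default gateway is the virtual address for their VLAN (172.16.1.1, 172.16.2.1, or 172.16.3.1).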
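How long does the failover actually take? The following Python sketch gives a rough estimate under a deliberately simple model: the active router sends a hello every hello interval starting at time 0, and the standby router takes over once the hold time has passed since the last hello it received. The function name outage_seconds is my own, and the model ignores network delay, the time the new active router needs to start forwarding, and anything happening at layer 2. The defaults of 3 and 10 seconds are the standard HSRP hello and hold timers.

```python
import math


def outage_seconds(fail_time: float,
                   hello: float = 3.0,
                   hold: float = 10.0) -> float:
    """Estimate seconds between a failure and the standby taking over.

    Assumes hellos were sent at 0, hello, 2 * hello, ... and that a
    hello due exactly at fail_time was still sent.
    """
    # Time of the last hello the active router managed to send.
    last_hello = math.floor(fail_time / hello) * hello
    # The standby's hold timer runs from that last hello.
    return last_hello + hold - fail_time


print(outage_seconds(3.0))   # 10.0 (failed right after sending a hello)
print(outage_seconds(4.0))   # 9.0
print(outage_seconds(5.0))   # 8.0
```

With the default timers, the estimate always falls between 7 and 10 seconds, depending on where in the hello cycle the failure lands. Shorter timers narrow that window at the cost of more hello traffic.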
Granted, this is only one way to skin the network redundancy cat, but it’s a simple and productive method of providing fault tolerance. RAID and tape backup may provide that type of protection for your servers, but if you’re looking for a very inexpensive way to add a bit of redundancy to your network, HSRP is the way to go.