Avoid the most common culprits for single points of failure on small to midsize networks

Derek Schauland shares a recent hiccup on his network and how it spurred him to revisit areas where there could be single points of failure. Here are some common culprits to address in your business continuity planning.

In my organization — a small office — we use Active Directory. Until recently, the environment consisted of one local and one remote site with one Domain Controller (DC) each, providing service for about 65 users total and serving up everything from file and print services for both sites to e-mail. The remote location has about five users, and everyone is close by, so the single domain controller works quite well there.

Here at the corporate office, the remaining population of about 60 users is connected to a single domain controller. This setup faithfully plugged along, handling authentication and all the directory services we could ask for — until this week.

One morning recently when I arrived at the office, there were several users ready to let me know that they didn't have access to the services they needed.

My initial investigation revealed that DNS was nonfunctional. The DC itself was also very sluggish and seemed like it might need a restart. I restarted it at first only to see whether the kinks would go away and let me dig into the issue further, but when the system came back up, everything recovered and Active Directory traffic flowed again. People were able to log in, and drive mappings started working again. Because Active Directory depends so heavily on DNS, when DNS went down, everything else went with it.
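Since a DNS failure took everything else down with it, a periodic name-resolution check is one way to spot this kind of problem before users do. The following is a minimal Python sketch and not something that was running during this incident; the host names in the usage comment are hypothetical placeholders, and the `resolve` parameter exists only so the function can be tried without touching the network.

```python
import socket


def check_dns(names, resolve=socket.getaddrinfo):
    """Map each name to True if it resolves, False otherwise.

    `resolve` defaults to the system resolver; it is a parameter only so
    the function can be exercised without a live DNS server.
    """
    results = {}
    for name in names:
        try:
            resolve(name, None)
            results[name] = True
        except OSError:  # socket.gaierror is a subclass of OSError
            results[name] = False
    return results


def failed_names(results):
    """Return the names that did not resolve, sorted for stable output."""
    return sorted(name for name, ok in results.items() if not ok)


# Hypothetical usage; replace the placeholder names with your own:
# results = check_dns(["corp.example.local", "dc1.corp.example.local"])
# print(failed_names(results))
```

Run from a scheduled task on a workstation, a check like this only tells you that resolution failed, not why, but that is usually enough to point the investigation at DNS first.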

Because we are a small shop by most standards, I was aware of the single point of failure, but it didn't really seem like it could be a major problem. After all, we have a domain controller at the remote site, and that should be quite sufficient. That would be true if the link leaving the corporate office were faster, but sending replication traffic plus the additional logon requests over the WAN would have been a nightmare.

The restart got everything back online as quickly as possible, but I wasn't comfortable knowing that under any heavy load the issue could easily come back and take the organization offline. At first I thought about ordering a new server to set up another DC, but even though servers are cheap, they're not free and don't materialize on request, so I started to take stock of the other servers running in our environment.

One of these boxes used to run all kinds of things for the Web, but we moved those sites out to a host in the cloud to speed up access to them. Doing this left a server with a good amount of horsepower and not much work to do, making it a perfect candidate for our next DC.

Better performance with more infrastructure

Now that Active Directory runs on two domain controllers at our main site and both of them host the integrated DNS zone for our organization, the likelihood of a complete outage has diminished. Authentication for all users at the main site has also improved, as has access to resources here and on the Internet.

Outside of AD, I use Desktop Authority from ScriptLogic to manage the user environment, providing a one-stop place for printer and drive management and things of that nature. Since I was adding another DC to the directory, I also installed the Desktop Authority services there so that everything that normally processes during logon would still run no matter which DC handled the logon.

In addition to getting another DNS server/DC running on the network, I also made the new DC a Global Catalog server. This should allow AD to keep functioning if one of the DCs here were to go down.
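One way to confirm that the second DC really does cover DNS, directory lookups, and the Global Catalog is to probe the standard TCP ports for each service on both machines. This is a rough Python sketch under stated assumptions: the ports are the well-known defaults (53 for DNS, 389 for LDAP, 3268 for the Global Catalog), an open port is treated as "service available" (which is only an approximation), and the host names in the usage comment are hypothetical.

```python
import socket

# Well-known default TCP ports; adjust if your environment differs.
AD_PORTS = {"dns": 53, "ldap": 389, "global_catalog": 3268}


def tcp_probe(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def responders(hosts, probe=tcp_probe):
    """For each service, list the hosts that answered on its port."""
    return {
        service: [host for host in hosts if probe(host, port)]
        for service, port in AD_PORTS.items()
    }


def at_risk(report, minimum=2):
    """Return services answered by fewer than `minimum` hosts."""
    return sorted(s for s, hosts in report.items() if len(hosts) < minimum)


# Hypothetical usage; replace the placeholder names with your own DCs:
# print(at_risk(responders(["dc1.corp.example.local", "dc2.corp.example.local"])))
```

An empty result would suggest each service has at least two hosts answering; any service that shows up in the list is still a single point of failure.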

Network areas that need particular attention

In many Windows environments, Active Directory plays a starring role, and configuration missteps or a failure to plan for enough resources can bring things crashing to a halt. But there are other areas, even on a small or midsize network, that can become single points of failure if you aren't careful. Here are a few to watch out for:

Network Switches: Depending on the user count in an organization, keeping spare switches online might not be feasible; however, it is recommended to keep a couple spare switches around in case something happens to cause a failure.

Tape Drives: Backup and recovery is fundamental in the IT world; without a good (and regularly tested) backup, the data in an environment is only as good as the weakest link. In my organization, I have two tape drives. We are small enough that one tape covers all the backup jobs, but in the event that one drive goes down, I do not need to worry about not being able to restore from a previous backup if there is a catastrophic event.

Network Interface Cards (NICs): Most servers today ship with multiple NICs, which is good for both improved connectivity when using both and failover if one of the cards in a server (or other box) fails.

Internet Connections: As dependent as society is on the Internet, having redundant connections, depending on the size of an organization and its business model, may be a key component in preventing a single point of failure. Smaller businesses outside of the technology industry may not be able to justify the cost of keeping a connection with two providers active, but it couldn't hurt to have a contact at multiple providers and possibly discuss what you would need to get up and running if your main provider were down.

The list I provided here is not all-inclusive, but for most organizations these are things that should be considered when planning for the worst. Planning for redundancy will always seem like overkill to some people when things are working normally, but not planning for components to fail will surely result in those same people looking to you when there's unexpected downtime.
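The areas listed above can also be tracked as a simple inventory and checked for anything that exists only once. The sketch below is illustrative Python only: the category names mirror the list in this article, while the device names and the threshold of two are hypothetical choices, not a description of my actual network.

```python
# Categories taken from the areas discussed in this article.
CATEGORIES = (
    "domain controller",
    "network switch",
    "tape drive",
    "server NIC",
    "internet connection",
)


def single_points_of_failure(inventory, minimum=2):
    """Return categories with fewer than `minimum` items on hand.

    `inventory` maps a category name to a list of device names.
    A category missing from the inventory counts as having none.
    """
    return [c for c in CATEGORIES if len(inventory.get(c, [])) < minimum]


# Hypothetical inventory for illustration only.
example_inventory = {
    "domain controller": ["dc1", "dc2"],
    "network switch": ["sw1"],
    "tape drive": ["tape1", "tape2"],
    "server NIC": ["nic1", "nic2"],
    "internet connection": ["isp-a"],
}

print(single_points_of_failure(example_inventory))
# prints ['network switch', 'internet connection']
```

Whether every flagged item deserves a spare is a business decision, as with the second Internet connection, but a list like this at least makes the gaps visible.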

Lessons learned

This ordeal was a major one for our organization, even though it was cleaned up and corrected fairly quickly. I am glad I caught this when I did, but I will admit I wish I had added the second domain controller before the outage. Doing so would likely have prevented this issue. Working in a one-man IT shop makes some of the tasks that need to get accomplished more difficult or more likely to be postponed while you're putting out other fires. But the consequences of not planning for contingencies will always cost more than the time it takes to address single points of failure on your network.

About

Derek Schauland has been tinkering with Windows systems since 1997. He has supported Windows NT 4, worked phone support for an ISP, and is currently the IT Manager for a manufacturing company in Wisconsin.
