Architecting around domains of failure

When high availability (HA) and disaster recovery (DR) do not adequately address your architecture requirements, where do you go from there? IT pro Rick Vanover defines "domains of failure" for architecting solutions.

When designing solutions of almost any type, we frequently plan for levels of redundancy that go beyond hardware and software. High availability (HA) and disaster recovery (DR) describe two such levels of protection. HA generally focuses on a single host, server, application, or service and is delivered through various tools. DR reaches further and protects a site or an entire location.

To describe protection levels where HA is too narrow and DR is too broad, I've recently started using the term “domains of failure.” A domain of failure is a bounded part of the IT infrastructure, such as a rack or a switch, whose loss the design is built to tolerate; naming it makes clear what is protected and to what extent.

One frequent example is to define a rack as the domain of failure. To accommodate a one-rack domain of failure, place the servers in a cluster across multiple racks so that losing any single rack still leaves cluster members running elsewhere. This is easy to do with multi-node clusters, such as VMware VI3 or vSphere, and with traditional dual-node Windows failover clusters.
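As a rough illustration of the rack example, the check below asks whether a cluster would still have enough members after losing its most heavily loaded rack. It is a minimal sketch, not a feature of any clustering product; the function name, the host names, and the rack labels are all hypothetical.

```python
from collections import Counter


def survives_rack_loss(node_to_rack, min_survivors=1):
    """Return True if losing any single rack still leaves min_survivors nodes."""
    if not node_to_rack:
        return False
    per_rack = Counter(node_to_rack.values())
    # The worst case is losing the rack that holds the most cluster nodes.
    return len(node_to_rack) - max(per_rack.values()) >= min_survivors


# Hypothetical four-node cluster spread evenly over two racks.
cluster = {
    "esx01": "rack-a",
    "esx02": "rack-a",
    "esx03": "rack-b",
    "esx04": "rack-b",
}

print(survives_rack_loss(cluster))     # two nodes remain after any rack loss
print(survives_rack_loss(cluster, 3))  # but never three
```

Raising `min_survivors` is one way to express that the surviving nodes must also have enough capacity to carry the workload, which the sketch does not otherwise model.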

Another popular domain of failure is switch connectivity. If a single switch fails, servers with multiple network interfaces connected to different switches (either different switch chassis or different switch blades), yet on the same network, stay connected. They accommodate a domain of failure of one network switch.
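The same idea can be audited for switch connectivity. The sketch below lists, for each switch, the servers that would lose all connectivity if that one switch failed. The data layout and every name in it are assumptions for illustration; whether a "switch" here means a chassis or a blade depends on which boundary you choose to protect against.

```python
def single_switch_exposures(uplinks):
    """Map each switch to the servers that depend on it alone."""
    exposures = {}
    for server, switches in uplinks.items():
        distinct = set(switches)
        # A server is exposed when all of its interfaces land on one switch.
        if len(distinct) == 1:
            exposures.setdefault(next(iter(distinct)), []).append(server)
    return exposures


# Hypothetical inventory: server name -> switch behind each network interface.
uplinks = {
    "web01": ["sw1", "sw2"],
    "web02": ["sw1", "sw1"],  # two interfaces, but both on the same switch
    "db01": ["sw2"],
}

print(single_switch_exposures(uplinks))
```

Note that `web02` has redundant interfaces yet is still exposed, which is exactly the gap a domain-of-failure review is meant to surface.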

Power is another area where we can accommodate a domain of failure, provided we trace each feed between the rack and its source. In my practice, when providing power to a rack, I deliver it from two different sources in the data center, and each source can run the entire rack on its own. This protects against a power distribution unit (PDU) in the rack failing or becoming disconnected from its source, but what if both PDUs get their power from the same source? Arranging the power so that the primary power source is itself an accommodated domain of failure adds extra protection.
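To make the "same source" question concrete, the sketch below follows each PDU's feed upstream and reports any element that both PDUs depend on. The topology, the component names, and the `fed_by` mapping are hypothetical; a real data center would need its own inventory as input.

```python
def upstream_chain(component, fed_by):
    """Follow a component's power feed upstream and return each element on the path."""
    chain = []
    seen = set()
    # The seen set guards against a mis-entered inventory that loops back on itself.
    while component in fed_by and component not in seen:
        seen.add(component)
        component = fed_by[component]
        chain.append(component)
    return chain


def shared_upstream(pdu_a, pdu_b, fed_by):
    """Return the upstream elements both PDUs depend on (empty means independent)."""
    return set(upstream_chain(pdu_a, fed_by)) & set(upstream_chain(pdu_b, fed_by))


# Hypothetical rack: two PDUs on separate panels that share one UPS.
fed_by = {
    "pdu-a": "panel-1",
    "pdu-b": "panel-2",
    "panel-1": "ups-1",
    "panel-2": "ups-1",
    "ups-1": "utility-feed",
}

print(sorted(shared_upstream("pdu-a", "pdu-b", fed_by)))
```

In this made-up layout the two PDUs look independent at the panel level but converge at the UPS, so the rack accommodates a PDU or panel failure and not a failure of the primary source.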

Determining boundaries for domains of failure can be very complex, and everyone’s requirements will vary. How do you address domains of failure in networking, power, and equipment? Share your comments below.