How to consider service management in high availability design

A common omission in designing a highly available infrastructure is tying the configuration back to service management. Keith Townsend provides some service management recommendations.


Enterprise architects and engineers spend a great deal of time ensuring each sub-system is highly redundant. If every sub-system is redundant, the entire infrastructure stands a better chance of surviving a component failure.

Highly redundant infrastructure is key to meeting high-availability application requirements. However, inadequate attention to service management is often the source of failure in these highly available designs.

Server high availability options

While storage and network availability are just as important, server availability is a useful example of where service management can fail. For all the talk of software-defined infrastructure, the VM is still the building block of existing applications, and developers and application owners still deal with server availability. I'll therefore use server availability to examine service management's impact on high availability; keep in mind that the same concerns apply to any data center sub-system.

A standard approach to high availability is operating system clustering, which lets operations teams migrate processes from one OS instance to another with little to no disruption. Clustering provides a higher level of availability than features that simply restart OS instances. VMware's HA is an example of a feature that restarts servers rather than migrating processes: in the event of a hardware failure, vSphere HA restarts the VM on another physical host within the cluster.
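The practical difference between the two approaches is the user-visible outage during a host failure. A toy model, with purely illustrative timing assumptions, makes the contrast concrete:

```python
# Toy model: restart-style recovery (vSphere-HA-style) boots the VM fresh
# on a surviving host, while clustering fails the processes over to an
# already-running standby OS instance. The numbers below are assumptions
# for illustration, not measurements of any specific product.

RESTART_SECONDS = 180   # assumed OS boot + application start time
FAILOVER_SECONDS = 5    # assumed cluster resource failover time

def downtime(strategy: str) -> int:
    """Approximate user-visible downtime for a host failure, in seconds."""
    return RESTART_SECONDS if strategy == "restart" else FAILOVER_SECONDS

print(downtime("restart"))   # VM rebooted on another physical host
print(downtime("failover"))  # processes moved to the standby node
```

The point of the sketch is only that both strategies recover the workload automatically; clustering shortens the outage because a second OS instance is already running.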

Protection from OS and hardware failure isn't the only advantage of clustering. Both VMware's HA and traditional OS clustering allow for greater operational flexibility. In the case of VMware vSphere, operations teams can evacuate a physical host to perform hardware patching and firmware updates without downtime. In the case of OS clustering, operating system patching results in no application downtime.

The service management gap

Vendor technology covers the technology side of high availability; the challenge comes when that technology meets governance and processes. A simple example is identifying the active node of a clustered system. In a solution that leverages a standby node, operations teams can apply updates, maintenance, and system scans to the standby node. To perform these actions non-disruptively, the group performing maintenance must be able to identify which node is the standby.

In smaller organizations, identifying the active node isn't a challenge, because the team performing maintenance may be the same team administering the cluster. In larger environments, however, it becomes a problem. Centralizing the cluster details and node status in your service management platform is a logical solution. Another option is detailed offline instructions for determining the active node, but the difficulty there is providing the access and instructions across a heterogeneous OS environment: the procedure for one flavor of Linux may differ from another's, and differs completely from Windows Server's.
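Centralizing this in a service management platform can be as simple as keeping each cluster's membership and active node in a structured record that tooling can query before maintenance is scheduled. A minimal sketch, assuming a hypothetical CMDB record layout (the "clusters", "nodes", and "active" fields and the node names are illustrative, not any product's schema):

```python
# Sketch: look up the standby node(s) of a cluster in a centralized
# service management record before scheduling maintenance. The record
# layout below is a hypothetical CMDB schema for illustration.

def standby_nodes(cmdb: dict, cluster_name: str) -> list:
    """Return every node in the cluster that is not currently active."""
    cluster = cmdb["clusters"][cluster_name]
    return [n for n in cluster["nodes"] if n != cluster["active"]]

# Example record, as an operations team might maintain it:
cmdb = {
    "clusters": {
        "billing-db": {
            "nodes": ["dbnode01", "dbnode02"],
            "active": "dbnode01",
        }
    }
}

print(standby_nodes(cmdb, "billing-db"))  # safe to patch: ['dbnode02']
```

Because the lookup is against the service management platform rather than the OS itself, the maintenance team gets one answer regardless of whether the cluster runs Linux or Windows Server.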

Even at the VM level, there are challenges. VMware vSphere provides affinity/anti-affinity controls to prevent clustered OSes from running on the same host, but these controls don't span vSphere clusters. So, nothing stops a capacity management team from migrating a clustered VM into the vSphere cluster that hosts its pair. Again, a potential solution is to annotate this information in the service management platform and ensure governance includes checking the service management system before applying VM migrations.
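That governance check can itself be expressed as a simple pre-migration gate. A minimal sketch, assuming hypothetical service-management data (the pairing and placement dictionaries and the VM/cluster names are invented for illustration; this is not a vSphere API):

```python
# Sketch: governance gate for a cross-cluster VM migration. Refuse the
# move if the destination vSphere cluster already hosts the VM's clustered
# pair. Both dictionaries model annotations kept in a service management
# platform, not data read from vSphere itself.

def migration_allowed(placement: dict, pairs: dict,
                      vm: str, destination: str) -> bool:
    """Allow a migration only if the VM's pair isn't in the destination."""
    pair = pairs.get(vm)
    if pair is None:
        return True  # VM is not part of a clustered pair
    return placement.get(pair) != destination

# Illustrative annotations: sql-a and sql-b are a clustered pair.
placement = {"sql-a": "cluster-1", "sql-b": "cluster-2"}
pairs = {"sql-a": "sql-b", "sql-b": "sql-a"}

print(migration_allowed(placement, pairs, "sql-a", "cluster-2"))  # False
print(migration_allowed(placement, pairs, "sql-a", "cluster-3"))  # True
```

Wiring a check like this into the change process closes the gap that vSphere's per-cluster anti-affinity rules leave open.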

Service management may seem like one of the less interesting parts of high availability, but without it, human error breaks even great technical design.

What's your story?

Do you have horror stories from when your high availability broke due to human error? Share your experiences in the comment section below.
