
How to optimize VM memory and processor performance

Some of the techniques used to build highly scalable servers can create an unintended performance problem for VMs; one of these is NUMA node balancing. Colin Smith provides a high-level overview of the problem and some of the ways to address it.

The highly scalable server architectures available to modern datacenters have achieved unprecedented memory and CPU densities. As a result, VM density has also increased. Some of the techniques used to build highly scalable servers, however, can create an unintended performance problem for VMs. One common problem is NUMA node balancing. In this post, I'll try to provide a high-level overview of the problem and some of the ways to address it. Not all hypervisors deal with NUMA node issues in the same way, so I have kept this post hypervisor neutral. Specifics for your virtual environment are best addressed with your vendor.

What is NUMA memory?

NUMA (Non-Uniform Memory Access) hardware architectures use multiple memory buses to alleviate memory bus contention in multi-processor systems. This provides a huge scalability advantage over the traditional SMP (Symmetric Multi-Processing) model when large numbers of processors are required. The architecture maps specific processors to specific high-speed buses connected to specific pools of memory; each processor, its bus, and its pool of memory form a NUMA node. Memory in the same NUMA node as the processor is considered local memory and can be accessed relatively quickly. Memory outside of the NUMA node is considered foreign memory and takes longer to access.
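
If you want to see what this topology looks like on real hardware, here is a minimal sketch that reads the NUMA layout a Linux host publishes under /sys. The paths are Linux-specific and can vary between kernels, and a Windows or ESXi host exposes the equivalent information through its own tooling; treat this purely as an illustration of local memory versus node-to-node distance.

```python
#!/usr/bin/env python3
"""Minimal sketch: list each NUMA node's local memory and its distance
(relative access cost) to every other node on a Linux host.
The /sys paths are Linux-specific and may differ between kernels."""
import glob
import os
import re

def list_numa_nodes():
    # Each NUMA node appears as /sys/devices/system/node/nodeN
    nodes = sorted(glob.glob("/sys/devices/system/node/node[0-9]*"),
                   key=lambda p: int(re.search(r"\d+$", p).group()))
    for node_path in nodes:
        node_id = int(re.search(r"\d+$", node_path).group())
        # "Node 0 MemTotal: 16384 kB" -> memory attached locally to this node
        with open(os.path.join(node_path, "meminfo")) as f:
            mem_kb = next(int(line.split()[3]) for line in f if "MemTotal" in line)
        # Distance vector: the smallest value is this node's own (local) memory;
        # larger values are foreign nodes that are slower to reach.
        with open(os.path.join(node_path, "distance")) as f:
            distances = f.read().split()
        print(f"node{node_id}: {mem_kb // 1024} MB local, distances {distances}")

if __name__ == "__main__":
    list_numa_nodes()
```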

In the diagram above, VM0 will be fine, as each core will have sufficient local memory available. VM1 should never get assigned cores in different NUMA nodes, because a NUMA-aware hypervisor should only assign a VM to a single NUMA node. VM2 will have NUMA memory fragmentation that could affect performance, because there is insufficient local memory to satisfy the 12GB requirement.

In some cases, VMs will perform better on servers with fewer physical CPUs and the same or less memory, since each NUMA node will have more local memory. Compare a 4-processor 32GB system where each NUMA node has 8GB of local memory to a 2-processor 24GB system where each NUMA node has 12GB of local memory.
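
To make that comparison concrete, here is a small sketch using the same numbers as the example above; the node sizes and VM footprints are illustrative values, not measurements.

```python
"""Minimal sketch of the sizing comparison: given the local memory per NUMA
node, check which VM memory footprints can be served from a single node."""

def fits_locally(vm_gb, node_local_gb):
    """True if the VM's memory can be satisfied entirely from one NUMA node."""
    return vm_gb <= node_local_gb

# 4-processor, 32 GB box: each of 4 nodes has 8 GB of local memory.
# 2-processor, 24 GB box: each of 2 nodes has 12 GB of local memory.
for vm_gb in (4, 8, 12):
    print(f"{vm_gb:>2} GB VM -> fits in one node: "
          f"4p/32GB: {fits_locally(vm_gb, 8)}, 2p/24GB: {fits_locally(vm_gb, 12)}")
```

The 12GB VM only fits within a single node on the smaller 2-processor box, which is why the box with fewer sockets can come out ahead.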

How does this affect VMs?

If a VM uses memory that is not part of the same NUMA node it may have performance issues when foreign memory is required.  If you have different amounts of memory in different NUMA nodes, this can be a problem if VMs are randomly distributed across nodes.  Fortunately, modern hypervisors are NUMA aware and try to assign VMs with high memory footprints to nodes with more local memory.  There is also the option to assign a NUMA node affinity to a VM. This overrides the hypervisors' dynamic assignment of VMs to NUMA nodes.
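
The sketch below is a deliberately simplified model of that dynamic assignment: each VM is placed on the node with the most free local memory, and a VM too large for any single node ends up needing foreign memory. Real hypervisors use richer heuristics; this only illustrates the idea.

```python
"""Simplified model of NUMA-aware placement (not any vendor's actual algorithm)."""

def place_vms(node_free_gb, vm_demands_gb):
    """node_free_gb: free local memory per node (GB), reduced as VMs land.
    Returns (vm_gb, chosen_node, needs_foreign_memory) for each VM."""
    placements = []
    for vm_gb in vm_demands_gb:
        node = max(range(len(node_free_gb)), key=lambda i: node_free_gb[i])
        needs_foreign = vm_gb > node_free_gb[node]   # cannot fit locally
        node_free_gb[node] -= min(vm_gb, node_free_gb[node])
        placements.append((vm_gb, node, needs_foreign))
    return placements

# Two nodes with 12 GB of local memory each; the 12 GB VM arrives last.
for vm_gb, node, needs_foreign in place_vms([12, 12], [8, 8, 12]):
    note = " (insufficient local memory, will use foreign memory)" if needs_foreign else ""
    print(f"{vm_gb} GB VM -> node {node}{note}")
```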

Some problematic scenarios

Consider a series of dormant VMs that have NUMA affinity assignments. When they spin up, they will be assigned to the NUMA node that is designated in the affinity setting. If too many VMs are assigned to the same NUMA node, there is the potential for processor resource contention within a single node while other nodes are underutilized. Additionally, the ability to overcommit memory can exacerbate the issue in some situations. And what if the memory footprint of a VM is larger than the memory in the NUMA node? In that case some of its memory has to come from a foreign node, with the access penalty described earlier.
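
A quick sketch of that affinity pitfall, with illustrative vCPU counts and node sizes: every dormant VM carries an affinity setting pointing at node 0, so when they all start they pile onto the same node.

```python
"""Sketch of the affinity pile-up: all VMs pinned to node 0 by affinity."""

logical_processors_per_node = {0: 8, 1: 8}
pinned_vm_vcpus = [2, 2, 4, 4]   # every VM's affinity setting points at node 0

demand_node0 = sum(pinned_vm_vcpus)
status = "contention" if demand_node0 > logical_processors_per_node[0] else "ok"
print(f"node 0: {demand_node0} vCPUs on {logical_processors_per_node[0]} logical processors -> {status}")
print(f"node 1: 0 vCPUs on {logical_processors_per_node[1]} logical processors -> underutilized")
```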

The solution

There is an art to balancing the NUMA node memory and processor requirements so that VM performance is optimized. A large part of that is having a good understanding of the workloads that your VMs are running and what the impacts of poor performance might be.

In my previous post, I indicated that VM- and hypervisor-aware monitoring is important to get a true picture of VM and host performance. It is situations like NUMA affinity that traditional performance monitoring tools have trouble addressing, and these are the types of scenarios that a new breed of performance metrics helps to manage. Simply monitoring the hosts and VMs independently is not sufficient. You need to ensure that you understand the issues, that you have instrumentation in place to provide adequate telemetry, that you have thresholds and trigger points defined, and, most importantly, that you have the ability to react when they are reached.
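
As one example of the kind of trigger point I mean, the sketch below flags any VM whose share of foreign (remote-node) memory crosses a threshold. How you actually collect per-VM local and foreign memory counters is hypervisor-specific; the VM names, numbers, and 10% threshold here are placeholders.

```python
"""Sketch of a NUMA-specific trigger point: alert on excessive foreign memory."""

FOREIGN_SHARE_THRESHOLD = 0.10   # alert if more than 10% of a VM's memory is foreign

vm_memory_samples = [
    {"vm": "VM0", "local_mb": 4096, "foreign_mb": 0},
    {"vm": "VM2", "local_mb": 8192, "foreign_mb": 4096},
]

for sample in vm_memory_samples:
    total_mb = sample["local_mb"] + sample["foreign_mb"]
    foreign_share = sample["foreign_mb"] / total_mb if total_mb else 0.0
    if foreign_share > FOREIGN_SHARE_THRESHOLD:
        print(f"{sample['vm']}: {foreign_share:.0%} of memory is foreign -- check NUMA placement")
```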

About

Colin Smith is a Microsoft SCCM MVP who has been working with SMS since version 1.0. He has over 20 years of experience deploying Microsoft-based solutions for the private and public sector with a focus on desktop and data center management.

6 comments
bpr

Thanks for the article. Do you know of any tools to test and confirm that an unbalanced memory-to-CPU layout is causing performance problems?

techn0gichida

I have a question about running 16-bit processes in a VM. Programs like wowexec seem to have problems. Is there any setting for getting these types of programs to play better?

The Colin Smith

Great question. I've looked and can't find anything specifically targeted at this; let me know if you find one. It really depends on how your hypervisor works with NUMA architectures and how your hardware vendor implemented the memory interconnects (among other things). It will also depend on your specific workload, how often foreign memory is required, and the interconnect latency. It isn't hard to prove, though. Ultimately it will manifest itself as a memory latency issue (or an asymmetric processor bottleneck), so any monitoring tool that can identify those symptoms will help. It's still up to you to uncover the root cause. Here is a link to some NUMA monitoring done on Hyper-V. It really just shows the NUMA assignment and foreign memory. I'm sure there are similar ones for ESX. http://blogs.technet.com/b/winserverperformance/archive/2009/12/10/numa-node-balancing.aspx

jmarkovic32

Even on a physical machine, running 16-bit apps would be slow. The reason is that Windows, for example, will basically run the 16-bit app in a virtual machine. Add hardware virtualization on top of that and you're running a VM inside a VM, which is asking for performance problems. Have you tried running the 16-bit app on a 16-bit guest OS like DOS?

The Colin Smith

The point of the post was to expose a specific set of conditions that could impact VM performance. So you use it in the following ways: 1) understand the issue; 2) size physical systems appropriately (2 sockets with 24GB can outperform 4 sockets with 36GB), even though that is counter-intuitive; 3) be careful about overriding the hypervisor's automatic NUMA assignment by setting NUMA affinity manually.
