Cavium ThunderX2 getting significant performance boost as glibc optimizations inbound

The GNU/Linux ecosystem is embracing Arm-based server processors, as challenges to Intel's hegemonic control of enterprise compute increase.


Optimizations are coming to the GNU C Library (glibc) for Cavium's ThunderX2 Arm-powered server CPU, as a recent commit changes the behavior of MEMMOVE in glibc 2.30, expected for release around the start of August. The commit, according to Cavium developer Steve Ellcey, provides improvements of "about 20-30% for larger cases and about 1-5% for smaller cases," and uses "SIMD load/store instead of GPR for large overlapping forward moves."
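To show why overlapping moves need special handling, here is a minimal sketch in portable C. It is not the glibc code, which is hand-written AArch64 assembly. The name sketch_memmove and the 16-byte block size are assumptions made for illustration: the function picks a copy direction from the relative position of the two buffers and moves 16 bytes at a time, the width of one 128-bit SIMD register.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { BLOCK = 16 };  /* width of one 128-bit SIMD register, in bytes */

/* Hypothetical helper, not glibc code: overlap-safe copy in 16-byte blocks.
 * Comparing pointers into unrelated objects is not strictly portable C;
 * it is acceptable here only because this is an illustration. */
void *sketch_memmove(void *dst, const void *src, size_t n)
{
    assert(dst != NULL && src != NULL);
    unsigned char *d = dst;
    const unsigned char *s = src;
    unsigned char block[BLOCK];
    size_t i;

    if (d == s || n == 0)
        return dst;

    if (d < s) {
        /* Destination starts below source: copy from low to high addresses. */
        for (i = 0; i + BLOCK <= n; i += BLOCK) {
            memcpy(block, s + i, BLOCK);   /* load the whole block first */
            memcpy(d + i, block, BLOCK);   /* then store it */
        }
        for (; i < n; i++)
            d[i] = s[i];
    } else {
        /* Destination starts above source: copy from high to low addresses. */
        for (i = n; i >= BLOCK; i -= BLOCK) {
            memcpy(block, s + i - BLOCK, BLOCK);
            memcpy(d + i - BLOCK, block, BLOCK);
        }
        while (i > 0) {
            i--;
            d[i] = s[i];
        }
    }
    return dst;
}
```

Each block is fully loaded before it is stored, and the direction ensures no store lands on source bytes that have yet to be read. A compiler will typically lower the fixed-size memcpy calls to single 16-byte loads and stores where the target supports them.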

Differences in how SIMD (Single Instruction, Multiple Data) instructions are handled between Intel and Arm architectures, where the SIMD extension is called NEON, have been a primary pain point in adopting Arm-powered processors for servers. Cloudflare, which uses (now discontinued) Qualcomm Centriq servers, has worked on optimizing open-source applications in its technology stack for Arm architectures, and has published its results and code.


A 2018 post about optimizing jpegtran indicates that the NEON-optimized program ran 1.3x faster on Arm than on a comparable Xeon, though the unoptimized program was only about half as fast as on the same Xeon. The optimization work involved NEON instructions and how gcc handles intrinsics on Arm.
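For readers unfamiliar with intrinsics, the sketch below shows what NEON code looks like in C. It is not taken from the jpegtran work. The function name add_saturate_u8 is hypothetical, and the task (a saturating add over bytes) was chosen only because it maps onto a single NEON instruction. The intrinsics vld1q_u8, vqaddq_u8 and vst1q_u8 come from the standard arm_neon.h header; on non-Arm targets the guard leaves only the scalar loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical helper: out[i] = min(255, a[i] + b[i]) for n bytes. */
void add_saturate_u8(uint8_t *out, const uint8_t *a, const uint8_t *b, size_t n)
{
    assert(out != NULL && a != NULL && b != NULL);
    size_t i = 0;
#if defined(__ARM_NEON)
    /* NEON path: 16 bytes per iteration in one 128-bit register. */
    for (; i + 16 <= n; i += 16) {
        uint8x16_t va = vld1q_u8(a + i);
        uint8x16_t vb = vld1q_u8(b + i);
        vst1q_u8(out + i, vqaddq_u8(va, vb));  /* saturating add */
    }
#endif
    /* Scalar path: handles the remainder, or everything on non-Arm targets. */
    for (; i < n; i++) {
        unsigned sum = (unsigned)a[i] + b[i];
        out[i] = (uint8_t)(sum > 255 ? 255 : sum);
    }
}
```

The scalar loop processes one byte per iteration, while the NEON loop processes 16, which is the kind of gap that porting work for Arm servers tries to close.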

Other optimizations in the update, noted by Linux performance benchmarking website Phoronix, include fixes to MEMCPY for overlapping backward moves, reuse of the existing version for smaller moves, and simplified loop tails that use a "branchless overlapping sequence of fixed length load/stores, instead of branching depending on the size," according to Ellcey.
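The "branchless overlapping sequence" idea can be sketched in portable C. This is an illustration of the general technique, not the glibc assembly, and the name copy_16_to_32 and its size range are assumptions: any length from 16 to 32 bytes is covered by one 16-byte copy from the start and one from the end, and the two stores simply overlap in the middle when the length is under 32.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy n bytes, 16 <= n <= 32, with no branch on n.
 * Both loads complete before either store, so overlapping buffers are safe. */
void copy_16_to_32(unsigned char *dst, const unsigned char *src, size_t n)
{
    unsigned char head[16], tail[16];

    assert(n >= 16 && n <= 32);      /* precondition for this sketch */

    memcpy(head, src, 16);           /* first 16 bytes */
    memcpy(tail, src + n - 16, 16);  /* last 16 bytes, may overlap head */
    memcpy(dst, head, 16);
    memcpy(dst + n - 16, tail, 16);  /* overlapping store fills the rest */
}
```

The same four operations run for every length in the range, so the CPU has no size-dependent branch to mispredict. The cost is that some bytes in the middle are written twice.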

The ThunderX2 is a 64-bit ARMv8 CPU available in a variety of SKUs, from 16-core/1.6 GHz to 32-core/2.5 GHz, with eight DDR4 controllers for 16 DIMMs per socket, allowing for up to 4 TB of RAM in a dual-socket setup. Many ISVs offer ThunderX2-based solutions in a "4U in 2U" architecture, which fits four dual-socket servers in a 2U chassis for increased compute density. ThunderX2 is also used to power the Mont-Blanc supercomputer project.

While this specific fix is targeted at the ThunderX2, increased visibility of Arm-powered CPUs is important for the health of the Arm ecosystem for enterprise computing. Amazon, through the purchase of Annapurna Labs, designed and released Arm-powered Graviton servers for AWS, challenging Intel's hegemonic control of the data center. Linus Torvalds recently praised Arm servers, but also claimed the economics and ecosystem are missing; SolidRun aimed to address those concerns by releasing a developer-focused Arm workstation, which is an accessible platform for testing and optimizing applications.

Edit: Steve Ellcey submitted the pull request; the developer responsible for the glibc changes is Anton Youdkovitch, a contractor for Bellsoft.


Image: James Sanders/TechRepublic