
Meltdown fix's 'massive overhead' will slow Linux systems, warns Netflix engineer

Brendan Gregg describes the impact of updates to the Linux kernel that work around Meltdown as demonstrating the "largest kernel performance regressions I've ever seen".

Building a slide deck, pitch, or presentation? Here are the big takeaways:
  • Changes to the Linux kernel to mitigate the impact of Meltdown have been found to slow systems, adding between 1% and 800% overhead depending on the workload.
  • Systems that make a large number of syscalls or have high page fault rates are hit hardest.

A Netflix engineer has warned of the potentially "massive overhead" of patching Linux-based systems against the Meltdown CPU flaw.

Brendan Gregg found that updates to the Linux kernel to mitigate the risk from Meltdown added anywhere between 1% and 800% overhead, depending on the nature of the workload.

Meltdown and Spectre are vulnerabilities in modern chip design that could allow attackers to bypass system protections on nearly every recent PC, server and smartphone, allowing hackers to read sensitive information, such as passwords, from memory.

Describing the impact of the kernel page table isolation (KPTI) patches that work around Meltdown, Gregg said they resulted in the "largest kernel performance regressions I've ever seen".

"Where you are on that spectrum depends on your syscall and page fault rates, due to the extra CPU cycle overheads, and your memory working set size, due to TLB flushing on syscalls and context switches," he writes, going on to assess the likely impact on Netflix's AWS-based systems.

"Practically, I'm expecting the cloud systems at my employer (Netflix) to experience between 0.1% and 6% overhead with KPTI due to our syscall rates, and I'm expecting we'll take that down to less than 2% with tuning."

The severity of the performance impact of the KPTI patches is determined by:

Syscall rate: The overhead climbs with the syscall rate, with Gregg estimating that at 50k syscalls/sec per CPU it may be 2%.

Context switches: The impact scales in a similar fashion to the syscall rate.

Page fault rate: High rates will again add overhead.

Working set size (hot data): More than 10MB will add overhead due to TLB (translation lookaside buffer) flushing, potentially turning a 1% overhead into a 7% overhead.

Cache access patterns: A workload that switches to access patterns with less efficient caching can, in the worst case, suffer a 10% performance overhead.
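For readers who want a rough idea of where a machine sits on two of these factors, the following is a minimal Python sketch, not taken from Gregg's post. It assumes a Linux host where /proc/stat exposes a cumulative "ctxt" (context switch) counter and /proc/vmstat exposes a cumulative "pgfault" counter, and the function names are illustrative. Syscall rates are not available from these files and would need a tracing tool such as perf.

```python
import os
import time


def read_counter(text, key):
    """Return the integer that follows `key` in /proc-style 'key value' text."""
    for line in text.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[0] == key:
            return int(parts[1])
    raise KeyError(key)


def per_cpu_rate(before, after, seconds, cpus):
    """Events per second per CPU between two samples of a cumulative counter."""
    return (after - before) / seconds / cpus


def sample():
    """Read the cumulative context switch and page fault counters (Linux only)."""
    with open("/proc/stat") as f:
        ctxt = read_counter(f.read(), "ctxt")
    with open("/proc/vmstat") as f:
        pgfault = read_counter(f.read(), "pgfault")
    return ctxt, pgfault


def main(interval=5.0):
    """Sample twice, `interval` seconds apart, and print per-CPU rates."""
    cpus = os.cpu_count() or 1
    ctxt0, fault0 = sample()
    time.sleep(interval)
    ctxt1, fault1 = sample()
    print("context switches/sec per CPU:",
          round(per_cpu_rate(ctxt0, ctxt1, interval, cpus)))
    print("page faults/sec per CPU:",
          round(per_cpu_rate(fault0, fault1, interval, cpus)))
```

Calling main() on a Linux host prints system-wide averages over the sampling interval. These figures only indicate whether a workload is likely to be sensitive to KPTI; they do not measure the overhead itself.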


The impact on performance of the KPTI patches as the number of syscalls increases.

Image: Brendan Gregg

To reduce the impact of KPTI on Linux-based systems, Gregg recommends a series of measures, including using Linux kernel 4.14 with PCID support, huge pages (which can also provide some gains), and syscall reductions, all of which he outlines in more detail in his post.
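As a rough illustration of how an administrator might check whether these measures apply to a given machine, here is a minimal Python sketch. It is not from Gregg's post and rests on several assumptions: an x86 Linux host whose /proc/cpuinfo lists CPU features on a "flags" line, a kernel recent enough to report Meltdown status under /sys/devices/system/cpu/vulnerabilities/meltdown, and transparent huge pages as the huge page mechanism of interest. The function names are illustrative.

```python
def has_cpu_flag(cpuinfo_text, flag):
    """True if `flag` appears on the first 'flags' line of /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        name, _, value = line.partition(":")
        if name.strip() == "flags":
            return flag in value.split()
    return False


def thp_mode(enabled_text):
    """Return the bracketed choice in text such as 'always [madvise] never'."""
    for token in enabled_text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    return None


def read_or_none(path):
    """Return the stripped contents of `path`, or None if it cannot be read."""
    try:
        with open(path) as f:
            return f.read().strip()
    except OSError:
        return None


def report():
    """Print PCID support, Meltdown mitigation status and THP mode (Linux only)."""
    cpuinfo = read_or_none("/proc/cpuinfo") or ""
    print("PCID supported:", has_cpu_flag(cpuinfo, "pcid"))
    # Older kernels may not provide this file, in which case None is printed.
    print("Meltdown status:",
          read_or_none("/sys/devices/system/cpu/vulnerabilities/meltdown"))
    thp = read_or_none("/sys/kernel/mm/transparent_hugepage/enabled")
    print("Transparent huge pages:", thp_mode(thp) if thp else None)
```

Calling report() prints what the kernel itself advertises. A missing "pcid" flag suggests the TLB flushing cost described above is likely to be higher on that machine.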

Gregg added that the actual performance impact of protecting Linux-based systems against Meltdown and Spectre would be higher, as the KPTI changes are only one part of a series of updates that guard against the vulnerabilities. Alongside these, there have been Intel firmware updates, cloud provider hypervisor changes, and Retpoline compiler changes, all of which would likely have a further impact on performance.

In the hurry to issue patches, there have been multiple instances of Spectre- and Meltdown-related updates causing instability and performance issues in computers, particularly Intel firmware updates for variant 2 of the Spectre flaw.

Intel is working on a new design for its processors to mitigate the threat posed by the Spectre and Meltdown vulnerabilities, according to CEO Brian Krzanich, and AMD is doing the same to reduce the risk from Spectre.

IBM has also released patches for Meltdown and Spectre for systems running on its Power family of processors. The range of operating system and firmware updates is available via IBM, although Power4, Power5 and Power6 series systems are out of support and will not be patched by IBM.


About Nick Heath

Nick Heath is chief reporter for TechRepublic. He writes about the technology that IT decision makers need to know about, and the latest happenings in the European tech scene.
