Netflix open source FlameScope CPU tool helps developers debug performance issues

The new visualization tool instantly generates flame graphs from sections of system profiles.

Building a slide deck, pitch, or presentation? Here are the big takeaways:
  • Netflix created the FlameScope tool to solve a latency problem in a microservice. The tool has now been released publicly, with features added to make it applicable to other use cases.
  • The tool could help programmers and operators pinpoint the origin of short-lived performance issues.

Netflix's cloud performance engineering team has released FlameScope, a performance visualization utility that lets programmers and system administrators analyze CPU activity. The tool renders a profile as a subsecond-offset heat map. When the user selects an arbitrary span of time on the heat map, FlameScope generates a flame graph for the corresponding block of time.
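To make the idea of a subsecond-offset heat map concrete, here is a minimal Python sketch. It is not FlameScope's code: the function name `subsecond_heatmap`, the `rows` parameter, and the plain list of sample timestamps are assumptions made for illustration. Each column is one whole second of the profile, each row is a slice of that second, and each cell counts the samples that landed there.

```python
import math


def subsecond_heatmap(timestamps, rows=50):
    """Bin sample timestamps (in seconds) into a heat map grid.

    Returns a list of columns, one per whole second of the profile.
    Each column is a list of `rows` counts, one per subsecond-offset
    bucket. Hypothetical helper for illustration only.
    """
    if not timestamps:
        return []
    first_second = math.floor(min(timestamps))
    last_second = math.floor(max(timestamps))
    grid = [[0] * rows for _ in range(last_second - first_second + 1)]
    for ts in timestamps:
        second = math.floor(ts)
        offset = ts - second  # fraction of the second, in [0, 1)
        row = min(int(offset * rows), rows - 1)
        grid[second - first_second][row] += 1
    return grid


if __name__ == "__main__":
    # Six made-up samples spread over three seconds, ten rows per second.
    samples = [10.01, 10.03, 11.5, 11.51, 11.99, 12.25]
    for column in subsecond_heatmap(samples, rows=10):
        print(column)
```

A burst of CPU activity that lasts a fraction of a second shows up as a cluster of high counts in a few adjacent cells, which is what makes it easy to spot and select.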

According to a Netflix blog post, the tool was originally developed to solve a particular problem at Netflix. A microservice was experiencing spikes in latency approximately every 15 minutes. The team found a correlation between the latency spikes and an increase in CPU utilization that lasted only a few seconds at a time, but further troubleshooting was hampered by the difficulty of generating flame graphs for a problem that occurred so infrequently and so briefly. A one-minute flame graph covered too short a window to reliably capture the spike in CPU utilization, and flame graphs for longer periods were ineffective because the issue became indistinguishable from the normal workload, the post said.

FlameScope automates the task of selecting ranges in a CPU profile for visualization as flame graphs. According to the post, the tedium of doing this by hand was the impetus for creating the tool:

I began by slicing it into ten second ranges, and creating a flame graph for each. This approach looked promising as it revealed variation, so I sliced it even further down to one second windows. Browsing these short windows solved the problem and found the issue, however, it had become a laborious task. I wanted a quicker way.
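The manual slicing described in the quote can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the author's actual script: the function name `fold_by_window` and the sample layout (a timestamp paired with a tuple of frame names, root first) are hypothetical. The output uses the semicolon-separated "folded" stack form that flame graph generators commonly take as input.

```python
import math
from collections import Counter, defaultdict


def fold_by_window(samples, width=1.0):
    """Group stack samples into fixed-width time windows.

    `samples` is an iterable of (timestamp_seconds, stack) pairs, where
    `stack` is a tuple of frame names ordered from root to leaf.
    Returns {window_index: Counter({"root;...;leaf": sample_count})}.
    Hypothetical helper for illustration only.
    """
    windows = defaultdict(Counter)
    for ts, stack in samples:
        index = math.floor(ts / width)
        windows[index][";".join(stack)] += 1
    return dict(windows)


if __name__ == "__main__":
    # Made-up samples: the "gc" frame appears only in the second window.
    samples = [
        (0.2, ("main", "work")),
        (0.7, ("main", "work")),
        (1.1, ("main", "gc")),
        (1.9, ("main", "work")),
    ]
    for index, folded in sorted(fold_by_window(samples, width=1.0).items()):
        print(f"window {index}:")
        for stack, count in folded.items():
            print(f"  {stack} {count}")
```

Each window would then be rendered as its own flame graph. With one-second windows over a long profile that means hundreds of graphs to browse, which is the laborious step the quote describes.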


As this is the initial release, additional features are planned. The authors are actively soliciting outside contributors to implement features and new ideas that would make the utility more general-purpose. The project is written in Python and JavaScript/Node.js.

Presently, FlameScope handles only data from perf on Linux, though support for other profile sources is planned, the post noted. Additional interactive features, such as palette selection and data transformations, as well as the ability to export the resulting flame graph as an SVG, are also priorities.

Netflix has a long history of releasing utilities developed internally for debugging and performance analysis as open source software. Netflix's Chaos Engineering concept and Simian Army suite of resiliency testing tools have been widely adopted inside and outside of technology firms. While Google, Amazon, Microsoft, Dropbox, and Yahoo have adopted Chaos principles in their operations, so have the University of California, Sandia National Labs, Fidelity Investments, and O'Reilly Media.


By James Sanders

James Sanders is a technology writer for TechRepublic. He covers future technology, including quantum computing, AI, and 5G, as well as cloud, security, open source, mobility, and the impact of globalization on the industry, with a focus on Asia.