The person who became known on the Internet for yelling at servers is now becoming famous for a related feat: creating a new type of data visualization for characterizing system performance.
Brendan Gregg, lead performance engineer at cloud provider Joyent, has developed a visualization technique called a flame graph that can be effective for charting how system resources such as CPUs and memory are used. It has subsequently been picked up by a number of engineers who have used it to enhance popular diagnostic tools such as DTrace and Windows XPerf.
Gregg explained how the flame graph works Thursday at the USENIX LISA (Large Installation System Administration) conference in Washington, D.C. Flame graphs could save hours of diagnostic time for system administrators, performance engineers, support staff and others trying to figure out why a system is running more slowly than expected.
“We’ve had stack traces for a long while, but what Brendan has done has given us a really fast way of seeing aspects that weren’t easily visible before,” said one attendee of the presentation, noting that flame graphs would have come in handy for him at work during a recent dispute with a software vendor over a performance issue.
The vendor might have been able to solve the problem in a few hours using a flame graph rather than the three weeks it ended up taking, he said.
Gregg’s expertise lies in the area of measuring system performance. His book on the topic was published this year by Prentice Hall.
In 2008, Gregg, then an employee at Sun Microsystems, attracted attention for showing how disk I/O could be slowed by sudden loud noises, a fact he demonstrated by yelling, quite loudly, at a server. The resulting vibrations slowed the disks.
Gregg posted the demonstration as a YouTube video to show off latency heat maps, a new type of visualization he had created to chart system latency. The video went viral in the IT community.
The flame graph came about “under duress,” Gregg said. A customer had voiced concern over an application that was running about 40 percent slower than expected. To investigate the problem, Gregg had to sort through 500,000 lines of diagnostic data. He quickly realized it was far too much data to easily comprehend.
Inspired by visualization guru Edward Tufte, Gregg brainstormed ways to visualize the entire data set within a single screen. What he came up with “merged and collapsed together the common elements,” while preserving each element's relative share of the resources consumed.
What is a flame graph?
Flame graphs, like the one shown at the top of the story, are composed of stacked rows of horizontal bars. Each bar represents a function in a sampled stack trace, and each row represents a level of stack depth: a bar sits on top of the function that called it, so the bottom rows hold the functions at the root of the call chain and the top rows hold the ones called most recently. A row might have multiple bars, and the width of each bar shows the share of samples, and so of the resource being measured, in which that function appeared. The horizontal axis does not show the passage of time.
For a flame graph representing CPU usage, the top bars show what software functions were being executed at the time the data was captured.
CPU flame graphs are built on stack traces, which list the chain of function calls active on the CPU at the moment each sample is taken. The flame graph’s hierarchical presentation of that data captures the flow of execution on a processor.
Examining a graph, an administrator can visually trace which functions are called by other functions. Scanning across different rows can reveal which functions of an individual program, or at a higher level which of a number of concurrently running programs on a machine, are gobbling up a disproportionate amount of the CPU’s attention.
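The merging step behind a flame graph can be illustrated with a short Python sketch. It is not Gregg's program: the sample stacks and the function names fold_stacks, frame_widths and render are hypothetical, chosen only to show how identical stack traces can be collapsed into counts and how stacks that share the same callers can be merged so that each function ends up with a width proportional to the samples it appears in.

```python
from collections import Counter

# Hypothetical sampled stack traces: root function first, on-CPU function last.
SAMPLES = [
    ("main", "parse", "read_file"),
    ("main", "parse", "read_file"),
    ("main", "parse", "tokenize"),
    ("main", "render"),
]


def fold_stacks(samples):
    """Collapse identical stack traces into a mapping of stack -> sample count."""
    return Counter(tuple(stack) for stack in samples)


def frame_widths(folded):
    """Merge stacks that share the same chain of callers.

    Returns {call path: percent of all samples containing that path},
    which stands in for the width of that function's bar.
    """
    total = sum(folded.values())
    counts = Counter()
    for stack, n in folded.items():
        # Every prefix of a stack is a frame that was active in these samples.
        for depth in range(1, len(stack) + 1):
            counts[stack[:depth]] += n
    return {path: 100.0 * n / total for path, n in counts.items()}


def render(widths):
    """Return one text line per frame, indented by its stack depth."""
    lines = []
    for path in sorted(widths):
        indent = "  " * (len(path) - 1)
        lines.append(f"{indent}{path[-1]} {widths[path]:.0f}%")
    return lines


if __name__ == "__main__":
    print("\n".join(render(frame_widths(fold_stacks(SAMPLES)))))
```

In this made-up data set, main appears in every sample, parse in three of the four and read_file in two, so their bars would span 100, 75 and 50 percent of the graph's width. A real tool would draw those widths as rectangles rather than printing indented text.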
Other flame graphs can be constructed to show how resources are being divided up in memory or with disk I/O.
Others have built programs that use flame graphs to visualize data created by popular performance tools, such as DTrace, Windows XPerf, OS X Instruments, Perl performance tools and Google Chrome Developer Tools.
Gregg said that Dave Pacheco’s Node.js implementation for DTrace may even become the canonical flame-graph application, given that it is more advanced than Gregg’s own program.
Beyond flame graphs, Gregg is working on another visualization called frequency trails, an R-based data rendering that shows the characteristics of the outliers in a set of data, which can be useful in determining severe performance issues in cloud computing operations, he said.
Gregg is not a visual person by nature, he said in an interview after his presentation. He is most comfortable with the Unix command line. But the very nature of today’s large distributed systems demands visual aids.
“On a cloud, I need to understand 1,000 servers and I need to understand them right now. Visualization is necessary to do our jobs these days,” he said.