Supercomputer Race: Tricky to Boost System Speed

Every June and November, with fanfare lacking only in actual drum rolls and trumpet blasts, a new list of the world's fastest supercomputers is revealed. Vendors brag, and the media reach for analogies such as "It would take a patient person with a handheld calculator x number of years (think millennia) to do what this hunk of hardware can spit out in one second."

The latest Top500 list, released in June, was seen as especially noteworthy because it marked the scaling of computing's then-current Mount Everest -- the petaflops barrier. Dubbed "Roadrunner" by its users, a computer built by IBM for Los Alamos National Laboratory in New Mexico topped the list of the 500 fastest computers, burning up the bytes at 1.026 petaflops, or more than 1,000 trillion arithmetic operations per second.

A computer to die for if you are a supercomputer user for whom no machine ever seems fast enough? Maybe not.

Richard Loft, director of supercomputing research at the National Center for Atmospheric Research in Boulder, Colo., says he doubts Roadrunner would operate at more than 2% of its peak rated performance on NCAR's ocean and climate models. That would bring it in at 20 to 30 teraflops -- no slouch, to be sure, but so far short of that petaflops goal as to seem more worthy of the nickname "Roadwalker."
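The arithmetic behind that estimate is simple enough to sketch. The snippet below is only an illustration: it applies Loft's 2% figure to the 1.026-petaflops Linpack result, because the article does not give Roadrunner's theoretical peak, and the function name is invented for this example.

```python
# Figures quoted in the article. Using the Linpack result as a stand-in for
# the machine's peak rating is an assumption made for this sketch.
LINPACK_PETAFLOPS = 1.026
SUSTAINED_FRACTION = 0.02  # Loft's "no more than 2%" estimate


def sustained_teraflops(rated_petaflops, fraction):
    """Convert a petaflops rating and an efficiency fraction to teraflops."""
    return rated_petaflops * 1000.0 * fraction


estimate = sustained_teraflops(LINPACK_PETAFLOPS, SUSTAINED_FRACTION)
print(f"{estimate:.1f} teraflops")
```

That lands at the low end of the 20-to-30-teraflops range; a higher peak rating would push the same 2% toward the upper end.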

"The Top500 list is only useful in telling you the absolute upper bound of the capabilities of the computers," Loft says. "It's not useful in terms of telling you their utility in real scientific calculations."

The problem, he says, is that placement on the Top500 list is determined by performance on a decades-old benchmark called Linpack, which is Fortran code that measures the speed of processors on floating-point math operations -- for example, multiplying two long decimal numbers. It's not meant to rate the overall performance of an application, especially one that does a lot of interprocessor communication or memory access.

Moreover, users and vendors seeking fame high on the list go to great pains to tweak their systems to run Linpack as fast as possible -- a tactic permitted by the people who compile the list.

The computer models at NCAR simulate the flow of fluids over time by dividing a big space -- the Pacific Ocean, say -- into huge grids and assigning each cell or group of cells in the grid to a specific processor in a supercomputer.
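A minimal sketch of that kind of grid partitioning follows. It is not NCAR's code: the grid size, the 2-by-3 processor layout and the function names are all assumptions chosen for illustration. It splits a two-dimensional grid into near-equal rectangular blocks and records which processor owns which block.

```python
def block_ranges(n_cells, n_parts):
    """Split range(n_cells) into n_parts contiguous, near-equal pieces."""
    base, extra = divmod(n_cells, n_parts)
    ranges = []
    start = 0
    for part in range(n_parts):
        # The first `extra` parts take one additional cell each.
        size = base + (1 if part < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


def decompose(n_rows, n_cols, proc_rows, proc_cols):
    """Map each processor rank to the (row range, column range) it owns."""
    row_ranges = block_ranges(n_rows, proc_rows)
    col_ranges = block_ranges(n_cols, proc_cols)
    owners = {}
    for pr, rows in enumerate(row_ranges):
        for pc, cols in enumerate(col_ranges):
            owners[pr * proc_cols + pc] = (rows, cols)
    return owners


# Hypothetical 10 x 12 grid shared among 6 processors (2 rows of 3).
owners = decompose(n_rows=10, n_cols=12, proc_rows=2, proc_cols=3)
for rank, (rows, cols) in owners.items():
    print(rank, rows, cols)
```

Each range is half-open, so processor 0 owns rows 0 through 4 and columns 0 through 3. In a real model, cells on the edge of a block also need values from neighboring blocks, which is where the interprocessor messages come from.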

It's nice to have that processor run very fast, of course, but getting to the end of a 100-year climate simulation requires an enormous number of memory accesses by a processor, something that typically happens much more slowly. In addition, some applications require passing many messages from one processor to another, which can also be relatively slow.

So, for many applications, the bandwidth of the communications network inside the box is far more important than the floating-point performance of its processors. That's even more true for business applications, such as online search or transaction processing.

An even greater bottleneck can crop up in programs that can't easily be broken into uniform, parallel streams of instructions. If a processor gets more than its fair share of work, all the others may wait for it, reducing the overall performance of the machine as seen by the user. Linpack operates on the cells of matrices, and by making the matrices just the right size, users can keep every processor uniformly busy and thereby chalk up impressive performance ratings for the system overall.
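The cost of an uneven split can be shown with a toy calculation. The sketch below assumes a simplified model in which every processor must finish its share before any can move on; the workloads are made-up numbers, not measurements from any machine.

```python
def parallel_efficiency(work_per_processor):
    """Return (wall_time, efficiency) for one synchronized step.

    Each entry is the time one processor needs for its share of the work.
    All processors wait for the slowest one, so the wall time is the maximum,
    and efficiency is useful busy time divided by total processor time.
    """
    wall_time = max(work_per_processor)
    busy_time = sum(work_per_processor)
    efficiency = busy_time / (wall_time * len(work_per_processor))
    return wall_time, efficiency


balanced = [10.0, 10.0, 10.0, 10.0]
skewed = [7.0, 7.0, 7.0, 19.0]  # same total work, one overloaded processor

for name, work in (("balanced", balanced), ("skewed", skewed)):
    wall, eff = parallel_efficiency(work)
    print(f"{name}: wall time {wall:.0f}, efficiency {eff:.0%}")
```

Both cases contain 40 units of work, but the skewed one takes 19 time units instead of 10, and the processors spend nearly half their time idle.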

"As long as we continue to focus on peak floating-point performance, we are missing the actual hard problem that is holding up a lot of science," Loft says.

Tackling the 'Hard Problem'

But the "hard problem" is getting the attention of computer and chip makers. IBM, which makes the Blue Gene family of supercomputers, has taken a systems approach.

Rather than cobbling together commodity processors with commodity interconnects like Ethernet or InfiniBand -- an approach that others have used -- IBM built five proprietary networks inside Blue Gene, each optimized for a specific kind of work and selectable by the programmer. Members of the Blue Gene family held the No. 1 and No. 2 positions on the Top500 list until June of this year.

Making memory access faster and smarter allows the absolute amount of memory in a system to be reduced, says Dave Turek, vice president of Deep Computing at IBM. As engineers work to build "exascale" computers (a thousand times faster than Roadrunner), that will be essential, he says.

"Going back a few years, you'd build a computer with the fastest processors possible and the most memory possible, and life was good," Turek says. "The question is, how much memory do you need to put on an exascale system? If you want to preserve the kinds of programming models you've had to this point, you'd better have a few hundred million dollars in your pocket to pay for that memory."

And it isn't just the purchase cost of memory that's a problem, Turek notes. Memory draws a lot of expensive power and generates a lot of heat that must be removed by expensive cooling systems.

Faster memory subsystems and faster interconnects will help, Turek says, but supercomputer users will also have to overhaul the programming methods that have evolved over the past 20 years if they hope to utilize the power of exascale computers.

He says users initially criticized Blue Gene for having too little memory, but eventually they were able to scale their applications to run well on 60,000 processors by changing the algorithms in their application code so they were more sparing in their memory use.

Beep! Beep!

IBM calls Roadrunner, which cost Los Alamos $120 million, a "hybrid" architecture because it uses three kinds of processors. Basic computing is done on an off-the-shelf, 3,250-node network, with each node consisting of two dual-core Opteron microprocessors from Advanced Micro Devices Inc.

But Roadrunner's magic comes from a network of 13,000 "accelerators" in the form of Cell Broadband Engines originally developed for the Sony PlayStation 3 video game console and later enhanced by IBM. Each Cell chip contains an IBM Power processor core surrounded by eight simple processing elements.

The Cells are optimized for image processing and mathematical operations, which are central to many scientific applications. A Cell can work on all the elements in a well-defined string or vector, ideal for the matrix math in the Linpack benchmark. Los Alamos says the Cells speed up computation by a factor of four to nine over what the Opterons alone could do. Nevertheless, the lab says it expects its production programs to run at sustained speeds of 20% to 50% of the celebrated 1-petaflops benchmark result.

The advantages of using three kinds of processors come at a cost. Just as the Linpack code had to be optimized for the machine, so do most other programs. A recent report from Los Alamos said this of the effort required to get an important simulation tool to run on Roadrunner: "Accelerating the Monte Carlo code called Milagro took many months, several false starts and modifications of 10% to 30% of the code." But in the end, the lab said, Milagro ran six times faster with the Cell chips than without them, and that was "a crucial achievement for the acceptance of Roadrunner."

Andrew White, Roadrunner project director at Los Alamos, told Computerworld that the effort to port and optimize code for Roadrunner was "less than we thought it would be" after programmers got some experience with it. A program with "tens of thousands of lines of code" is taking about one man-year to get going on the supercomputer, he said.

Invoking Specialization

University of Tennessee computer science professor Jack Dongarra is one of the developers of the Linpack benchmark and a co-publisher of the Top500 report. He calls Roadrunner a "general-purpose computer" but one that, because of its hybrid architecture, "specializes in what it can do." Invoking that specialization is not trivial, he admits.

"If you are writing a program for Roadrunner, you essentially have to write three programs -- one for the AMD Opteron processor, one for the Power core that's on the Cell chip and one for the vector units in the Cell chip," he says. "The only way to get to a point where you'd be happy with the performance is to rewrite your old applications. The guys at Los Alamos believe that they can in fact benefit by rewriting their code."

Dongarra says a computer at the top of the Top500 list will typically spend six years on the list before falling off the bottom, and he doesn't expect Roadrunner's hybrid Opteron/Power/Cell architecture to stay on top for long.

"The trend is to large numbers of [processor] cores on a single die," he says. "And it looks like we'll have this one chip with different kinds of cores on it. We might have cores that specialize in floating point, ones that specialize in graphics and those that are more commodity-based." Exploiting that flexibility so the chip is, in essence, tuned for a specific application domain, such as climate modeling, will require software tools that do not yet exist, he says.

Intel Corp. is doing as Dongarra suggests -- developing specialized microprocessor cores and the software tools to exploit them. It's also responding to Loft's plea for faster memory access.

Bandwidth aside, memory will have to be more power-efficient if exascale computers are to draw reasonable amounts of power, says Steve Pawlowski, an Intel senior fellow. He says both objectives can be met in part by building bigger on-chip cache memories that act as very fast buffers between processor cores and dynamic RAM.
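Why the size of such a buffer matters can be seen in a toy simulation. The sketch below is not Intel's design: it assumes a simple least-recently-used page cache with no prefetching, and the 10-nanosecond and 100-nanosecond latencies are invented round numbers.

```python
from collections import OrderedDict


def average_latency(page_accesses, cache_pages, cache_ns, dram_ns):
    """Average access time with an LRU page cache in front of slow DRAM."""
    cache = OrderedDict()  # page -> True, ordered oldest to newest use
    total_ns = 0.0
    for page in page_accesses:
        if page in cache:
            cache.move_to_end(page)  # mark as most recently used
            total_ns += cache_ns
        else:
            total_ns += dram_ns  # miss: fetch the page from DRAM
            cache[page] = True
            if len(cache) > cache_pages:
                cache.popitem(last=False)  # evict least recently used
    return total_ns / len(page_accesses)


# Hypothetical workload: 100 accesses cycling over the same four pages.
accesses = [0, 1, 2, 3] * 25

small = average_latency(accesses, cache_pages=2, cache_ns=10.0, dram_ns=100.0)
large = average_latency(accesses, cache_pages=4, cache_ns=10.0, dram_ns=100.0)
print(f"2-page cache: {small:.1f} ns, 4-page cache: {large:.1f} ns")
```

With room for only two pages, every access in this pattern misses and pays the full DRAM latency. Once the cache holds the whole working set, only the first four accesses miss and the average falls to 13.6 nanoseconds, so the processor sees something close to fast memory even though the DRAM behind it is slow.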

"If you can cache a significant number of DRAM pages, the machine thinks it's talking to flat DRAM at high speeds, and you can populate behind it much slower and more power-efficient DRAMs," he says. "You want the cache big enough to hide the [memory] latency, and you want to be clever in how you manage the pages by doing page prefetching and things like that."

He says Intel is also working on increasing the communication bandwidth of the individual pins that connect the processor chip to the memory controller. "I'd like to push the memory bandwidth to be 10 times greater than it is today by 2013 or 2014," Pawlowski says. "The engineers working for me say I'm crazy, but it's a goal."

In the meantime, Intel and others are working on two other possibilities -- very high-speed communication via silicon photonics (light) and "3-D die-stacking," which creates a dense sandwich of CPU and DRAM. Both technologies have been proved in labs but have not yet been shown to be economically viable for manufacturers, Pawlowski says.

Petaflops, peak performance, benchmark results, positions on a list -- "it's a little shell game that everybody plays," says NCAR's Loft. "But all we care about is the number of years of climate we can simulate in one day of wall-clock computer time. That tells you what kinds of experiments you can do." State-of-the-art systems today can simulate about five years per day of computer time, he says, but some climatologists yearn to simulate 100 years in a day.

"The idea," Loft says, "is to get an answer to a question before you forget what the question is."
