Internet giants such as Google and Amazon run IT operations that are far larger than most enterprises even dream of, but lessons they learn from managing those humongous systems can benefit others in the industry.
At a few conferences in recent weeks, engineers from Google and Amazon revealed some of the secrets they use to scale their systems with a minimum of administrative headache.
At the Usenix LISA (Large Installation System Administration) conference in Washington, Google site reliability engineer Todd Underwood highlighted one of the company’s imperatives that may be surprising: frugality.
“A lot of what Google does is about being super-cheap,” he told an audience of systems administrators.
Google is forced to maniacally control costs because it has learned that “anything that scales with demand is a disaster if you are not cheap about it.”
As a service grows more popular, its costs must grow in a “sub-linear” fashion, he said.
“Add a million users, you really have to add less than a 1,000 quanta of whatever expense you are incurring,” Underwood said. A quantum of expense, in this sense, could be people’s time, compute resources, or power.
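To make the arithmetic behind that rule of thumb concrete, here is a minimal sketch in Python. The two cost functions, their rates and the starting user count are hypothetical and chosen only for illustration; Underwood did not describe any particular cost model. The sketch compares how many quanta of expense each model adds when a service grows by a million users, against his budget of fewer than 1,000.

```python
def linear_cost(users, per_user=0.01):
    """Cost that grows in step with demand (hypothetical rate per user)."""
    return per_user * users


def sublinear_cost(users, scale=1.0, exponent=0.5):
    """Cost that grows more slowly than demand because exponent < 1 (hypothetical)."""
    return scale * users ** exponent


def added_quanta(cost_fn, users_before, users_added=1_000_000):
    """Extra quanta of expense incurred by adding users_added users."""
    return cost_fn(users_before + users_added) - cost_fn(users_before)


BUDGET = 1_000  # the ceiling quoted above: fewer than 1,000 quanta per million users

for name, fn in [("linear", linear_cost), ("sub-linear", sublinear_cost)]:
    extra = added_quanta(fn, users_before=1_000_000)
    print(f"{name}: +{extra:,.0f} quanta, within budget: {extra < BUDGET}")
```

Under these assumed numbers the linear model blows through the budget while the square-root model stays well inside it, and the gap widens as the user base grows.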
That thinking is behind Google’s efforts not to purchase off-the-shelf routing equipment from companies such as Cisco or Juniper. Google would need so many ports that it’s more cost-effective to build its own, Underwood said.
He rejected the idea that the challenges Google faces are unique to a company of its size. For one, Google is composed of many smaller services, such as Gmail and Google+.
“The scale of all of Google is not what most application developers inside of Google deal with. They run these things that are comprehensible to each and every one of you,” he told the audience.
Another technique Google employs is to automate everything possible. “We’re doing too much of the machines’ work for them,” he said.
Ideally, an organization should get rid of its system administration altogether, and just build and innovate on existing services offered by others, Underwood said, though he admitted that’s not feasible yet.
Underwood, who has a flair for the dramatic, stated: “I think system administration is over, and I think we should stop doing it. It’s mostly a bad idea that was necessary for a long time but I think it has become a crutch.”
Google’s biggest competitor is not Bing or Apple or Facebook. Rather, it is Google itself, he said. The company’s engineers aim to make its products as reliable as possible, but that’s not their sole task. If a product is too reliable, which is to say more reliable than “five nines” (99.999 percent availability), then that service is “wasting money” in the company’s eyes.
“The point is not to achieve 100 percent availability. The point is to achieve the target availability—99.999 percent—while moving as fast as you can. If you massively exceed that threshold you are wasting money,” Underwood said.
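For a sense of how tight that target is, a 99.999 percent availability goal leaves a little over five minutes of downtime per year. The short Python sketch below works through the arithmetic. The function names and the sample downtime figure are hypothetical, and this is only one simple way to frame the idea, not a description of how Google tracks it.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_budget_minutes(availability_percent):
    """Minutes per year a service may be down and still meet its target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


def unused_budget_minutes(availability_percent, actual_downtime_minutes):
    """Downtime allowance left unspent; a large value means the target was exceeded."""
    return downtime_budget_minutes(availability_percent) - actual_downtime_minutes


for target in (99.9, 99.99, 99.999):
    print(f"{target}% -> {downtime_budget_minutes(target):.2f} minutes per year")

# Hypothetical service that was down for only one minute all year.
print(f"unused: {unused_budget_minutes(99.999, 1.0):.2f} minutes")
```

In Underwood’s framing, a large unused allowance is not a badge of honor. It represents time that could have been spent shipping changes faster.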
“Opportunity costs is our biggest competitor,” he said.
The following week at the Amazon Web Services (AWS) re:Invent conference in Las Vegas, James Hamilton, AWS’ vice president and distinguished engineer, discussed the tricks Amazon uses to scale.
Though Amazon is selective about what numbers it shares, AWS is growing at a prodigious rate. Each day, it adds the equivalent of all the compute resources (servers, routers, data center gear) that Amazon had in total in the year 2000, Hamilton said. “This is a different type of scale,” he said.
Key for AWS, which launched in 2006, was good architectural design. Hamilton admitted that Amazon was lucky to have got the architecture for AWS largely correct from the beginning.
“When you see fast growth, you learn about architecture. If there are architectural errors or mistakes made in the application, and the customers decide to use them in a big way, there are lots of outages and lots of pain,” Hamilton said.
The cost of deploying a service on AWS comes down to setting up and deploying the infrastructure, Hamilton explained. For most organizations, IT infrastructure is an expense, not the core of their business. But at AWS, engineers focus solely on driving down costs for the infrastructure.
Like Google, Amazon often builds its own equipment, such as servers. That’s not practical for enterprises, he acknowledged, but it works for an operation as large as AWS.
“If you have tens of thousands of servers doing exactly the same thing, you’d be stealing from your customers not to optimize the hardware,” Hamilton said. He also noted that servers sold through the regular IT hardware channel often cost about 30 percent more than the individual components bought directly from manufacturers.
Buying components directly not only allows AWS to cut costs for customers, but also lets the company talk with the component manufacturers about improvements that would benefit AWS.
“It makes sense economically to operate this way, and it makes sense from a pace-of-innovation perspective as well,” Hamilton said.
Beyond cloud computing, another field of IT that deals with scalability is supercomputing, in which a single machine may have thousands of nodes, each with dozens of processors. On the last day of the SC13 supercomputing conference, a panel of operators and vendors assembled to discuss scalability issues.
William Kramer, who oversees the National Center for Supercomputing Applications’ Blue Waters machine at the University of Illinois at Urbana-Champaign, noted that supercomputing is experiencing tremendous growth, driving the need for new workload scheduling tools to ensure organizations get the most from their investment.
“What is now in a chip—a single piece of silicon—is the size of the systems we were trying to schedule 15 years ago,” Kramer said. “We’ve assumed the operating system or the programmer will handle all that scheduling we were doing.”
The old supercomputing metrics of throughput seem to be fraying. This year, Jack Dongarra, one of the creators of the Linpack benchmark used to rank computers on the Top500 list, called for additional metrics to better gauge a supercomputer’s effectiveness.
Judging a system’s true efficiency can be tricky, though.
“You want to measure the amount of work going through the system over a period of time,” and not just a simplistic measure of how much each node is being utilized, Kramer said.
He noted that an organization can gauge a system’s utilization by measuring the percentage of time each node is busy. That approach can be misleading, however: slowing a workload down raises the utilization rate, yet less work goes through the system overall.
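Kramer’s point can be illustrated with a small worked example, sketched here in Python. The machine size, time window and job counts are hypothetical numbers picked for the illustration and do not come from Blue Waters or any real system. The sketch computes both metrics for the same four-node machine over eight hours, once running jobs at full speed and once running them at half speed.

```python
def utilization(busy_node_hours, nodes, window_hours):
    """Fraction of available node-hours during which nodes were busy."""
    return busy_node_hours / (nodes * window_hours)


def throughput(jobs_completed, window_hours):
    """Completed jobs per hour across the whole system."""
    return jobs_completed / window_hours


NODES = 4
WINDOW_HOURS = 8

# Hypothetical scenarios: identical jobs, each occupying one node.
scenarios = {
    "full speed": {"hours_per_job": 2, "jobs_completed": 12},
    "slowed down": {"hours_per_job": 4, "jobs_completed": 8},
}

results = {}
for name, s in scenarios.items():
    busy = s["hours_per_job"] * s["jobs_completed"]
    results[name] = (
        utilization(busy, NODES, WINDOW_HOURS),
        throughput(s["jobs_completed"], WINDOW_HOURS),
    )
    util, rate = results[name]
    print(f"{name}: utilization {util:.0%}, throughput {rate:.1f} jobs/hour")
```

With these assumed figures the slowed-down run keeps every node busy for the whole window, so it looks better on utilization, while completing fewer jobs per hour than the full-speed run.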
John Hengeveld, Intel’s director of HPC marketing, suggested the supercomputing community take a tip from manufacturers of airplane jet engines.
“At Rolls-Royce, you don’t buy a jet engine any longer, you buy hours of propulsion in the air. They ensure you get that number of hours of propulsion for the amount of money you pay. Maybe that is the way we should be doing things now,” Hengeveld said. “We shouldn’t be buying chips, we should buy results.”