United Airlines grounded its planes for about an hour on Wednesday, reportedly because of a router failure. That’s a wide path of destruction for one piece of equipment, but it’s the kind of hazard that comes with networking, where every device depends on the others it connects to.
The grounding, which started around 8:30 a.m. Eastern time, caused delays across United’s routes and stranded passengers. It came on the same day that a computer-related outage halted trading on the New York Stock Exchange, and just weeks after another technology-related service interruption at United.
The airline blamed network connectivity issues, then pointed the finger at a failed router. While it might seem that redundant routers and cables would keep that kind of problem from having a nationwide impact on an airline, networking problems are rarely as simple as routing around a failure.
“The hard failure is fairly easy,” Gartner analyst Joe Skorupa said. Enterprises often have two routers in case one crashes, and very large companies like United may buy two connections from different carriers at their major facilities. But those simple failover mechanisms only work for total failures that can be immediately detected, he said.
For a wide range of other network problems, redundancy doesn’t help. That’s because routers inherently affect other routers, given that they’re all supposed to work together to get packets where they need to go.
So a router failure could mean a lot of things other than total shutdown. Often, it means a software glitch or a clumsy engineer’s configuration mistake that can spread to other routers or affect their performance. The router may have malfunctioned, but the rest of the network doesn’t know it, said Dell’Oro Group analyst Alam Tamboli. United hasn’t shared more details of its latest problem.
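The difference between a hard failure and a quieter malfunction can be sketched in a few lines of Python. This is a toy model, not a description of United’s network or of any vendor’s failover protocol: the Router class, its two flags and both selection functions are hypothetical names invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Router:
    """Toy model of a router; both flags are illustrative assumptions."""
    name: str
    powered_on: bool = True          # False models a hard failure (crash, power loss)
    forwards_correctly: bool = True  # False models a soft failure (bad config, software bug)

    def responds_to_ping(self) -> bool:
        # A liveness check only sees whether the box answers at all.
        return self.powered_on

    def delivers_test_packet(self) -> bool:
        # An end-to-end probe succeeds only if traffic actually gets through.
        return self.powered_on and self.forwards_correctly


def pick_by_liveness(primary: Router, backup: Router) -> Router:
    """Simple failover: switch only when the primary stops responding."""
    return primary if primary.responds_to_ping() else backup


def pick_by_probe(primary: Router, backup: Router) -> Router:
    """Stricter failover: switch when the primary stops delivering traffic."""
    return primary if primary.delivers_test_packet() else backup


backup = Router("backup")
crashed = Router("primary", powered_on=False)
misconfigured = Router("primary", forwards_correctly=False)

# Hard failure: both strategies fail over to the backup.
hard_liveness = pick_by_liveness(crashed, backup).name
hard_probe = pick_by_probe(crashed, backup).name

# Soft failure: liveness-based failover keeps using the broken primary.
soft_liveness = pick_by_liveness(misconfigured, backup).name
soft_probe = pick_by_probe(misconfigured, backup).name

print(hard_liveness, hard_probe, soft_liveness, soft_probe)
```

In this sketch the crashed router is replaced under either strategy, while the misconfigured one keeps carrying traffic under the liveness-only check because it still answers. Real networks are far messier, since a bad configuration can also spread to the backup itself.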
Router software upgrades are a frequent cause of widespread network breakdowns that lead to embarrassing headlines. In fact, updating network software can be such an ordeal that some enterprises keep running the same versions for years, choosing to manage the risk of security holes rather than take on the dangers of an upgrade, Skorupa said.
SDN (software-defined networking) should help to ease those dangers and reduce the number of big failures. It moves administrators away from managing one box at a time and manually typing configurations, replacing that with more centralized, programmable software control.
But SDN is only part of a bigger change that’s needed to prevent big failures, said Nick Lippis, co-founder of the Open Networking User Group.
Technology is still managed in silos of networking, storage, computing and virtualization, so IT departments often don’t see the connections between problems in each area, he said. For example, a router failure may have been caused by a storage or server problem, or it may cause those problems. IT needs big-picture managers who understand those connections.
And large enterprises are learning from online operations like Facebook and Google and adopting private and hybrid clouds because, essentially, they’ve gotten too big, Lippis said. “By and large, they’re stable at small scale, but once you get into really large scale, when things go wrong, they go wrong big.”