11 Infamous Software Bugs

Page 3 of 5

Bad PR Bugs

Some bugs are noisy: They cause explosions that destroy machines. Others are subtler in their destructiveness: They cause severe embarrassment that turns companies' good names to "Mud" and sometimes threatens the bottom line.

Pentium Chips Fail Math

In 1994, an entire line of CPUs by market leader Intel simply couldn't do their math. The Pentium floating-point flaw ensured that no matter what software you used, your results stood a chance of being inaccurate past the eighth decimal point. The problem lay in a faulty math coprocessor, also kno

Illustration: Jeffrey Pelo
wn as a floating-point unit. The result was a small possibility of tiny errors in hardcore calculations, but it was a costly PR debacle for Intel.

How did the first generation of Pentiums go wrong? Intel's laudable idea was to triple the execution speed of floating-point calculations by ditching the previous-generation 486 processor's clunky shift-and-subtract algorithm and substituting a lookup-table approach in the Pentium. So far, so smart. The lookup table consisted of 1,066 table entries, downloaded into the programmable logic array of the chip. But only 1,061 entries made it onto the first-generation Pentiums; five got lost on the way.

When the floating-point unit accessed any of the empty cells, it would get a zero response instead of the real answer. A zero response from one cell didn't actually return an answer of zero: A few obscure calculations returned slight errors typically around the tenth decimal digit, so the error passed by quality control and into production.

What did that mean for the lay user? Not much. With this kind of bug, there's a 1-in-360 billion chance that miscalculations could reach as high as the fourth decimal place. More likely, with odds of 1-to-9 billion against, was that any errors would happen in the 9th or 10th decimal digit.

But wouldn't you know it? A Virginia-based math professor named Thomas Nicely needed that level of accuracy, found he wasn't getting it and figured out why.

In October 1994, he alerted Intel, then others, to the problem. Intel retorted with a response only marginally less tactful than "Oh, that thing? Yeah, we noticed that back in June."

Thus began an inexorable slide into PR hell and a costly mop-up bill. In January 1995, Intel announced a pretax charge of $475 million against earnings, most of which apparently stemmed from replacing flawed processors.

The bottom line in this arithmetic mess is this: In lookup-table and money calculations, 1,066 - 5 = -$475,000,000. Any way you look at it, that's bad math.

Call Waiting ... and Waiting ... and Waiting

On Jan. 15, 1990, around 60,000 AT&T long-distance customers tried to place long-distance calls as usual -- and got nothing. Behind the scenes, the company's 4ESS long-distance switches, all 114 of them, kept rebooting in sequence. AT&T assumed it was being hacked, and for nine hours, the company and law enforcement tried to work out what was happening. In the end, AT&T uncovered the culprit: an obscure fault in its new software.

Here's how the switches were supposed to work: If one switch gets congested, it sends a "do not disturb" message to the next switch, which picks up its traffic. The second switch resets itself to keep from disturbing the first switch. Switch 2 checks back on Switch 1, and if it detects activity, it does another reset to reflect that Switch 1 is back online. So far, so simple.

The month before the crash, AT&T tweaked the code to speed up the process. The trouble was, things were too fast. The first server to overload sent two messages, one of which hit the second server just as it was resetting. The second server assumed that there was a fault in its CCS7 internal logic and reset itself. It put up its own "do not disturb" sign and passed the problem on to a third switch.

The third switch also got overwhelmed and reset itself, and so the problem cascaded through the whole system. All 114 switches in the system kept resetting themselves, until engineers reduced the message load on the whole system and the wave of resets finally broke.

In the meantime, AT&T lost an estimated $60 million in long-distance charges from calls that didn't go through. The company took a further financial hit a few weeks later when it knocked a third off its regular long-distance rates on Valentine's Day to make amends with customers.

Windows Genuine Disadvantage

Artwork: Chip Taylor
Introduced in 2006, Windows Genuine Advantage was never a popular initiative with Microsoft's customers. Consumers had trouble seeing the advantages: It did nothing to help the security or stability of a legitimate Windows installation. All it did was help Microsoft root out software piracy.

In that task, it was as vigilant as, well, a vigilante. In fact, in late-August 2007, it found piracy everywhere it looked -- even among thousands of legitimate Windows customers.

On Friday, Aug. 24, someone on the WGA team accidentally installed bug-filled preproduction software on the WGA servers. The team quickly rolled back to a tested release of the software, but they didn't check that their fix actually addressed the problem. It didn't. So for 19 hours, until around 3 p.m. the following day, the server flagged thousands of WGA clients across the globe as illegal.

Windows XP customers were told they were running pirated software. Windows Vista customers were slapped harder: They had features turned off, including the eye candy Aero theme and support for ReadyBoost virtual RAM drives.

The first official response to complaints didn't help much: Disgruntled patrons were advised to try to revalidate on Tuesday. But even when the problem was fixed, mid-Saturday afternoon, Vista clients still had to revalidate their Windows installations before they could ReadyBoost their way back into Aero.

OK, so this was a relatively mild issue in engineering terms, and strictly speaking, it was caused by human error. But the error in question was deploying buggy, untested software, and when you factor in the number of people affected, the level of anger induced and the knock-on effect of bad publicity, it was more severe than it seems at first glance.

| 1 2 3 4 5 Page 3
Shop Tech Products at Amazon