Study: Hard Drive Failure Rates Much Higher Than Makers Estimate
Customers replace disk drives at rates far higher than those suggested by the estimated mean time between failures (MTBF) supplied by drive vendors, according to a study of about 100,000 drives conducted by Carnegie Mellon University.
The study, presented last month at the 5th USENIX Conference on File and Storage Technologies in San Jose, also shows no evidence that Fibre Channel (FC) drives are any more reliable than less expensive but slower performing Serial ATA (SATA) drives.
That surprising comparison of FC and SATA reliability could speed the trend away from FC to SATA drives for applications such as near-line storage and backup, where storage capacity and cost are more important than sheer performance, analysts said.
At the same conference, another study of more than 100,000 drives in data centers run by Google indicated that temperature seems to have little effect on drive reliability, even as vendors and customers struggle to keep temperature down in their tightly packed data centers. Together, the results show how little information customers have for predicting the reliability of disk drives in actual operating conditions and for choosing among various drive types.
Real World vs. Data Sheets
The Carnegie Mellon study examined large production systems, including high-performance computing sites and Internet services sites running SCSI, FC and SATA drives. The data sheets for those drives listed MTBF of between 1 million and 1.5 million hours, which the study said should mean annual failure rates "of at most 0.88%." However, the study showed typical annual replacement rates of between 2% and 4%, "and up to 13% observed on some systems."
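The arithmetic behind the 0.88% figure can be sketched in a few lines of Python. This is an illustration, not the study's own code: it assumes a drive is powered on for all 8,760 hours of a year and uses the simple approximation of hours per year divided by MTBF. The function names `annual_failure_rate` and `implied_mtbf` are hypothetical.

```python
# Assumption: drives run continuously, so one year is 24 * 365 = 8,760 hours.
HOURS_PER_YEAR = 24 * 365


def annual_failure_rate(mtbf_hours: float) -> float:
    """Approximate annual failure rate (as a fraction) implied by an MTBF."""
    return HOURS_PER_YEAR / mtbf_hours


def implied_mtbf(annual_rate: float) -> float:
    """Invert the approximation: the MTBF implied by an observed annual rate."""
    return HOURS_PER_YEAR / annual_rate


# Data-sheet MTBF values cited in the study.
for mtbf in (1_000_000, 1_500_000):
    print(f"MTBF {mtbf:,} hours -> annual failure rate {annual_failure_rate(mtbf):.2%}")

# Annual replacement rates observed in the field, per the study.
for rate in (0.02, 0.04, 0.13):
    print(f"Replacement rate {rate:.0%} -> implied MTBF {implied_mtbf(rate):,.0f} hours")
```

Under these assumptions, a 1 million hour MTBF works out to roughly 0.88% a year, matching the study's ceiling, while the observed replacement rates of 2% to 4% would correspond to an MTBF of only a few hundred thousand hours. Because the study counted replacements rather than confirmed failures, the second loop is a rough comparison only.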
Garth Gibson, associate professor of computer science at Carnegie Mellon and co-author of the study, was careful to point out that the study didn't necessarily track actual drive failures, but cases in which a customer decided a drive had failed and needed replacement. He also said he has no vendor-specific failure information, and that his goal is not "choosing the best and the worst vendors" but to help them to improve drive design and testing.
He echoed storage vendors and analysts in pointing out that as many as half of the drives returned to vendors actually work fine and may have failed for any number of reasons, such as a harsh environment at the customer site or intensive, random read/write operations that cause premature wear to the mechanical components in the drive.
Several drive vendors declined to be interviewed. "The conditions that surround true drive failures are complicated and require a detailed failure analysis to determine what the failure mechanisms were," said a spokesperson for Seagate Technology in Scotts Valley, Calif., in an e-mail. "It is important to not only understand the kind of drive being used, but the system or environment in which it was placed and its workload."
"Regarding various reliability rate questions, it's difficult to provide generalities," said a spokesperson for Hitachi Global Storage Technologies in San Jose, in an e-mail. "We work with each of our customers on an individual basis within their specific environments, and the resulting data is confidential."
Keep Records of How Your Drive Performs
Ashish Nadkarni, a principal consultant at GlassHouse Technologies, a storage services provider in Framingham, Mass., said he isn't surprised by the comparatively high replacement rates because of the difference between the "clean room" environment in which vendors test and the heat, dust, noise or vibrations in an actual data center.
He also said he has seen overall drive quality falling over time as the result of price competition in the industry. He urged customers to begin tracking disk drive records "and to make a big noise with the vendor" to force them to review their testing processes.
FC Drives vs. SATA Drives
While a general reputation for increased reliability (as well as higher performance) is one of the reasons FC drives cost as much as four times more per gigabyte than SATA, "We had no evidence that SATA drives are less reliable than the SCSI or Fibre Channel drives," said Gibson. "I am not suggesting the drive vendors misrepresented anything," he said, adding that other variables such as workloads or environmental conditions might account for the similar reliability finding.
Analyst Brian Garrett at the Enterprise Storage Group in Milford, Mass., said he's not surprised because "the things that can go wrong with a drive are mechanical--moving parts, motors, spindles, read-write heads," and these components are usually the same whether they are used in a SCSI or SATA drive. The electronic circuits around the drive and the physical interface are different, but are much less prone to failure.
Vendors do perform higher levels of testing on FC than on SATA drives, he said, but according to the study that extra testing hasn't produced "a measurable difference" in reliability.
Such findings might spur some customers to, for example, buy more SATA drives to provide more backup or more parity drives in a RAID configuration to get the same level of data protection for a lower price. However, Garrett cautioned, SATA continues to be best suited for applications such as backup and archiving of fixed content (such as e-mail or medical imaging) that must be stored for long periods of time but accessed quickly when it is needed. FC will remain the "gold standard" for online applications such as transaction processing, he predicts.
Don't Sweat the Heat?
The Google study examined replacement rates of more than 100,000 serial and parallel ATA drives deployed in Google's own data centers. Similar to the CMU methodology, a drive was considered to have failed if it was replaced as part of a repair procedure (rather than as part of an upgrade to a larger drive).
Perhaps the most surprising finding was no strong correlation between higher operating temperatures and higher failure rates. "That doesn't mean there isn't one," said Luiz Barroso, an engineer at Google and co-author of the paper, but it does suggest "that temperature is only one of many factors affecting the disk lifetime."
Garrett said that rapid changes in temperature--such as when a malfunctioning air conditioner is fixed after a hot weekend and rapidly cools the data center--can also cause drive failures.
The Google study also found that no single parameter, or combination of parameters, produced by the SMART (Self-Monitoring, Analysis and Reporting Technology) feature built into disk drives is actually a good predictor of drive failure.
For customers running anything smaller than the massive data centers operated by Google or a university data center, though, the results might make little difference in their day-to-day operations. For many customers, the price of replacement drives is built into their maintenance contracts, so a drive's expected service life only becomes an issue when the equipment goes off warranty and the customer must decide whether to "try to eke out another year or two" before the drive fails, said Garrett.
The studies won't change how Tom Dugan, director of technical services at Recovery Networks, a Philadelphia-based business continuity services provider, protects his data. "If they told me it was 100,000 hours, I'd still protect it the same way. If they told me it was 5 million hours, I'd still protect it the same way. I have to assume every drive could fail."