Study: Hard Drive Failure Rates Much Higher Than Makers Estimate
Keep Records of How Your Drive Performs
Ashish Nadkarni, a principal consultant at GlassHouse Technologies, a storage services provider in Framingham, Mass., said he isn't surprised by the comparatively high replacement rates because of the difference between the "clean room" environment in which vendors test and the heat, dust, noise or vibrations in an actual data center.
He also said he has seen overall drive quality fall over time as a result of price competition in the industry. He urged customers to begin tracking disk drive records "and to make a big noise with the vendor" to force vendors to review their testing processes.
FC Drives vs. SATA Drives
While a general reputation for increased reliability (as well as higher performance) is one of the reasons FC drives cost as much as four times more per gigabyte than SATA, "We had no evidence that SATA drives are less reliable than the SCSI or Fibre Channel drives," said Gibson. "I am not suggesting the drive vendors misrepresented anything," he said, adding that other variables such as workloads or environmental conditions might account for the similar reliability finding.
Analyst Brian Garrett at the Enterprise Storage Group in Milford, Mass., said he's not surprised because "the things that can go wrong with a drive are mechanical--moving parts, motors, spindles, read-write heads," and these components are usually the same whether they are used in a SCSI or SATA drive. The electronic circuits around the drive and the physical interface are different, but are much less prone to failure.
Vendors do perform higher levels of testing on FC than on SATA drives, he said, but according to the study that extra testing hasn't produced "a measurable difference" in reliability.
Such findings might spur some customers to, for example, buy more SATA drives to provide more backup or more parity drives in a RAID configuration to get the same level of data protection for a lower price. However, Garrett cautioned, SATA continues to be best suited for applications such as backup and archiving of fixed content (such as e-mail or medical imaging) that must be stored for long periods of time but accessed quickly when it is needed. FC will remain the "gold standard" for online applications such as transaction processing, he predicted.
Don't Sweat the Heat?
The Google study examined replacement rates of more than 100,000 serial and parallel ATA drives deployed in Google's own data centers. As in the CMU methodology, a drive was considered to have failed if it was replaced as part of a repair procedure (rather than swapped out for a larger drive as an upgrade).
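For readers who want to apply the same definition to their own drive records, the following is a minimal sketch of an annualized replacement rate calculation. It is not code from either study. The `DriveRecord` type, its field names and the four-drive sample fleet are hypothetical, made up purely for illustration. Only drives replaced as part of a repair count as failures; upgrades do not.

```python
from dataclasses import dataclass


@dataclass
class DriveRecord:
    """One drive's service history (hypothetical record layout)."""
    drive_id: str
    years_in_service: float
    replaced_for_repair: bool  # True only for repair swaps, never for upgrades


def annualized_replacement_rate(records):
    """Return repair replacements per drive-year of service."""
    drive_years = sum(r.years_in_service for r in records)
    if drive_years <= 0:
        raise ValueError("need a positive number of drive-years")
    failures = sum(1 for r in records if r.replaced_for_repair)
    return failures / drive_years


# Made-up sample data, not taken from the Google or CMU studies.
fleet = [
    DriveRecord("a", 2.0, False),
    DriveRecord("b", 1.5, True),   # replaced during a repair: counts
    DriveRecord("c", 0.5, False),  # swapped for a larger drive: does not count
    DriveRecord("d", 1.0, False),
]

rate = annualized_replacement_rate(fleet)
print(f"Annualized replacement rate: {rate:.1%}")
```

Dividing by drive-years rather than by drive count keeps drives that were in service for only part of a year from skewing the figure.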
Perhaps the most surprising finding was the lack of a strong correlation between higher operating temperatures and higher failure rates. "That doesn't mean there isn't one," said Luiz Barroso, an engineer at Google and co-author of the paper, but it does suggest "that temperature is only one of many factors affecting the disk lifetime."
Garrett said that rapid changes in temperature--such as when a malfunctioning air conditioner is fixed after a hot weekend and rapidly cools the data center--can also cause drive failures.
The Google study also found that no single parameter, or combination of parameters, produced by the SMART (Self-Monitoring, Analysis and Reporting Technology) features built into disk drives is actually a good predictor of drive failure.
For customers running anything smaller than the massive data centers operated by Google or a university, though, the results might make little difference in their day-to-day operations. For many customers, the price of replacement drives is built into their maintenance contracts, so a drive's expected service life only becomes an issue when the equipment goes off warranty and the customer must decide whether to "try to eke out another year or two" before the drive fails, said Garrett.
The studies won't change how Tom Dugan, director of technical services at Recovery Networks, a Philadelphia-based business continuity services provider, protects his data. "If they told me it was 100,000 hours, I'd still protect it the same way. If they told me it was 5 million hours I'd still protect it the same way. I have to assume every drive could fail."
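To put Dugan's two figures in perspective, here is a rough sketch that converts an hour rating into an implied annual failure probability. It assumes the hour figures are mean-time-between-failures ratings and that drives fail at a constant rate around the clock (an exponential model). Both are simplifying assumptions made here, not claims from the studies. The function name `afr_from_mtbf` is made up for illustration.

```python
import math

HOURS_PER_YEAR = 8760  # 24 hours x 365 days, assuming continuous operation


def afr_from_mtbf(mtbf_hours):
    """Implied annual failure probability under a constant-failure-rate model."""
    if mtbf_hours <= 0:
        raise ValueError("MTBF must be positive")
    return 1.0 - math.exp(-HOURS_PER_YEAR / mtbf_hours)


for hours in (100_000, 5_000_000):
    print(f"{hours:>9,} hours -> {afr_from_mtbf(hours):.2%} per year")
```

Under these assumptions, a 100,000-hour rating implies a yearly failure probability of roughly 8 percent, and a 5 million-hour rating implies less than 0.2 percent. Neither figure is zero, which is Dugan's point.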