Hard-Drive Failures Surprisingly Frequent

Illustration by Tavis Coburn.
Your hard drive may not be as reliable as manufacturers would like you to think. Recent studies by researchers at Carnegie Mellon and Google suggest that vendor Mean Time To Failure (MTTF) ratings for hard drives are a bit misleading.

The Carnegie Mellon study, conducted at several locations, found typical annual failure rates of 2 to 4 percent and a high of 13 percent, in contrast to the less than 1 percent you'd expect based on vendor MTTF ratings. Google's study pegged the annual failure rate at about 3 percent.
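To see where a figure like "less than 1 percent" comes from, an MTTF rating can be converted into a rough annual failure rate by dividing the hours in a year by the rated hours. The sketch below assumes a rating of 1,000,000 hours and around-the-clock operation; both are illustrative assumptions, not numbers quoted in this article, and the simple division is only an approximation that holds when the MTTF is much longer than a year.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours, assuming the drive runs continuously


def annual_failure_rate(mttf_hours: float) -> float:
    """Approximate fraction of drives expected to fail in one year of 24/7 use."""
    return HOURS_PER_YEAR / mttf_hours


# Assumed rating for illustration only; real data sheets vary by model.
assumed_mttf_hours = 1_000_000
implied = annual_failure_rate(assumed_mttf_hours)
print(f"Implied annual failure rate: {implied:.2%}")  # about 0.88%

# Annual rates reported by the studies, as cited in the article.
for observed in (0.02, 0.03, 0.04, 0.13):
    ratio = observed / implied
    print(f"Observed {observed:.0%} is roughly {ratio:.1f} times the implied rate")
```

Under these assumptions, even the low end of the observed range is more than double what the rating would suggest.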

Studies Challenge Claims: Based on the hard-disk industry's Mean Time To Failure estimates, you'd expect less than 1 percent of hard drives to fail each year. But studies of facilities with many hard drives found significantly higher failure rates.
Both studies were based on observations of approximately 100,000 drives, with Google looking at its own farm of consumer-grade disks and Carnegie Mellon examining both consumer-grade drives and the ostensibly more reliable enterprise variety; the latter have beefed-up actuator magnets, more-robust spindle motors, and advanced features such as rotational vibration safeguards.

Defining Failure

Vendors attribute part of the discrepancy between their ratings and the findings in these reports to differing definitions of disk failure. Vendors consider a drive failed when it fails on one read or write attempt within a set period--typically about 24 hours--on the test bench. They say that, by that criterion, nearly 40 percent of returned drives have not actually failed.

The two new studies, however, consider failure to be any symptom that causes a user--presumably, in both cases, experienced IT types--to replace the drive. Such symptoms include software problems, driver conflicts, and the like, as well as drive failure as defined by vendors.

Also, vendors base MTTF numbers on the past performance of similar drives; no one tries running a new model for years to prove it will last.

Stay Cool

Surprisingly, Google's study found no correlation between drive failure and elevated heat and activity levels. The largest percentage of failures occurred on drives operating within a mild 77-to-88-degree Fahrenheit range. However, desktop PCs typically operate at temperatures well over the maximum of 125 degrees Fahrenheit reported in the Google study, so the findings do not support running hard drives without adequate airflow to cool them.

Google found that failure rates varied significantly according to make and model, but the company declined to identify failure-prone models. Carnegie Mellon points out that bad manufacturing runs occur and that improvements over the past few years may be yielding more-reliable drives.

Google's study relied in part on SMART (Self-Monitoring, Analysis and Reporting Technology) data from drives that have this feature. But so many drives failed without any SMART warnings that Google concluded the feature was not helpful in predicting real-world failure patterns.

Google's findings do support one tip: If you encounter a scan error during a routine error check (by running ScanDisk, for example), your drive is 39 times more likely to fail within 60 days than drives that don't show such errors. IT pros recommend replacing a drive with scan errors.
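A "times more likely" figure of this kind is a ratio of two failure rates: the rate among drives that showed a scan error divided by the rate among drives that did not. The sketch below shows the arithmetic with made-up counts; the function name and every number in it are hypothetical and are not taken from Google's data.

```python
def relative_risk(failed_with: int, total_with: int,
                  failed_without: int, total_without: int) -> float:
    """Ratio of the failure rate in the flagged group to the rate in the unflagged group."""
    rate_with = failed_with / total_with
    rate_without = failed_without / total_without
    return rate_with / rate_without


# Hypothetical counts, invented purely to show the calculation.
risk = relative_risk(
    failed_with=60, total_with=500,          # drives that showed a scan error
    failed_without=80, total_without=20_000  # drives that did not
)
print(f"Drives with scan errors failed about {risk:.0f} times as often")
```

The ratio says nothing about how common failures are overall; a large multiple can still correspond to a modest absolute rate if the baseline is low.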

Fewer Figures

The most likely immediate fallout from these reports is that vendors will stop touting MTTF figures. In my online research, MTTF figures for consumer drives were already few and far between.

Corporate buyers might rethink purchasing plans in light of Carnegie Mellon's finding that Fibre Channel and SCSI drives appear no more reliable than the cheaper SATA variety. But IDC analyst David Reinsel says Fibre Channel and SCSI drives are still worthwhile when performance matters.

For most of us, these reports simply reemphasize the need for smart practices. Keep your drives cool and, most important, backed up so that if failure occurs, it's merely an inconvenience and not a financial or emotional disaster.
