Storage

Be S.M.A.R.T. about your hard drive's health

Given enough time, all hard disks will fail. Thankfully, most drives support a feature that could give users an early warning before the bitter end. Here's how support pros can introduce S.M.A.R.T. reporting to their clients.

---------------------------------------------------------------------------------------------------------------

“What do you mean, my hard drive failed? Hard drives can do that?”

Sadly, I encounter this kind of ignorance a lot as a support pro. Lots of people simply don’t understand that their computer’s storage medium has a limited life span. Maybe extremely durable older machines have influenced their expectations, or maybe no one has ever taken the time to explain to them how the disk mechanism works. Either way, it stinks to have to disappoint users by telling them the machine they rely on has gone belly up.

These days, many consumers relate to their computers as if they were appliances, and I think this contributes to their surprise when a machine’s components begin to fail. When explaining the situation to one of my clients, I usually try to liken computers to automobiles; both are complicated mechanical and electronic systems that can have individual components wear to the point of failure through normal use. Part of the routine maintenance procedures of both cars and computers is making sure worn parts are replaced before they fail completely and catastrophically.

There are a couple of obvious methods for determining when an auto is due for maintenance, and most consumers are familiar with their car’s dash status lights and the concept of the 20,000-mile checkup. Lately, I’ve been teaching my users how to determine when their computer needs maintenance, and I’ve been using the dash light and mileage metaphors to explain the value of using software to check the S.M.A.R.T. data reported by their hard disks. Experienced support pros will be aware of the Self-Monitoring, Analysis, and Reporting Technology built into modern drives, but this information will likely be new to most of your clients.

You should make sure that your users understand that S.M.A.R.T. isn’t all-knowing. Because it gathers past performance data, it can anticipate failure only due to gradual mechanical wear. It’s not suited to predicting accidental damage, for instance. S.M.A.R.T. status is a lot like the Check Engine light in most cars: just because it’s not lit now doesn’t mean something can’t go wrong tomorrow. And just as an activated Check Engine light means you should be planning to visit your mechanic, a S.M.A.R.T. error means you should be making sure that your data backups are sound and that you have a new hard drive on the way.

Once the concepts of drive wear and S.M.A.R.T. are explained, most users won’t have a problem understanding what’s going on. If you decide that you want to start providing your clients the resources to monitor the health of their own disks, there are a ton of software tools you can choose from. This table at Wikipedia offers a nice feature comparison to get you started. From that matrix, I have firsthand experience with HDD Health, SMARTReporter, and the smartmontools. All are quality third-party packages that I have no problem recommending. Your OS of choice may have some built-in utilities for monitoring disk health as well.
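For readers who want to see roughly what a monitoring tool does under the hood, here is a minimal Python sketch built around smartmontools. It assumes the `smartctl` command is installed and on the PATH, that the account running it is allowed to query the drive, and that the report contains one of the two health lines matched below; the function names and the `/dev/sda` device path are illustrative only.

```python
import re
import subprocess


def overall_health(report):
    """Pull the overall health verdict out of `smartctl -H` report text.

    Returns the verdict word (for example "PASSED", "FAILED" or "OK"),
    or None if no recognizable health line is present.
    """
    patterns = (
        # Line typically printed for ATA/SATA drives.
        r"SMART overall-health self-assessment test result:\s*(\w+)",
        # Line typically printed for SCSI/SAS drives.
        r"SMART Health Status:\s*(\w+)",
    )
    for pattern in patterns:
        match = re.search(pattern, report)
        if match:
            return match.group(1)
    return None


def read_health(device):
    """Run `smartctl -H` on a device and return its verdict (or None).

    The exit status is deliberately not checked here, because smartctl
    uses a non-zero status to flag problem drives as well as usage errors.
    """
    result = subprocess.run(
        ["smartctl", "-H", device],
        capture_output=True,
        text=True,
    )
    return overall_health(result.stdout)


if __name__ == "__main__":
    # "/dev/sda" is a placeholder; substitute the drive you want to check.
    print(read_health("/dev/sda"))
```

A verdict of None most likely means the drive did not answer the query or S.M.A.R.T. is not enabled on it, and anything other than a passing verdict is the cue to verify backups and order a replacement.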

9 comments
cquirke

Some hard drive failures can't be predicted by S.M.A.R.T. and are catastrophic; that's a given. However, the S.M.A.R.T. summary is next to useless - waiting for anything other than "OK" is like a doctor waiting for flesh to rot off the bones before signing a death certificate.

So you look at S.M.A.R.T.'s detail counters, using something like the free HD Tune utility. Each attribute row has a "current" and "worst" column value that slowly counts down to zero, powered by an increasing raw data count. Quite a few raw data events can occur before the value counters drop by a single point. The attributes I look at are reallocation events, reallocated sectors, pending sectors, and offline uncorrectable. All of these should have raw values of zero. Nothing else matters as much, though I look at total hours of use and operating temperature too.

There's a blind spot in S.M.A.R.T. that even watching these raw values will miss. Periodically, the raw data values are cleared, and if they haven't escalated far enough to nudge the current value, you could have a continuing stream of significant bad sector events and see nothing, even in the S.M.A.R.T. detail. That fits with what I've seen in practice, e.g. a drive that has no S.M.A.R.T. detail raw value issues, passes surface scan perfectly, then bogs down while virus scanning, shows major bad sectors, and dies soon after. No broken air seals, no physical bumps or rough power.

Hard drive vendors talk up S.M.A.R.T. as a service to consumers, but a few Google searches turn up some smoking guns. A new S.M.A.R.T. standard may drop the requirement to make details visible, so all you'd get is the "OK, not dead yet" pabulum, and in one email, the desired target lead time for S.M.A.R.T. alerts was - wait for it - 24 hours. I defy anyone but the most heavily-resourced server farm to respond to a S.M.A.R.T. alert by evacuating the drive within 24 hours.

See http://cquirke.blogspot.com/2005_09_01_archive.html for a take on S.M.A.R.T., written before I figured out the "counter decrement vs. raw data reset" blind spot.
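The raw-value check described in this comment can be sketched in Python. This is an illustration only: it assumes the attribute table has the column layout that smartmontools' `smartctl -A` prints for ATA drives, that the drive uses the commonly seen attribute IDs 5, 196, 197 and 198 for these four counters (IDs are vendor-defined, so confirm them for your drive), and the sample report is made-up data rather than output from a real disk.

```python
import re

# Commonly used IDs for the four counters named above (an assumption;
# vendors are free to number attributes differently).
WATCHED_IDS = {
    5: "reallocated sectors",
    196: "reallocation events",
    197: "current pending sectors",
    198: "offline uncorrectable",
}

# Made-up sample in the style of a `smartctl -A` attribute table.
SAMPLE_REPORT = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   095   095   000    Old_age   Always       -       4821
194 Temperature_Celsius     0x0022   036   052   000    Old_age   Always       -       36 (0 15 0 0 0)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       8
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       8
"""


def parse_attributes(report):
    """Map attribute ID -> name, current value, worst value and raw count."""
    rows = {}
    for line in report.splitlines():
        fields = line.split()
        # Data rows start with a numeric ID and have at least ten columns.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        raw = re.match(r"\d+", fields[9])
        if raw is None:
            continue
        rows[int(fields[0])] = {
            "name": fields[1],
            "value": int(fields[3]),
            "worst": int(fields[4]),
            "raw": int(raw.group()),
        }
    return rows


def nonzero_watched(rows):
    """Return the watched counters whose raw value is not zero."""
    return {
        label: rows[attr_id]["raw"]
        for attr_id, label in WATCHED_IDS.items()
        if attr_id in rows and rows[attr_id]["raw"] != 0
    }


if __name__ == "__main__":
    for label, raw in nonzero_watched(parse_attributes(SAMPLE_REPORT)).items():
        print(f"WARNING: {label} raw value is {raw}")
```

Because of the reset blind spot described above, a clean result from a check like this is not proof of a healthy drive; logging the raw values over time would at least show when a counter drops back to zero.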

rboring

Check out Gibson Research Corp's SpinRite v 6.0. If you are serious about tracking the health of an HDD, repairing a weak HDD servo track, or recovering data from a sick HDD, SpinRite v 6.0 is pure $89 HDD magic.

bboyd

I'd say if a drive of mine is giving a SMART error, it is headed to the shelf, especially given the huge study Google did of failed HDs. Seeing their research and massive sample size leads me to believe that they may have a better grasp on failure rates than even the manufacturers. http://labs.google.com/papers/disk_failures.pdf "We find, for example, that after their first scan error, drives are 39 times more likely to fail within 60 days than drives with no such errors. First errors in re-allocations, offline reallocations, and probational counts are also strongly correlated to higher failure probabilities. Despite those strong correlations, we find that failure prediction models based on SMART parameters alone are likely to be severely limited in their prediction accuracy, given that a large fraction of our failed drives have shown no SMART error signals whatsoever. This result suggests that SMART models are more useful in predicting trends for large aggregate populations than for individual components. It also suggests that powerful predictive models need to make use of signals beyond those provided by SMART."

williamjones

... its primary focus is drive reconditioning, and it only provides SMART data as a check to make sure that your drive is in decent enough condition for SpinRite to run safely. Also, booting from SpinRite media merely to get a SMART report is more intrusive than some of the other solutions on offer. In fact, I've heard Steve Gibson, SpinRite's creator, raise some of the same warnings about the shortcomings of SMART reporting that Google observed. Thanks for providing the SpinRite shout-out!

mwinter1

Why does HDD Health tell me that the SMART functions on my Dell XPS 420 PC are not turned on? If this is true, then how do I turn them on?

williamjones

If a drive starts exhibiting SMART errors, I'll pull it from service immediately. But just because it's not erroring doesn't mean it can't fail. That's why backups are vital. Thanks for linking to the Google research. That's a great article.

williamjones

...and if that doesn't help, consult your system documentation or manufacturer support.

masungit

back up 100x.... prevention is better than cure