Two weeks ago, I experienced every Windows user’s nightmare: a sudden, serious, data-destroying hard disk crash. It wasn’t catastrophic—I was able to recover all my data from backup tapes—but I wonder whether I could have done a better job of anticipating this crisis before my disk began making horrible grinding sounds. So for this Microsoft Challenge, I asked TechRepublic members to share their secrets for monitoring the health and performance of hard disks. Do you have a set of procedures and third-party tools you use to keep your disks running at peak efficiency?

To make this Challenge a bit more difficult, I imposed two requirements: First, any utilities must be fully compatible with Windows 2000. Second, any tests must be able to run without requiring a reboot. That’s an essential condition, especially for servers, where extensive downtime is usually not an option. It also reflects the reality that most of us will put off even the most important system maintenance chores if the hurdles are too great.

The best response came from TechRepublic member stevel, who has put together an excellent program for protecting his company’s servers: “We use a variety of tools to keep track of disk reliability as well as many other performance issues. All our systems (primarily Compaq servers) use hot-swappable RAID. In the event of a drive failure, you simply yank it and stuff another in, and the controller rebuilds it automatically. All of our equipment is SNMP-compatible, so we can monitor it using a variety of software. We use Compaq’s Insight Manager, as well as the OnePoint product suite (from Mission Critical Software) to monitor a number of important things, from drive performance to processor usage and temperature. This allows us to predict failures in many cases.”
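The monitoring consoles stevel mentions boil down to comparing polled hardware metrics against warning thresholds and flagging anything out of range. Here is a minimal, hypothetical sketch of that idea in Python; the metric names and limits are illustrative inventions, not Compaq's or Mission Critical Software's actual values.

```python
# Illustrative warning limits -- assumed values, not vendor defaults.
THRESHOLDS = {
    "drive_temp_c": 55,
    "cpu_usage_pct": 90,
    "read_errors": 10,
}

def check_metrics(samples: dict) -> list:
    """Return a warning string for each sampled metric over its limit."""
    warnings = []
    for name, value in samples.items():
        limit = THRESHOLDS.get(name)
        if limit is not None and value > limit:
            warnings.append(f"{name}={value} exceeds limit {limit}")
    return warnings

# A drive running hot trips one warning; normal CPU load trips none.
print(check_metrics({"drive_temp_c": 61, "cpu_usage_pct": 40}))
```

A real console does the same loop at scale, with the samples arriving via SNMP polls or traps rather than a dictionary, which is how "a little yellow flag" ends up on an administrator's screen before a drive actually dies.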

TechRepublic member uwe.stutz echoed that recommendation: “On a server, that would be the job of the RAID controller, associated utilities, and the server manufacturer’s system monitoring utilities. Key is remote notification of an ‘event.’” Uwe recommended the tools that come with HP’s Netserver line and added a pointer to third-party tools like CA Unicenter TNG.

That’s fine for servers, where a full-blown RAID implementation and SNMP monitoring make great sense economically. However, the RAID option is way too expensive for the average desktop PC. Fortunately, stevel suggested an option that most users don’t take advantage of: “All our Compaq drives are SMART capable. While this doesn’t always predict failures, it will send a trap to our monitoring software when it feels a failure is possible in the near future. In Insight Manager and OnePoint, this pops up as a little yellow flag suggesting that we replace the drive before failure. Many new hard drives, both SCSI and IDE, support this feature. In most cases, you can turn on SMART in your BIOS, and it will automatically pick up those drives in your system that support it.”
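Under the hood, SMART prediction works roughly like this: the drive tracks a set of attributes, each reported as a normalized value (higher is healthier) alongside a vendor-set threshold, and it raises its "failure predicted" flag when any value falls to or below its threshold. The sketch below models that comparison; the attribute IDs and names are standard SMART attributes, but the sample values and thresholds are made up for illustration.

```python
# Each tuple: (attribute id, name, normalized value, vendor threshold).
# IDs and names are real SMART attributes; the numbers are invented.
attributes = [
    (5,   "Reallocated_Sector_Count", 100, 36),
    (197, "Current_Pending_Sector",    95,  0),
    (1,   "Raw_Read_Error_Rate",       30, 51),  # at/below threshold
]

def failure_predicted(attrs) -> list:
    """Return the names of attributes at or below their threshold."""
    return [name for _id, name, value, thresh in attrs if value <= thresh]

print(failure_predicted(attributes))  # ['Raw_Read_Error_Rate']
```

When that list is nonempty, the drive's SMART status trips, and monitoring software like Insight Manager turns it into the yellow replace-this-drive flag stevel describes.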

That pointer led me on a search for more information about Self-Monitoring, Analysis and Reporting Technology (SMART). I found an excellent explanation of the technology at PC Guide’s Hard Disk Quality and Reliability Features page. SMART, they report, evolved from IBM research into predictive failure analysis. Based on that lead, I headed for IBM’s Web site, where I hit the mother lode—a page devoted to Support and Utilities for Hard Disk Drives. By a happy coincidence, my new drive is an IBM DeskStar, so the Drive Fitness Test and EZ-S.M.A.R.T. utilities available from IBM should be just the ticket. I was also pleased to find small, free utilities that can wipe data from an old hard disk and zap a drive’s master boot record and partition table, making it possible to transfer a drive from one PC to another without hassles or security headaches.
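For the curious, "zapping" a master boot record is conceptually simple: the disk's first 512-byte sector holds the boot code, the 64-byte partition table starting at offset 446, and the 0x55AA boot signature at offset 510. Clearing that region makes the disk look blank to a new PC. The sketch below demonstrates the layout on an in-memory sector only; it deliberately does not touch a real device, and the helper name is my own.

```python
SECTOR_SIZE = 512
PART_TABLE_OFFSET = 446  # partition table, then 0x55AA signature at 510

def zap_mbr(sector: bytes) -> bytes:
    """Return a copy of an MBR sector with its partition table and
    boot signature zeroed out (boot code left intact)."""
    if len(sector) != SECTOR_SIZE:
        raise ValueError("expected one 512-byte sector")
    cleared = bytearray(sector)
    cleared[PART_TABLE_OFFSET:] = bytes(SECTOR_SIZE - PART_TABLE_OFFSET)
    return bytes(cleared)

# Build a fake sector with a valid signature, then wipe it.
mbr = bytearray(SECTOR_SIZE)
mbr[510], mbr[511] = 0x55, 0xAA
wiped = zap_mbr(bytes(mbr))
print(wiped[510:512])  # b'\x00\x00'
```

Wiping the data itself is a separate, longer job (overwriting every sector), which is why the utilities come as a pair.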

Other drive makers have their own SMART utilities, which vary widely in value. If you prefer to use a full-strength, third-party solution, follow the advice of TechRepublic member mike, who argued that I’m being unrealistic to insist on a solution that works under Windows: “To correctly test a disk, the IDE controller, disk access system, and other considerations of the OS accessing the disk through its interface have to be considered. Therefore, there is unlikely to be a ‘no reboot’ utility that does a thorough, deep test for disk integrity, as it must have its own disk access system at a low level to eliminate the unknown factors. Take a look at SpinRite 5 from Gibson Research. It does a very thorough test of disk integrity.”

I’ve been a fan of Steve Gibson’s work for years. On a cost-per-kilobyte basis, it’s not cheap: A single license costs $89 for a program that downloads in a mere 97KB. But its many users rave about its ability to recover data from even badly damaged hard disks, and having used previous versions, I can attest to its capabilities. It’s going in my toolbox for the next time I have a disk disaster.

Here’s Ed’s new Challenge
Microsoft just released Beta 1 of Whistler, its next Windows version. This sweeping operating system update includes options for two different interfaces—a newly designed front end intended for nontechnical users and the “classic” Windows 2000 interface. Surprisingly, Microsoft’s product managers say they’ll be redesigning the interface up until the last weeks of development. That’s an opportunity for people like you and me to make a difference. So register your opinions here: If you were in charge of overhauling the Windows desktop, what would you do first? Which features would you ditch? Which would you redesign? Which would you keep? Believe it or not, this is a serious opportunity to make a difference in the next Windows version. I’ll summarize the best responses in my next column, and I’ll also pass them along to the folks at Microsoft who are redoing the Windows UI. Click here to tackle this week’s Microsoft Challenge. I’ve got 2,000 TechPoints to hand out for the best responses, and maybe, just maybe, we can collectively make a difference in the next version of Windows.