No, ChkDsk /R is not a "fix"
You need to be more proactive when looking for early hard drive failure, if you want to avoid data loss.

If you have episodic slowdowns where keystrokes are delayed and the mouse pointer stops moving, you should suspect hard drive retries on failing sectors. Generally, this pattern of slowdown results when code is delayed within a section where interrupts are disabled, which is usually only within tight and critical bits of driver code.

When the hard drive gets a checksum mismatch, it will retry the operation until it works, or fail after X retries. If there are "too many" failures, it will mark that sector as bad, and map its address to a spare one; if this is during a read operation, it should try to relocate the contents from the bad sector, but this may fail. All of this happens within the drive itself, and is invisible to the rest of the PC.
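
To make that retry-then-remap behaviour concrete, here is a rough Python sketch. It is only a toy model of what I described, not real firmware logic: the function name, the retry limit of 5, the `read_once` callback and the spare-sector list are all made up for illustration, and it simplifies "too many failures" to "every retry on this read failed".

```python
def firmware_read(lba, read_once, remap, spares, max_retries=5):
    """Toy model of one drive-internal read: retry, then remap on failure.

    read_once(physical) -> bool stands in for a checksum-verified read.
    remap maps a logical address to a spare sector; spares lists unused
    spare sectors. All names and the retry limit are hypothetical.
    """
    physical = remap.get(lba, lba)
    for attempt in range(1, max_retries + 1):
        if read_once(physical):
            # Data recovered; the PC never learns how many tries it took.
            return True, attempt
    # Every retry failed: mark the sector bad and map its address to a spare.
    # The old contents are lost in this model, as relocation "may fail".
    if spares:
        remap[lba] = spares.pop(0)
    return False, max_retries
```

Note that the caller only ever sees the success flag; the attempt count and the remap table stay inside the "drive", which is why none of this shows up on the PC side.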

The OS sees a read failure only if the above mechanisms have failed. If the file system is NTFS, it will retry the operation in much the same way the drive's firmware did, though this time the results may be visible in the depths of the file system. Each OS retry will likely spawn X retry attempts within the firmware, and could beat a sick drive to death; this is how those seconds-long stuck-mouse pauses arise.
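
The arithmetic behind those pauses is just multiplication. A small Python sketch, with the caveat that the attempt counts and the per-attempt time below are invented numbers for illustration, not measured or documented values:

```python
def nested_attempts(os_attempts, firmware_attempts, ms_per_attempt):
    """Worst case for one bad sector when OS retries wrap firmware retries.

    Returns (total physical read attempts, total stall in milliseconds).
    All three inputs are hypothetical figures supplied by the caller.
    """
    total = os_attempts * firmware_attempts
    return total, total * ms_per_attempt


# Example with made-up figures: 5 OS-level attempts, each triggering
# 20 firmware attempts that take 50 ms apiece.
attempts, stall_ms = nested_attempts(5, 20, 50)
```

With those made-up figures, one bad sector costs 100 physical attempts and a 5-second stall, and a ChkDsk /R or format pass repeats that for every bad sector it hits.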

When you format a volume, the same process occurs except presumably without attempts to preserve the disk contents. At least the number of OS-visible bad clusters is reported.

When you do a ChkDsk /R, it's the same sort of thing; nested retries, defects hidden if the drive's firmware "fixes" the problem, etc.

You can see into the firmware's activities via a SMART reporting tool. If enabled in BIOS, POST can tell you if SMART is "bad"; the OS seems to have no awareness of SMART at all.

SMART tolerates a LOT of defects before it flags a value as "bad", so don't wait for a "bad" SMART status! The critical SMART attributes to watch are Reallocated Sectors, Reallocation Events, Pending Sectors and Offline Uncorrectable. Look at the raw data counters for these, which should all be zero, not just the Value or Worst columns, and least of all the Status column.
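
If your SMART tool can dump its attribute table as text, checking the raw counters can be scripted. The Python sketch below assumes the ten-column table layout that smartmontools' `smartctl -A` prints, and that tool's names for the four attributes; other tools and some drive vendors label them differently, so treat the names as an assumption. The sample table is made up purely to show the format.

```python
# Names as smartmontools usually reports the four attributes (an assumption;
# other tools and vendors may label them differently).
WATCHED = {
    "Reallocated_Sector_Ct",
    "Reallocated_Event_Count",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
}


def nonzero_raw_counters(report_text):
    """Return {attribute_name: raw_value} for watched attributes with raw > 0."""
    found = {}
    for line in report_text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        name, raw = fields[1], fields[9]
        if name in WATCHED and raw.isdigit() and int(raw) > 0:
            found[name] = int(raw)
    return found


# Made-up sample in the assumed layout, for illustration only.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       12
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       3
"""

problems = nonzero_raw_counters(SAMPLE)
```

In the sample, Value and Worst are still a perfect 100 on every row, yet two of the raw counters are non-zero; that is exactly the case where Status says "OK" and you should already be backing up.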

Here's how each SMART attribute is reported and logged. Raw events are counted up from zero in the raw Data column, and if these reach a certain number, the Value is reduced by one. Periodically, both raw Data and Value counters are reset, but the lowest-ever Value is retained in Worst. If Value or Worst ever reach Threshold, the Status then changes from "OK" to "Bad".

But if the counters are reset before Data ever causes Value to be reduced, everything stays "OK" forever. As it is, several loops of raw Data and steps downwards of Value mean thousands of flaws are considered "OK", as far as Status is concerned.
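
Here is that logging scheme as a toy Python model, so you can see how many events hide behind an "OK". It follows my description above rather than any vendor's actual firmware, and the class name, the starting Value of 100 and the step size are all assumptions picked for illustration.

```python
class SmartAttribute:
    """Toy model of one attribute's Data / Value / Worst / Threshold logic."""

    def __init__(self, threshold, events_per_step, start_value=100):
        self.threshold = threshold
        self.events_per_step = events_per_step  # raw events per Value step
        self.start_value = start_value
        self.raw = 0
        self.value = start_value
        self.worst = start_value

    def record_event(self):
        """Count one raw event; step Value down when enough have piled up."""
        self.raw += 1
        if self.raw % self.events_per_step == 0:
            self.value -= 1
            self.worst = min(self.worst, self.value)

    def periodic_reset(self):
        """Reset raw Data and Value; the lowest-ever Value stays in Worst."""
        self.raw = 0
        self.value = self.start_value

    @property
    def status(self):
        hit = min(self.value, self.worst) <= self.threshold
        return "Bad" if hit else "OK"
```

With an assumed Threshold of 36 and 50 events per step, this model takes 3,200 raw events to flip Status to "Bad"; 3,199 still reads "OK". If the counters are reset every 49 events, Value never moves and Status stays "OK" no matter how long that goes on.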

So... don't wait for bad drives to find you, and don't use ChkDsk /R to paper over the cracks. Use a SMART reporting tool to look into the raw data details, and act on what you see - first, file-copy the crucials, then file-copy all files, then image the C: partition, then run surface scan diagnostics.

A failing drive may die within an hour, and you don't want to be left only with diagnostic reports, or an unusable part of a failed partition image - that is why I'd do those steps in that particular order.
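
For the first two steps, a plain prioritised file copy that skips unreadable files rather than retrying them is the idea. A minimal Python sketch, assuming you pass the folders in order of importance; the function name and the folder layout under the destination are my own choices for illustration, and imaging and surface scans still need their own tools.

```python
import shutil
from pathlib import Path


def rescue_copy(priority_dirs, dest_root):
    """Copy directories in the given order, skipping unreadable files.

    Returns a list of (path, error) pairs for files that could not be copied.
    """
    failures = []
    dest_root = Path(dest_root)
    for src_dir in map(Path, priority_dirs):
        for src in sorted(src_dir.rglob("*")):
            if not src.is_file():
                continue
            # Mirror each source folder under its own name in the destination.
            target = dest_root / src_dir.name / src.relative_to(src_dir)
            try:
                target.parent.mkdir(parents=True, exist_ok=True)
                shutil.copy2(src, target)
            except OSError as exc:
                # No retry here: every extra attempt hammers the sick drive.
                failures.append((src, exc))
    return failures
```

Put the irreplaceable folders first in `priority_dirs`, so that if the drive dies partway through, what you already have is what matters most. The returned list tells you which files to chase later with an image or recovery tool.
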
Posted by cquirke
24th Jan 2012