Predicting when hard disks will fail is an inexact science, but a researcher from the University of Nebraska may be close to making it more accurate.
Enterprise storage arrays, whether arranged in SAN, NAS, or object configurations, can contain dozens or hundreds of individual hard disks. More precise insight into which disk may soon fail, even if it’s just an improvement of a few percent, could mean the difference between proactively keeping multiple terabytes of data online or reactively retrieving it from backups.
Junjie Qian, a doctoral student in computer science, said his approach to predicting soon-to-fail disks works by combining multiple machine-learning algorithms. Existing approaches use only one model, which works but produces too many false positives, he said.
Qian, whose work is sponsored by NetApp, presented his research in Boston last month at the IEEE Conference on Networking, Architecture, and Storage.
“To the best of our knowledge, industry and academic solutions to date can only achieve either high prediction rate or low false alarm rate, but not both,” Qian wrote in his paper, P3: Priority-based proactive prediction for soon-to-fail disks. His method cuts false positives at the cost of a slightly lower prediction rate, but it is designed to offset that loss by running hourly. Repeated checks would still catch soon-to-fail disks before they are too far gone, he explained. “Our proposed solution is a software function that resides either in the host operating system or in the disk array controller,” he continued.
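The trade-off Qian describes can be illustrated with a toy example. The sketch below is not the P3 algorithm, whose internals are not covered here. It assumes a simple voting scheme in which a disk is flagged only when several models agree, and it re-checks every hour. The three "models", their attribute names, and their thresholds are all invented for illustration; real ones would be trained classifiers.

```python
from typing import Callable, Dict, List

Reading = Dict[str, float]         # one hourly snapshot of a disk's health attributes
Model = Callable[[Reading], bool]  # True if the model predicts the disk will fail soon

# Hypothetical stand-in models: attribute names and thresholds are made up.
def reallocated_model(r: Reading) -> bool:
    return r["reallocated_sectors"] > 50

def pending_model(r: Reading) -> bool:
    return r["pending_sectors"] > 10

def error_rate_model(r: Reading) -> bool:
    return r["read_errors_per_hour"] > 5.0

MODELS: List[Model] = [reallocated_model, pending_model, error_rate_model]

def flag_disk(reading: Reading, min_votes: int = 2) -> bool:
    """Flag a disk only when at least min_votes models agree."""
    votes = sum(1 for model in MODELS if model(reading))
    return votes >= min_votes

def first_alarm_hour(hourly_readings: List[Reading], min_votes: int = 2) -> int:
    """Return the index of the first hourly snapshot that raises an alarm, or -1."""
    for hour, reading in enumerate(hourly_readings):
        if flag_disk(reading, min_votes=min_votes):
            return hour
    return -1

# A disk that trips one model but is otherwise fine.
noisy_but_healthy = [
    {"reallocated_sectors": 60, "pending_sectors": 0, "read_errors_per_hour": 0.1},
] * 3

# A disk that degrades over three hourly checks.
degrading = [
    {"reallocated_sectors": 40, "pending_sectors": 2, "read_errors_per_hour": 1.0},
    {"reallocated_sectors": 55, "pending_sectors": 8, "read_errors_per_hour": 2.0},
    {"reallocated_sectors": 70, "pending_sectors": 14, "read_errors_per_hour": 3.0},
]

if __name__ == "__main__":
    print(first_alarm_hour(noisy_but_healthy, min_votes=1))  # single model: alarm
    print(first_alarm_hour(noisy_but_healthy))               # voting: no alarm
    print(first_alarm_hour(degrading, min_votes=1))          # single model: earlier alarm
    print(first_alarm_hour(degrading))                       # voting: alarm one check later
```

In this toy data, requiring two votes suppresses the false alarm on the healthy disk, while the degrading disk is flagged one hourly check later than a single model would flag it. That is the kind of delay frequent checks are meant to keep small.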
Qian told TechRepublic that more research is needed before his approach can be used in commercial products. It must be tested and validated on larger data sets, and it remains to be determined whether more data needs to be collected at the modeling input stage, he explained.
Hard disk manufacturers could help by having more consistency in the data they expose to storage arrays and storage management software, Qian added. “We think one feature [that] if provided can help the prediction is how much data has been read/written to the disk. Each disk vendor and disk model all have variations which impacts the data collected. Another helpful item for the prediction of soon-to-fail disks from the disk manufacturers would be preliminary models or data sets for the disks, to reduce the time to build the model before applying the models to the prediction.”
Commercial disk failure prediction models base their decisions on data called SMART (Self-Monitoring, Analysis and Reporting Technology), an industry standard first written in the 1990s. The standard is overseen by a global committee called T13, whose members are aware of general concerns about the type of variations that Qian cited.
Toshiba’s Daniel Colegrove, director of technology industry standards, chairs T13. Storage companies or even customers can make their own SMART parameters, but those were never standardized, he noted. Instead of relying solely on SMART, the committee members — essentially all the major hard disk and component manufacturers — would like to see customers more often use device statistic logs, he said.
T13 is open to making disk failure prediction a part of its standards, Colegrove said. “There’s constant research on how to do better prediction,” he said. However, “There are currently no new [standards] proposals on general failure predictions.”
Whether Qian is the person to bring that idea forward remains to be seen. Asked about his next steps, he said, “I am interested [in building] some reliability maintenance system for storage systems, which integrates both the prediction and follow-up solutions.”
Note: TechRepublic and ZDNet are CBS Interactive properties.