Enterprise storage systems including Microsoft Azure are a year or two away from having the equivalent of run-flat tires to keep damaged hard disks in production.
The idea is called logical depopulating and goes by “depop” in a system command set. It is being proposed by industry standards committees and would allow more efficient maintenance schedules in large-scale data centers, because a hard drive could keep most of its capacity available until its turn to be replaced comes along.
“Drives fail. They’re reliable, but they are mechanical things. The question is how do you manage that gracefully,” explained Joe Breher, a storage architect in Longmont, Colorado, who is leading the standards effort. “We’re seeing in a large percentage of cases in drive failures, when they do finally fail, the failure is limited to a single head in the device. What can we do to take advantage of that fact and extend the service life?”
Western Digital’s Hitachi Global Storage Technologies, where Breher works, and Seagate Technology are both already shipping proprietary versions of offline logical depop mechanisms. A host device recognizes a bad drive, reformats it, and puts the good sections back into service at reduced capacity.
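The offline flow described above can be pictured with a toy model. The Python sketch below is purely illustrative: the `Drive` class, the `offline_depop` function, and the assumption that capacity is split evenly across heads are all hypothetical, and none of it reflects the vendors' proprietary mechanisms or the proposed command set.

```python
from dataclasses import dataclass, field


@dataclass
class Drive:
    """Toy drive model. Assumes capacity is split evenly across heads."""
    heads: int
    bytes_per_head: int
    failed_heads: set = field(default_factory=set)
    stored_objects: list = field(default_factory=list)

    def usable_capacity(self) -> int:
        # Capacity contributed by the heads that are still in service.
        return (self.heads - len(self.failed_heads)) * self.bytes_per_head


def offline_depop(drive: Drive, bad_head: int) -> int:
    """Retire one head, reformat, and return the reduced capacity.

    Offline depop is modeled as destructive: the reformat discards
    whatever the drive held, so the host must restore data from elsewhere.
    """
    if not 0 <= bad_head < drive.heads:
        raise ValueError(f"no such head: {bad_head}")
    drive.failed_heads.add(bad_head)
    drive.stored_objects.clear()  # the reformat wipes existing data
    return drive.usable_capacity()


# Hypothetical 8-head drive with 1 TB per head.
drive = Drive(heads=8, bytes_per_head=10**12, stored_objects=["blob-a", "blob-b"])
new_capacity = offline_depop(drive, bad_head=3)
```

In this sketch the drive returns to service with seven-eighths of its original capacity and no data, which is the trade-off that separates the offline approach from the online one.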
This is possible because recent models of top-end enterprise drives are sealed, so full-drive failures due to particle contamination are becoming less common. Increasingly, when internal components in newer models fail, they don’t set off a chain reaction through the rest of the drive.
Breher said an even better approach would be online logical depop, in which most of a drive’s capacity could still be used without losing its data. The method is still in development and would work because host devices would understand drive failures down to the logical block address layer. They could disconnect the relevant LBAs and keep the rest of a drive spinning. The approach could theoretically be applied to solid-state drives too, he said.
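As a rough illustration of the online idea, the Python sketch below retires a range of logical block addresses while leaving every other block readable. It assumes a simplified in-memory model; the `DepopDrive` class and its methods are hypothetical names, not part of any proposed standard, and the method Breher describes is still in development.

```python
class DepopDrive:
    """Toy model of a drive that can retire LBA ranges while online."""

    def __init__(self, total_lbas: int):
        self.total_lbas = total_lbas
        self.retired = []   # sorted, non-overlapping half-open (start, end) ranges
        self.blocks = {}    # lba -> data, standing in for the platters

    def retire(self, start: int, end: int) -> None:
        """Take LBAs in [start, end) out of service, merging overlapping ranges."""
        if not 0 <= start < end <= self.total_lbas:
            raise ValueError("range outside the drive")
        kept = []
        for s, e in self.retired:
            if e < start or s > end:
                kept.append((s, e))
            else:
                start, end = min(s, start), max(e, end)
        kept.append((start, end))
        self.retired = sorted(kept)
        # Only data inside the retired range is lost.
        for lba in [l for l in self.blocks if start <= l < end]:
            del self.blocks[lba]

    def is_available(self, lba: int) -> bool:
        in_range = 0 <= lba < self.total_lbas
        return in_range and not any(s <= lba < e for s, e in self.retired)

    def available_lbas(self) -> int:
        return self.total_lbas - sum(e - s for s, e in self.retired)

    def write(self, lba: int, data: bytes) -> None:
        if not self.is_available(lba):
            raise IOError(f"LBA {lba} is not in service")
        self.blocks[lba] = data

    def read(self, lba: int):
        if not self.is_available(lba):
            raise IOError(f"LBA {lba} is not in service")
        return self.blocks.get(lba)


d = DepopDrive(total_lbas=1000)
d.write(10, b"kept")
d.write(500, b"lost")
d.retire(400, 600)  # e.g. the blocks served by a failed head
```

Unlike a reformat, retiring the range leaves the block at LBA 10 intact, which is why a file system would need to understand that a drive can be partially available.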
As the standards process moves forward, which could take 12-18 months, another challenge is tweaking the world’s most-used file systems to recognize the concept of a drive being partially available. “We’ve got 60 years of file system development work that’s built around the concept that a drive is a drive, and it’s got a capacity, and that’s it,” Breher noted. In that model, drives are either available or they’re not. Initial feedback from the Linux community has been “not very keen,” so the standards group is brainstorming ways to improve the idea, he said.
Changes to how Microsoft Windows file systems work would be a slow process, but the Microsoft Azure cloud team is already planning for the offline version of logical depop.
“You end up with so much capacity offline, especially at the end-of-life of a cluster, that you lose a lot of value,” said Aaron Ogus, manager of the Microsoft Azure storage hardware team, in Redmond, Washington. “What is likely to happen with depop is this. In the first generation for Microsoft Azure storage we will do offline depop. In a future iteration we will work with the file system team,” he said.
Ogus said that once the standards are sufficiently mature, his team would probably need 6-12 months of development and implementation time to put offline logical depopulating into Microsoft Azure. “I have frequent meetings with [Breher’s standards committees] and I will push them to be more aggressive,” he added. “The place where I think I get the most bang for my buck is the offline. Online for me is just optimization.”
Some experts are more cautious about the depopulation concept. Depop’s usefulness would depend on how long a drive could remain healthy enough for production, said a chief technology officer who asked not to be identified because he co-founded a prominent storage company. “If it’s a narrow distribution, it doesn’t make a whole lot of difference,” he said.