Google collects a vast amount of sensor output from its data centers in order to calculate Power Usage Effectiveness (PUE). Besides data directly related to PUE, Google gathers other data that seems logical to aggregate, such as current weather conditions and the efficiencies of cooling towers, water pumps, and heat exchangers. For the most part, that extra data goes unused because the PUE equation does not require it (Figure A).

That changed when Jim Gao, a mechanical engineer on Google’s data-center team, became interested in whether anything more could be done with the sensor data.
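For readers who want the arithmetic behind PUE, it is the ratio of total facility energy to the energy consumed by IT equipment alone, so a value of 1.0 would mean zero overhead. The snippet below is a minimal sketch of that ratio; the function name and the meter readings are hypothetical and are not Google data.

```python
def pue(total_facility_kwh: float, it_equipment_kwh: float) -> float:
    """Return Power Usage Effectiveness: total facility energy / IT energy."""
    if it_equipment_kwh <= 0:
        raise ValueError("IT equipment energy must be positive")
    return total_facility_kwh / it_equipment_kwh


# Hypothetical meter readings for one day, in kWh (illustrative only).
it_kwh = 10_000.0
overhead_kwh = 1_200.0  # cooling, power-distribution losses, lighting, etc.

print(round(pue(it_kwh + overhead_kwh, it_kwh), 3))  # 1.12
```

Only two quantities enter the ratio, which is why readings such as weather or pump efficiency play no direct part in it.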

Machine learning

Already an expert in computational fluid dynamics, Gao decided to broaden his expertise by taking classes at Stanford on advanced machine learning (i.e., the study of systems that can learn from data). In hindsight, it appears to have been an attempt to find a better (non-static) way to track a data center’s energy efficiency.

Gao’s adding machine learning to his repertoire was not unexpected. Google is well-known for its expertise in the field of artificial intelligence, and Google Search relies heavily on machine learning to provide real-time results even though the data is in constant flux — not unlike the sensor data being collected. Joe Kava, VP, Data Centers at Google, explains in a blog post why machine learning was a good choice:

“What Jim designed works a lot like other examples of machine learning, like speech recognition: a computer analyzes large amounts of data to recognize patterns and ‘learn’ from them. In a dynamic environment like a data center, it can be difficult for humans to see how all of the variables — IT load, outside air temperature, etc. — interact with each other.”
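To make the idea of “learning” a pattern from sensor data concrete, here is a deliberately tiny sketch. It fits a straight line relating outside air temperature to PUE using ordinary least squares. This is a stand-in technique chosen for brevity, not a description of Gao’s models, and the readings are synthetic values invented for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Synthetic history (assumption): outside air temperature in Celsius
# and the PUE observed at that temperature.
outside_temp_c = [5.0, 10.0, 15.0, 20.0, 25.0, 30.0]
observed_pue = [1.09, 1.10, 1.12, 1.13, 1.15, 1.16]

intercept, slope = fit_line(outside_temp_c, observed_pue)


def predict_pue(temp_c):
    """Predict PUE for a temperature using the fitted line."""
    return intercept + slope * temp_c


print(f"predicted PUE at 18 C: {predict_pue(18.0):.3f}")
```

A real data center has many interacting inputs (IT load, outside air temperature, and so on, as Kava notes), which is where a single-variable line stops being adequate and richer models become attractive.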

Accurate predictability

Both Kava and Gao recognized machine learning’s strength: it can see intricacies in a complex mechanism that humans are unable to decipher or even notice. In the same blog post, Kava stated that Gao has seen some success: his models predict PUE to within 0.4% of the measured value. In Figure B, Gao’s model is the blue line, and the PUE calculated from measurements is in red.

Figure B
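The article does not say how the 0.4% figure was computed. One plausible reading is an average relative error between predicted and measured PUE, which can be sketched as follows. The function name and both series are hypothetical, chosen only to show the calculation.

```python
def mean_abs_pct_error(predicted, measured):
    """Mean absolute error of predictions, as a percentage of measured values."""
    if len(predicted) != len(measured) or not measured:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(p - m) / m for p, m in zip(predicted, measured))
    return 100.0 * total / len(measured)


# Hypothetical series (not Google data): measured PUE and model predictions.
measured = [1.10, 1.12, 1.11, 1.13]
predicted = [1.104, 1.117, 1.113, 1.126]

error_pct = mean_abs_pct_error(predicted, measured)
print(f"mean absolute error: {error_pct:.2f}%")
```

With these made-up numbers the error comes out a little above 0.3%, which would fall inside a 0.4% bound.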

The “Holy Grail” of data-center management

The biggest complaint about PUE is that it is a reactive (i.e., after-the-fact) measurement. A trustworthy model that predicts results before changes are made would answer that complaint, and it appears Gao has found that particular “Holy Grail” of data-center management.

Kava said, “A couple months ago, we had to take some servers offline for a few days — which would normally make that data center less energy efficient. But we were able to use Jim’s models to change our cooling setup temporarily — reducing the impact of the change on our PUE for that time period.”

Because Google and other big data-center operators care about changes in PUE as small as 0.01%, this proactive ability is hugely important.
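The proactive use Kava describes amounts to asking a model “what if?” before touching the plant: score each candidate cooling configuration with the predictor and keep the one with the lowest predicted PUE. The sketch below assumes a single hypothetical knob (a chilled-water setpoint) and a toy prediction curve. Neither comes from Google; a trained model would take the place of `toy_predict_pue`.

```python
def pick_best_setting(candidates, predict_pue):
    """Return the candidate setting with the lowest predicted PUE."""
    return min(candidates, key=predict_pue)


def toy_predict_pue(chilled_water_setpoint_c):
    """Toy stand-in for a trained model (assumption): a curve with its
    minimum at 9 C. A real predictor would use many sensor inputs."""
    return 1.10 + 0.002 * (chilled_water_setpoint_c - 9.0) ** 2


# Hypothetical setpoints (Celsius) an operator might consider.
candidates = [6.0, 7.0, 8.0, 9.0, 10.0, 11.0]

best = pick_best_setting(candidates, toy_predict_pue)
print(f"lowest predicted PUE at setpoint {best} C")
```

Because the evaluation happens in software, operators can compare options without running risky experiments on live equipment.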

Final thoughts

Gao wrote a paper in which he explains much of the detailed analysis he went through to build his models (PDF). He ended the paper with this conclusion:

“Actual testing on Google DCs indicate that machine learning is an effective method of using existing sensor data to model DC energy efficiency, and can yield significant cost savings.”

Gao’s machine-learning tool will also help architects and engineers working on new data-center designs, allowing them to step outside the safe, already tested design box.