The next time something goes wrong inside your company, don't be so quick to play the blame game.
Google revealed yesterday that the secret of keeping its cloud services available 99.978% of the time is not pointing the finger after an infrastructure failure.
Phil Beevers, who runs Google's Site Reliability Engineering in London, described the importance of holding what Google calls a "blameless post-mortem" after an outage.
"A post-mortem is something that we do after any sort of failure or outage. In doing that, what we're trying to do is to understand the root cause of the problem, so we can stop it happening again," he said.
"But I can't stress enough that the idea of this is that it's blameless. The philosophy that we've got is that it's the processes that fail, not the people that fail."
The reason for not attributing blame isn't to protect morale among Google engineers; it's to allow staff to give their honest, expert opinion without fear of repercussions.
"We need that blameless culture so we generally understand, with no fear of consequences for any of our careers, that mistakes are going to happen.
"It has to be entirely open, so we can get to the root cause and actually fix the right issues," he said, adding that SREs also look for new ways to identify and mitigate similar failures in future.
Like all of the big cloud providers, Google has had several short outages in its cloud services over the past year, with various claims and statistics thrown around about which of the major providers has the most reliable platform.
The willingness of successful businesses to tolerate failure, particularly where every effort was made to get a product or service to succeed, is a philosophy that dates back a long way.
Former IBM CEO Thomas Watson Senior, who oversaw the growth of the company into one of the largest technology firms in the world, said: "The fastest way to succeed is to double your failure rate."
Similarly, NASA's Apollo program, which took men to the moon more than 40 years ago, was characterized by a willingness to learn from the generally small failures that occurred during each mission.
In a sense, depersonalizing and learning from failures is an extension of the data-driven approach that Google is renowned for, with the firm famously testing 41 different shades of blue for the links on its pages.
"Google is an incredibly data-driven company," says Beevers, adding that it uses "service-level objectives" to "empirically measure how available and how reliable our services are".
"The idea of using the data is to take the emotion out of the decisions. To make it no longer a human confrontation between operations folks who want reliability and development folks who want better features.
"Instead we make it purely a data-driven decision."
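To make the idea concrete, here is a minimal sketch of how an availability figure can be turned into a yes-or-no decision. It is not Google's tooling: the function names, the request counts and the rule "ship features only while measured availability meets the objective" are assumptions made for illustration. The only number taken from the article is the 99.978% availability figure.

```python
# Illustrative sketch only: names and the decision rule are assumptions,
# not Google's actual SRE tooling.

MINUTES_PER_YEAR = 365 * 24 * 60  # ignores leap years


def allowed_downtime_minutes(availability_pct: float,
                             period_minutes: int = MINUTES_PER_YEAR) -> float:
    """Downtime, in minutes, implied by an availability percentage."""
    return period_minutes * (1 - availability_pct / 100)


def measured_availability(good_requests: int, total_requests: int) -> float:
    """Availability as the percentage of requests served successfully."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return 100 * good_requests / total_requests


def can_ship_features(measured_pct: float, slo_pct: float) -> bool:
    """Hypothetical rule: release new features only while the SLO is met."""
    return measured_pct >= slo_pct


if __name__ == "__main__":
    slo = 99.978  # availability figure quoted in the article
    print(f"Downtime allowed per year: {allowed_downtime_minutes(slo):.1f} minutes")

    # Made-up request counts for two hypothetical measurement windows.
    for good, total in [(999_800, 1_000_000), (999_700, 1_000_000)]:
        pct = measured_availability(good, total)
        print(f"{pct:.3f}% measured, ship features: {can_ship_features(pct, slo)}")
```

At 99.978% availability, the arithmetic works out to roughly 116 minutes of downtime a year, or just under two hours. In the sketch, the argument between operations and development reduces to comparing one measured number against one agreed target.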