
The book Site Reliability Engineering helps readers understand how some Googlers think: It contains the ideas of more than 125 authors. The four editors, Betsy Beyer, Chris Jones, Jennifer Petoff, and Niall Richard Murphy, managed to weave all of the different perspectives into a unified work that conveys a coherent approach to managing distributed production systems.
Site Reliability Engineering delivers 34 chapters–totaling more than 500 printed pages from O’Reilly Media–that encompass the principles and practices that keep Google’s production systems working. The entire book is available online at https://landing.google.com/sre/book.html, along with links to other talks, interviews, publications, and events.
Most IT operations professionals will find the topics covered familiar: Risk management, outage tracking, load balancing, product launches, troubleshooting, communication, and more. At Google, the Site Reliability Engineer (SRE) position puts a software engineer on an operations team. (Many aspects of an SRE’s work are similar to a DevOps role in other organizations.) The book uses a hypothetical service–Shakespeare search as a service–to show how SREs work with various systems.
The following five ideas are just a small sample of the range of topics covered in the book.
1. 100% reliability is almost never the goal
In Embracing Risk, Marc Alvidrez emphasizes that the reliability of a service needs to be determined based on user needs and product manager goals, balanced against cost. So something less than 100% availability might be desirable. For example, when Google acquired YouTube in 2006, the product was still evolving rapidly, so a lower availability target (i.e., an increased acceptance of risk of unavailability) would allow for more features to be added faster. In contrast, G Suite reliability targets might be “set to an external quarterly availability target of 99.9%,” with internal targets set even higher. Mark Roth then elaborates on how product and SRE teams work with what they call “error budgets.”
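To make the 99.9% figure concrete, here is a minimal sketch of the arithmetic behind an error budget. It assumes a 90-day quarter and treats availability purely as uptime over wall-clock time, which is a simplification. The function name and parameters are illustrative and are not taken from the book.

```python
def error_budget_minutes(availability_target: float, window_days: int = 90) -> float:
    """Return the minutes of unavailability a target allows within the window.

    Assumptions (not from the book): a 90-day quarter by default, and
    availability measured as uptime over wall-clock time.
    """
    if not 0.0 <= availability_target <= 1.0:
        raise ValueError("availability_target must be between 0 and 1")
    total_minutes = window_days * 24 * 60
    # The budget is whatever fraction of the window the target leaves uncovered.
    return (1.0 - availability_target) * total_minutes


if __name__ == "__main__":
    budget = error_budget_minutes(0.999)
    print(f"A 99.9% quarterly target allows about {budget:.1f} minutes of downtime")
```

Under these assumptions, a 99.9% target leaves roughly two hours of unavailability per quarter. A tighter internal target shrinks that budget, and a looser target for a fast-changing product enlarges it.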
2. Automate to reduce toil
Vivek Rau provides a specific definition of toil: Work that is “manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows.” The chapter, Eliminating Toil, goes on to define each of these terms in detail. Quarterly surveys show that SREs spend about one-third of their time on tasks defined as toil. The Evolution of Automation at Google elaborates on various ways Google automated tasks over time, from automating MySQL failover tasks to reducing the time needed to turn up a new cluster.
3. Offline backups work
Near the end of one of the longer chapters, Data Integrity: What You Read Is What You Wrote, Raymond Blum and Rhandeev Singh tell of two times that Google staved off potential data loss with data saved offline. The first case study details how Google restored data to Gmail from GTape in 2011. The second addresses how the team dealt with the logistical challenge of restoring data to Google Music–from 5,000 tapes. Both illustrate the need for robust data recovery systems. As the authors write, “Recognizing that not just anything can go wrong, but that everything will go wrong is a significant step toward preparation for any real emergency.”
4. Improve reliability with distributed consensus
Few companies operate distributed systems at the scale Google does. Still, if you understand the architecture of distributed-consensus systems, you may be able to make decisions that increase the reliability of your own systems and services, for example by choosing to work with vendors that build systems based on these principles. Laura Nolan covers the essential concepts that managers of modern, multi-site data centers need to know in Managing Critical State: Distributed Consensus for Reliability.
5. Communication matters
Sometimes, seemingly simple changes make a difference, such as who leads a meeting. Niall Murphy (along with several co-authors of Communication and Collaboration in SRE) suggests that when two teams of SREs meet via video, it helps to have a person from the site with fewer people lead the meeting. It’s a subtle way to help balance out the power dynamic between two remote teams of differing sizes. Scientific? No. Useful? Yes.
SRE impact: services that work
Over the long term, SRE work produces highly automated systems that can be managed at increasingly higher levels of abstraction. As the team’s tagline says, “SRE is what you get when you treat operations as if it’s a software problem.”
Site Reliability Engineering is worth a read for anyone involved in IT operations. And it’s especially worth the time for people at large-scale enterprises with one or more data centers. After reading it, you won’t be able to replicate Google’s systems. You will, however, get some insight into how Google SREs approach work logically, solve problems, and communicate technical concepts clearly.
What’s on your IT reading list for the summer? Let me know in the comments or on Twitter (@awolber).