In the summer of 2013, Twitter decommissioned the Fail Whale, an iconic image of failure that was originally created as a birthday card illustration entitled “Lifting a Dreamer.” For Twitter users, the image was more of a nightmare. The whale first surfaced in May 2008 to taunt users as the fast-growing service crashed repeatedly. The constant service outages threatened the viability of this high-profile darling startup.
So, what defeated the Fail Whale and allowed Twitter to grow even faster and eventually IPO?
Much of the credit to date has gone to an open source cluster manager framework, Apache Mesos. But William Morgan, an early engineer at Twitter, said Twitter had another secret up its sleeve: a library called Finagle, which powered Twitter’s internal networking. Morgan was so excited by the potential of Finagle that he left Twitter in 2015 to found Buoyant, a startup he heads as CEO. Morgan and his co-founder launched an open source project called Linkerd to extend the power of Finagle.
I caught up with Morgan recently on Linkerd’s one-year anniversary and talked to him about why he thinks the next generation of cloud applications needs a new networking layer, which he calls the “service mesh.”
Killing the Fail Whale
TechRepublic: Most people in Silicon Valley are familiar with Apache Mesos, and it’s gotten most of the credit for helping Twitter overcome its scaling challenges. Finagle is new to me. What is it?
Morgan: To answer that requires a little history.
Twitter’s original monolithic Ruby on Rails application, which we lovingly called “the monorail,” was less of a monorail and more like an old Volkswagen Beetle that someone had strapped a jet engine onto. Twitter was accelerating from 0 to 100 MPH faster than any online service in history, and the Beetle couldn’t take it. Its wheels fell off. We couldn’t keep up with the growth of users and traffic to the site. So, we knew we had to make a significant investment in core infrastructure, and Twitter went through this intensive four- or five-year investment.
As a result, all these fantastic infrastructure layers were built in house. Mesos is a great example of that. Closer to my work at Twitter was a platform called Finagle, which managed all of the communication between services. Service A wants to talk to service B, and Finagle makes that happen.
It started as a simple connection layer between libraries. But once we had it running, we realized the whole app was being powered by Finagle. It was like a mesh that all the services were embedded in, and we started adding more and more capabilities to Finagle. So, the service owner would just say “make this request” but under the hood, Finagle was doing load balancing, latency and error handling, talking to service discovery, distributed tracing, and reporting metrics, all without the application owner really being involved.
And, it was doing it at the request level. Finagle taught us that, for a highly distributed system with tight SLAs, we needed to think about the network beyond Layer 3 or 4, beyond just TCP packets. We needed to operate at the request level. Because the ways systems would fail at Twitter often started with one tiny thing going wrong somewhere deep in the system, and this tiny failure would cascade and bring down the entire site.
It turns out this is a classic problem with distributed systems.
Finagle was a huge step towards solving that, because it gave us ways to prevent these failures from spreading, it optimized the request flow, and most importantly, it gave us a way to operate at a higher level of abstraction. It started very tactical and ended up as a platform, a “service mesh.”
Managing reliability at this level was a core breakthrough in Twitter’s adoption of microservices. And we see this happening in companies all over now. Microservices represent the biggest disruption in enterprise technology this decade. Every layer of the infrastructure stack is going to have to deal with this.
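As a rough illustration of what working "at the request level" can mean, the sketch below wraps a set of service instances in a client that rotates across them, retries a failed request on the next instance, and counts outcomes as metrics. It is written in Python for brevity and is not Finagle's API; `RequestLevelClient`, `backends`, `max_retries`, and the two toy backends are hypothetical names chosen for this example.

```python
import itertools
from collections import Counter
from typing import Callable, Sequence


class RequestLevelClient:
    """Hypothetical wrapper: round-robin load balancing, retries, and metrics."""

    def __init__(self, backends: Sequence[Callable[[str], str]], max_retries: int = 2):
        if not backends:
            raise ValueError("at least one backend is required")
        if max_retries < 0:
            raise ValueError("max_retries must be non-negative")
        # cycle() hands out backends in round-robin order, one per attempt.
        self._backends = itertools.cycle(backends)
        self._max_retries = max_retries
        self.metrics = Counter()

    def request(self, payload: str) -> str:
        last_error = None
        # One initial attempt plus up to max_retries retries on the next backend.
        for _ in range(self._max_retries + 1):
            backend = next(self._backends)
            self.metrics["attempts"] += 1
            try:
                response = backend(payload)
            except ConnectionError as exc:
                self.metrics["failures"] += 1
                last_error = exc
                continue
            self.metrics["successes"] += 1
            return response
        # Every attempt failed: surface the most recent error to the caller.
        raise last_error


def healthy(payload: str) -> str:
    """Toy backend that always answers."""
    return f"ok:{payload}"


def broken(payload: str) -> str:
    """Toy backend that is always unreachable."""
    raise ConnectionError("instance unavailable")


client = RequestLevelClient([broken, healthy])
result = client.request("timeline")
```

The caller only says "make this request"; the failed attempt against the broken instance is absorbed by the wrapper and shows up in `client.metrics` rather than in application code.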
Twitter’s Fail Whale goes mainstream
TechRepublic: How did Finagle become Linkerd?
Morgan: When we left Twitter to start Buoyant, our goal was to take the years of operational knowledge running microservices at scale and turn it into something the rest of the world could use. Twitter felt like it was a few years ahead of the curve, but we knew that the rest of the world would have to do the same architectural shift. It’s unavoidable. The internet keeps getting bigger, “being scalable” is a progressively higher bar to meet, and at the same time everything is getting moved into the cloud and virtualized hardware where the old reliability guarantees around resource isolation and everything else no longer apply.
So everyone will have to deal with this, and they’ll deal with it by moving into what’s being called the “cloud native architecture”: containers, microservices, and orchestrated environments.
We saw Finagle’s transformative power at Twitter in this model, but we knew Finagle itself wouldn’t be enough. Finagle is a library. If you want to use Finagle you have to write your app in Java or Scala, and to use it really effectively, you need to restrict your whole stack to those languages.
With Linkerd we wanted to provide the operational model that Finagle gave us, decoupled from any choice of programming language or stack or infrastructure. We wanted to create this broader abstraction for microservices communication that could span all languages and frameworks for cloud native apps. So Linkerd was really about making the Finagle approach to reliable microservices possible for mass consumption by mainstream enterprises.
Standardizing operations at scale
TechRepublic: What’s next for Linkerd?
Morgan: We’ll keep investing in the Linkerd community by helping users and educating folks who are migrating to microservices and need a way to manage that complexity. We’re fortunate that we’ve found some really engaged and excited users over the course of the past year who are contributing back to the project in very substantial ways and helping others. It feels really good to see that.
As for Buoyant the company, beyond keeping Linkerd growing and healthy and better every day, we want to be in the business of helping companies make their cloud-native applications more reliable and also more secure. The shift towards microservices, containers, and distributed applications is happening faster than almost anyone expected. There is too much entropy and no emergence yet of obvious, well-known best practices that you just adopt. We’re at an inflection point.
We think Linkerd can be the standard for how you solve these critical reliability issues.
The service mesh moves communications logic out of the application and into the underlying infrastructure. This is not just a good idea, it’s critical for survival. Just as applications shouldn’t be writing their own TCP stack, they also should not be managing their own load balancing logic, retry and timeout logic, circuit breaking, and all the other stuff you have to do. That’s why the service mesh model is so powerful.
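Circuit breaking, one of the behaviors named above, can be sketched in a few lines. The Python example below is a minimal illustration under simple assumptions (a consecutive-failure threshold and a fixed cool-down), not how Linkerd or Finagle implement it; `CircuitBreaker`, `CircuitOpenError`, `failure_threshold`, `reset_after`, and the injected `clock` are hypothetical names for this example.

```python
import time
from typing import Any, Callable, Optional


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected without reaching the downstream service."""


class CircuitBreaker:
    """Hypothetical breaker: opens after N consecutive failures, then cools down."""

    def __init__(self, failure_threshold: int = 3, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, func: Callable[..., Any], *args: Any) -> Any:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after:
                # Fail fast so a struggling service is not hit with more load.
                raise CircuitOpenError("circuit open; failing fast")
            # Cool-down has elapsed: let one trial call through (half-open).
        try:
            result = func(*args)
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Any success closes the circuit and resets the failure count.
        self._failures = 0
        self._opened_at = None
        return result


def always_fails() -> str:
    """Toy downstream call that never succeeds."""
    raise ConnectionError("downstream unavailable")


fake_now = [0.0]  # controllable clock so the example does not need to sleep
breaker = CircuitBreaker(failure_threshold=2, reset_after=10.0,
                         clock=lambda: fake_now[0])
outcomes = []
for _ in range(3):
    try:
        breaker.call(always_fails)
    except CircuitOpenError:
        outcomes.append("rejected")
    except ConnectionError:
        outcomes.append("failed")
```

After two real failures the third call is rejected locally, which is the mechanism that keeps one small failure from cascading. In the service mesh model this logic would live in the infrastructure layer rather than being rewritten inside each application.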