Buoyant, a new startup founded by two senior engineers from Twitter, aims to become the Cisco of microservices with its open source project, Linkerd. Last month, as the project reached its one-year anniversary, I spoke with William Morgan, the creator of Linkerd and CEO of Buoyant.

Billed as a “service mesh” for cloud-native applications, Linkerd makes techniques originally developed during Twitter’s struggle against the Fail Whale available to anyone building such applications. A little over a year after its launch, Linkerd already powers some of the world’s most heavily trafficked websites, including Credit Karma, PayPal, and Ticketmaster.

With these big-name users of Linkerd in mind, I circled back with Morgan to better understand what he means by the term “service mesh” and why it matters for new-school enterprise apps.

You keep using that word

TechRepublic: You say Linkerd is a service mesh. But what is a service mesh?

Morgan: A service mesh is a dedicated infrastructure layer for handling service-to-service communication. For cloud-native applications, which can have potentially hundreds of services, each with hundreds or thousands of instances, this communication can be incredibly complex. But it also forms a crucial part of the application’s runtime behavior. Therefore, managing it is critical to ensuring end-to-end performance and reliability.

The cloud-native ecosystem has seen a steep rise in interest in this idea over the past year. Linkerd joined the Cloud Native Computing Foundation this January, alongside projects like Kubernetes and gRPC, and the service mesh idea now seems poised to become a standard component of the cloud-native stack.

Back to the TCP/IP future…sort of

TechRepublic: The service mesh manages communication. How does that relate to the network stack?

Morgan: It’s an abstraction on top of it. The service mesh assumes that the underlying L3/L4 network is in place and that bytes can get from one place to another on the network.

The service mesh is analogous to TCP/IP in some ways. Just as TCP abstracts the mechanics of reliably delivering bytes between network endpoints, the service mesh abstracts the mechanics of reliably delivering requests between services. Just like TCP, the service mesh doesn’t care about the actual payload. The application has a high-level goal (“send X from A to B”), and the job of the service mesh, like that of TCP, is to accomplish this goal, while handling as many sources of failure as possible.

Unlike TCP, the service mesh has another goal beyond “just make it work”: It provides a uniform, application-wide point for introducing visibility and control into the application runtime. The service mesh moves service communication out of the realm of the invisible, implied infrastructure, and into being a first-class member of the ecosystem, where it can be monitored, managed, and controlled.
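To make the “send X from A to B” goal concrete, here is a minimal Python sketch of a delivery wrapper that retries within a deadline and records what happened, so the communication is visible rather than implied. It is an illustration only, not how Linkerd is implemented: the names `deliver`, `Stats`, and `DeliveryFailed`, the parameters, and the choice to treat `ConnectionError` as the retryable failure are all assumptions made for this example.

```python
import time
from dataclasses import dataclass


class DeliveryFailed(Exception):
    """Raised when the request could not be delivered in time (hypothetical name)."""


@dataclass
class Stats:
    """Counters that give operators visibility into service communication."""
    attempts: int = 0
    successes: int = 0
    failures: int = 0


def deliver(send, payload, *, deadline_s=1.0, max_retries=3, backoff_s=0.0,
            stats=None, clock=time.monotonic, sleep=time.sleep):
    """Try to deliver an opaque payload, retrying until the deadline passes.

    `send` is any callable that raises ConnectionError on a transient failure.
    The payload is never inspected, mirroring the point that the mesh
    does not care about what it carries.
    """
    stats = stats if stats is not None else Stats()
    give_up_at = clock() + deadline_s
    last_error = None
    for _ in range(max_retries + 1):
        if clock() >= give_up_at:
            break  # deadline exceeded: stop retrying
        stats.attempts += 1
        try:
            result = send(payload)
        except ConnectionError as exc:
            stats.failures += 1
            last_error = exc
            # Wait before the next attempt, but never past the deadline.
            sleep(min(backoff_s, max(0.0, give_up_at - clock())))
            continue
        stats.successes += 1
        return result
    raise DeliveryFailed("request not delivered before deadline") from last_error
```

Because the wrapper sits between caller and callee, the counters in `Stats` are collected in one uniform place instead of being scattered through application code, which is the visibility-and-control point Morgan describes.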

Complex logic to simplify developers’ lives

TechRepublic: What does the service mesh actually do to make applications reliable? Does application code have to change?

Morgan: The service mesh is fundamentally an operational model, not a developmental one. The developer ideally shouldn’t even know it’s there. The service mesh provides the logic necessary to make service communication reliable, fast, and safe, without the developer needing to be aware of it.

This logic is pretty complex.

For instance, Linkerd provides reliability through a wide array of powerful techniques: circuit-breaking, latency-aware load balancing, eventually consistent (“advisory”) service discovery, retries, and deadlines. All these features must work in conjunction. Large-scale distributed systems provide many opportunities for small, localized failures to escalate into system-wide catastrophic failures. Many of Linkerd’s features are designed to safeguard against these escalations by shedding load when the underlying systems approach their limits.
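Of the techniques listed above, circuit-breaking is the easiest to show in a few lines. The following Python sketch is a generic illustration of the pattern, not Linkerd’s actual code: the class name `CircuitBreaker`, the `failure_threshold` and `reset_after_s` parameters, and the consecutive-failure policy are assumptions chosen for the example.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected without being attempted (hypothetical name)."""


class CircuitBreaker:
    """Stop calling a failing service so it gets a chance to recover."""

    def __init__(self, failure_threshold=3, reset_after_s=5.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self._clock = clock
        self._consecutive_failures = 0
        self._opened_at = None  # None means the circuit is closed

    @property
    def state(self):
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_after_s:
            return "half-open"  # cool-down elapsed: allow a trial call
        return "open"

    def call(self, fn, *args, **kwargs):
        state = self.state
        if state == "open":
            # Shed load instead of piling more requests onto a sick service.
            raise CircuitOpenError("downstream marked unhealthy")
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._consecutive_failures += 1
            # A failed trial call, or too many failures in a row, opens the circuit.
            if state == "half-open" or self._consecutive_failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Any success closes the circuit and clears the failure count.
        self._consecutive_failures = 0
        self._opened_at = None
        return result
```

A caller would wrap each outbound request, for example `breaker.call(fetch_profile, user_id)`, where `fetch_profile` stands in for any function that contacts another service. Rejecting calls while the circuit is open is one simple form of the load shedding described above: a localized failure is contained rather than amplified by a flood of doomed requests.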

A new way to deliver reliability

TechRepublic: How is service mesh different from traditional approaches to uptime, site performance, and reliability?

Morgan: Introducing new abstractions to a system is not to be taken lightly. Before cloud-native architectures arrived, service communication was typically just built into the application. What’s different now, with cloud-native applications, is that there are so many moving pieces to a single application: you could have hundreds of services and tens of thousands of instances.

If you looked at the typical medium-sized web application of the early 2000s, you’d see a three-tiered architecture. Application logic, web serving logic, and storage logic were each a separate layer. Communication between layers, while complex, was limited in scope–there were exactly two hops. Each layer had dedicated logic for managing any communication, and that was usually fine.

But bigger applications couldn’t do that, and this is where the roots of the service mesh model come in.

Companies like Google, Facebook, and Twitter implemented what was effectively a predecessor of the cloud-native approach, where applications were broken into many, many services. Instead of having tiers, you had a topology. In this world, a generalized communication layer suddenly became relevant, and you saw the rise of the “fat client” library, with Twitter’s Finagle, Facebook’s Wangle, and Google’s Stubby libraries being cases in point.

Today, the cloud-native architecture is the natural conclusion of this idea. The modern cloud-native app runs in containers, has an orchestrator layer, and is built as microservices. The path that a single request follows through the service topology in a cloud-native application can be quite complex. When you combine these three key characteristics of the cloud-native approach (microservices, containers, and orchestrators), you start to absolutely need a dedicated layer for handling service-to-service communication, decoupled from application code, and able to capture the dynamic nature of the underlying environment. This layer is the service mesh.