You’re reading this article on a glowing screen powered by an electric grid, the largest and most complex machine ever created. Largely built on 20th century technology, it is the result of millions of interconnected devices working together in highly synchronized ways. Each of these elements individually obeys the laws of physics. Yet how the whole actually works holds mysteries that can baffle even experts in complexity theory.
In other words, we know how the power grid works in theory. It is a collection of a large number of engineered devices, each of which has been programmed (like ants) to follow certain instructions, leading to highly coordinated collective behaviors.
But what if you wanted to build a modern electrical grid like, say, Tesla? You’d probably want to rely on open source innovation to handle grid-scale complexity. Open source for resilience, in other words. It’s an insight that a group of clever Tesla engineers recently revealed at a conference in the UK. They rely heavily on a concept called “digital twin.”
Of digital twins, Kubernetes, and Akka
As the Tesla engineers, Colin Breck and Percy Link, describe the term, a digital twin is the software representation of a physical Internet of Things (IoT) device—like a battery, an inverter, or a charger—modeled virtually. They use digital twin modeling to represent the current state and relationships of various assets in architecting a virtual power system.
Open source software plays a central role in the story Breck and Link tell. Because they cover a lot of (complex) ground, I’ll focus briefly on two open source projects important to Tesla to tackle complexity and ensure resilience: Kubernetes and Akka (from Lightbend, a project I first wrote about in 2014). Together they play a critically intertwined role in how distributed computing and IoT help Tesla guarantee grid resilience.
“(The) majority of our microservices run in Kubernetes, and the pairing of Akka and Kubernetes is really fantastic,” Breck said. “Kubernetes can handle coarse-grained failures in scaling, so that would be things like scaling pods up or down, running liveness probes, or restarting a failed pod with an exponential backoff. Then we use Akka for handling fine-grained failures like circuit breaking or retrying an individual request and modeling the state of individual entities like the fact that a battery is charging or discharging.”
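The fine-grained failure handling Breck describes—retries with exponential backoff and circuit breaking—can be sketched in a few lines. This is not Tesla’s code or Akka’s API; it’s a minimal, dependency-free illustration of the two patterns, with all names invented:

```python
import time


class CircuitBreaker:
    """Toy circuit breaker: after max_failures consecutive failures it
    'opens' and refuses further calls, so a struggling downstream
    service isn't hammered with more requests."""

    def __init__(self, max_failures=3):
        self.max_failures = max_failures
        self.failures = 0

    def call(self, fn, *args):
        if self.failures >= self.max_failures:
            raise RuntimeError("circuit open: refusing call")
        try:
            result = fn(*args)
            self.failures = 0  # a success closes the circuit again
            return result
        except Exception:
            self.failures += 1
            raise


def retry_with_backoff(fn, attempts=4, base_delay=0.01):
    """Retry fn, sleeping exponentially longer between attempts."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the failure
            time.sleep(base_delay * (2 ** attempt))
```

Kubernetes applies the same backoff idea when restarting a crashed pod; Akka applies it per request, which is why the two compose so cleanly at different granularities.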
For modeling each site in software—this so-called digital twin—they represent each site with an actor. The actor manages state, like the latest reported telemetry from a battery, and executes a state machine, changing its behavior if the site is offline and telemetry is delayed. It also provides a convenient model for distribution, concurrency, computation, and failover management.
The programmer worries about modeling an individual site in an actor, and then the Akka runtime handles scaling this to thousands or millions of sites. It’s a very powerful abstraction for IoT in particular, essentially removing the worry about threads, or locks, or concurrency bugs. The higher-level aggregations are also represented by individual actors, and actors maintain their relationships with other actors describing this physical or logical aggregation. Then the telemetry is aggregated by messaging up the hierarchy in-memory in near real-time, and how “real-time” the aggregate is at any level is really just a trade-off between messaging volume and latency.
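The per-site state machine described above can be sketched without any actor framework at all. The class below is a hypothetical stand-in for one site’s digital twin—the field names (`soc` for battery state of charge, the 60-second staleness window) are invented for illustration, not taken from Tesla’s system:

```python
class SiteTwin:
    """Toy digital twin for one site: stores the latest reported
    telemetry and runs a tiny state machine that flips the site to
    'offline' when telemetry goes stale."""

    def __init__(self, site_id, stale_after=60.0):
        self.site_id = site_id
        self.stale_after = stale_after  # seconds before data is stale
        self.telemetry = None
        self.last_seen = None

    def on_telemetry(self, reading, now):
        """Record a new telemetry reading and its arrival time."""
        self.telemetry = reading
        self.last_seen = now

    def state(self, now):
        """Derive the site's state from telemetry freshness."""
        if self.last_seen is None or now - self.last_seen > self.stale_after:
            return "offline"
        return "online"
```

In an actor runtime, each such twin would live inside its own actor and receive telemetry as messages; the runtime, not the programmer, handles scheduling millions of them.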
Tesla can query any node in this hierarchy to know the aggregate value at that location or query the latest telemetry from an individual site. It can also navigate up and down the hierarchy from any point. The services that perform this real-time hierarchical aggregation run in an Akka cluster. An Akka cluster allows a set of pods with different roles to communicate with each other transparently. The first role is a set of linearly scalable pods that stream data off Apache Kafka, and they use Akka Streams for back-pressure, bounded resource constraints, and then low latency stream processing.
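The queryable hierarchy—sites rolling up into higher-level aggregates, navigable from any point—can be illustrated with a plain tree. This is a simplification: the real system pushes updates as messages up the hierarchy in memory, while this sketch just recomputes on demand, and the node names and power values are made up:

```python
class AggNode:
    """Toy aggregation hierarchy: leaves hold a site's telemetry value
    (say, power in kW); inner nodes aggregate their children."""

    def __init__(self, name, value=0.0):
        self.name = name
        self.value = value       # meaningful only for leaves
        self.parent = None
        self.children = []

    def add_child(self, child):
        child.parent = self
        self.children.append(child)
        return child

    def aggregate(self):
        """Aggregate value at this node: own value for a leaf,
        sum of children otherwise."""
        if not self.children:
            return self.value
        return sum(c.aggregate() for c in self.children)

    def root(self):
        """Navigate up the hierarchy to the top-level node."""
        node = self
        while node.parent is not None:
            node = node.parent
        return node
```

Querying `aggregate()` at any node answers “what is the total at this location?”, which mirrors the article’s point that any level of the hierarchy is queryable.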
Akka’s essential role
If you look at the projects sponsored by the Cloud Native Computing Foundation and the projects that have revolved around Kubernetes, they tend to be low-level infrastructure and operations oriented. You’re far less likely to find application development and middleware types of things.
What Akka offers is a bridge between all that infrastructure stuff and the end user. You need a way to write that app. If you look only at what the Kubernetes ecosystem allows–a great set of lower-level tools–there is still a shortage of tools and guidance on how to stitch everything together into a working application. You, the developer, are left to figure out those details. It doesn’t help you with writing services, or figuring out a number of things like:
How they should interact,
How to maintain consistency of the data,
How you should handle failure/resilience,
How you should orchestrate them into holistic workflows, or
How to maintain end-to-end guarantees and SLAs.
When you send a message as a user, you want to have guarantees all the way down to the database and back, or all the way down to the service and back. Who makes sure of those end-to-end guarantees? That’s the middleware. As Breck and Link describe, Tesla uses Akka as that bridge.
Kubernetes gives you empty boxes (Docker containers, arranged in pods) and makes sure they are available and scale, but it doesn’t care much about the character of the application code you put inside the boxes (i.e., whether it’s available and consistent). So Akka and Kubernetes operate at different layers of the stack, working together to provide a single coherent system.
What makes Akka a great fit for Tesla’s use case? It’s built on the actor model. How does this model work? You start with these super lightweight, fully isolated, and autonomous actors/services, with the ability to easily run millions on a single laptop, which gives you an interesting way to model things like Tesla did. It’s perfect for IoT and the digital twin pattern where each physical device is mapped to a live actor. Good analogies can be found in nature where we have self-organizing systems built from autonomous “actors” with emergent behavior, like bacteria or ants.
These types of systems are naturally very resilient. Grids can’t be allowed to fail—when they do, the lights go out. Or worse, the internet flickers off.
Disclosure: I work for AWS, but nothing herein relates to my employment there.