With more than 80 million people worldwide using Netflix to watch 125 million hours of TV and movies each day, the video-streaming service knows all about running tech at scale.
Given the size of Netflix’s ops, when the company upgrades its core technologies, it has to be sure those changes will make a significant difference to its service.
Helping oversee the implementation of those difficult decisions is Yunong Xiao, platform architect at Netflix and leader of the Node.js platform team.
Xiao is charged with overseeing a project to rearchitect the APIs responsible for fetching data every time a user plays a video or browses Netflix’s library.
That might sound like a daunting task, and Xiao describes it as a project with “a fair amount of risk involved” — albeit offset by extensive testing and a phased rollout.
But the benefits promise to be considerable. By redesigning the data access service, which Netflix refers to as Edge, Xiao says the company will become more productive, save money, offer a more reliable service and make better use of staff skills.
Here’s how Netflix expects to benefit from rebuilding the Edge service using containers running Node.js apps.
Easier to manage tech
One of the goals of the redesign is to remove roadblocks for engineers who work on Netflix apps running on mobile, TV and desktop.
At present, the Edge service, which fetches data every time a tablet, TV, PC or any other device connects to Netflix, is unwieldy. The service is based on a monolithic Java Virtual Machine (JVM) process, which demands a huge amount of memory and relies on a complex mass of scripts and software libraries to set up and run.
Netflix is almost entirely run on Amazon Web Services (AWS), and while the cloud platform’s EC2 service can provision a virtual machine with the memory necessary to support Edge, it’s very difficult for Netflix engineers to put together a local system that can run the process. However, spinning up a local system in this way, rather than using an instance running on EC2, is sometimes desirable for testing.
Even if engineers manage to secure the hardware needed to run the Edge service locally, they still need to recreate the mass of software libraries and scripts needed to support it, gluing together the many scripts and ensuring the right versions of software are running.
“That’s generally a very tedious and arduous process to go through when you want to test something,” said Xiao.
Managing the testing process will become far simpler once the Edge service is rebuilt around Node.js apps running inside containers.
At present, the data access Edge service runs many different scripts, written in the Groovy programming language, on top of a single JVM process. Each of the many client teams within Netflix, ranging from those handling the service’s iPhone app to the website, runs its data-fetching scripts on the same JVM process, which contributes to the service’s sprawling size.
Under the new architecture, each of these Groovy scripts will be replaced by a Node.js app running inside a container.
The new architecture will be easier to work with from several standpoints. Recreating the data access service locally for testing becomes far simpler, as the hardware requirements are far less taxing. Rather than having to set up a test version of the monolithic JVM process for scripts to run on top of, each team will run just their data access Node.js apps and their containers on top of an OS, resulting in a more lightweight computing infrastructure that is easier to spin up locally.
These Docker containers can run on top of many different computing platforms and bundle together all of the software dependencies each Node.js app needs. This portability means engineers recreating test systems locally are spared the hassle of sourcing all of the necessary libraries and scripts. Knowing that test systems mirror the production systems also provides stronger guarantees that software will behave the same in both environments.
The actual process of testing will be much improved too, thanks to the availability of Node.js debugging tools for stepping through the code and monitoring performance.
Xiao said these tools will be a “huge boon” to engineers who currently have to go through the long and tedious cycle of running code on test instances on EC2, seeing what is logged by print statements inside the code, making changes and then repeating the process.
“We have engineers who come up to us every day and say, ‘Hey, when is it [the new architecture] going to be ready?’” said Xiao, adding that engineers are telling him: “I don’t really want to have to use the old stack anymore because it takes me tens of minutes and I have to sit around and twiddle my thumbs to wait to test my script.”
“The engineers who have been working on this have given us valuable positive feedback and engineers who have heard about this project are itching to get their hands on it.”
For an online company that relies on software to deliver its service to its customers, improving the efficiency of its developers should have a tangible effect on the business, he said.
“If we can give our engineers a 2x or 3x gain in productivity, that’s a tremendous saving for the business and means we can innovate more rapidly than we are today.”
More reliable service
The architecture of the Edge service that all of Netflix’s apps use to access back-end data makes it vulnerable to being taken down by errors.
Because the Edge service runs on a single process, if one of the Groovy scripts running on top of it crashes it can bring down the process, toppling a service that every Netflix client — from iPhone apps to the Netflix.com website — relies upon.
“There’s no process isolation either, imagine if one script has an issue, that can often take down the entire process, which takes down all of Netflix,” said Xiao.
Rearchitecting the data access system so that each Netflix client team relies on its own cluster of Node.js apps, each running inside a container, will remove this vulnerability.
After the change, the worst an errant line of code within an app will do is crash that individual app.
“We’re isolating each individual client from each other, so they can shoot their own feet off but they can’t take down all of Netflix. That’s also a tremendous improvement over the old stack,” said Xiao.
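The difference in failure scope can be illustrated with a toy simulation. This is only a sketch under stated assumptions: `SimulatedProcess`, the `scripts` table and the `"unavailable"` response are hypothetical names invented for illustration, the "processes" are plain objects rather than real operating-system processes or containers, and none of it reflects Netflix's actual code.

```typescript
// A handler stands in for one client team's data-fetching script.
// Throwing simulates a fatal bug that would crash its host process.
type Handler = () => string;

// Hypothetical stand-in for an OS process: once any hosted script
// crashes, the whole process stops serving every script it hosts.
class SimulatedProcess {
  private alive: boolean = true;
  private handlers: Record<string, Handler>;

  constructor(handlers: Record<string, Handler>) {
    this.handlers = handlers;
  }

  handle(client: string): string {
    const handler = this.handlers[client];
    if (!this.alive || handler === undefined) {
      return "unavailable";
    }
    try {
      return handler();
    } catch {
      // The crash takes down this process, not just this one request.
      this.alive = false;
      return "unavailable";
    }
  }
}

// Illustrative scripts: the website script contains a fatal bug.
const scripts: Record<string, Handler> = {
  iphone: () => "iphone data",
  website: () => {
    throw new Error("bug in website script");
  },
};

const requestOrder = ["website", "iphone"];

// Old model: one shared process hosts every client's script.
const shared = new SimulatedProcess(scripts);
const sharedResults = requestOrder.map((client) => shared.handle(client));

// New model: each client gets its own process hosting only its script.
const isolated: Record<string, SimulatedProcess> = {};
for (const client of Object.keys(scripts)) {
  isolated[client] = new SimulatedProcess({ [client]: scripts[client] });
}
const isolatedResults = requestOrder.map((client) =>
  isolated[client].handle(client)
);
```

In the shared model the website bug leaves the iPhone client without data as well, while in the isolated model only the website's own process is lost and the iPhone client keeps being served.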
The shift to this container-based architecture for retrieving data also promises to cut Netflix’s infrastructure costs.
Netflix is available across many different types of devices, from TV sticks to mobile phones, and the number of concurrent users of Netflix apps on these devices can vary hugely.
The problem with Netflix’s existing architecture for data access is that, because each of the apps running on different devices relies on the same back-end process, that process has to be “horizontally scaled out for the script that’s under the heaviest load”.
The upshot can be that a data access API that’s seldom used ends up requiring the same amount of memory as an API that’s under constant heavy load.
Under the new architecture, each Netflix client will access data using its own cluster of Node.js apps running inside containers. “There’ll be some huge infrastructure gains because it means that these low [request number] APIs, we can isolate them onto a handful of containers in each region and that’s it.
“We can [also] dynamically scale up and down the big APIs. They’re only using the resources they need to serve their request and no-one else’s,” said Xiao.
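A rough back-of-the-envelope model shows why scaling each API separately can need less memory. Everything in this sketch is an assumption made for illustration: the API names, request rates, per-copy memory figures and the `REQUESTS_PER_SECOND_PER_COPY` capacity are invented, not Netflix numbers, and the model ignores real-world factors such as redundancy and regional deployment.

```typescript
// One data access API (script or app) and its hypothetical load profile.
interface ApiLoad {
  name: string;
  peakRequestsPerSecond: number;
  memoryMbPerCopy: number; // memory one running copy of this API needs
}

// Assumed throughput of a single running copy of any API.
const REQUESTS_PER_SECOND_PER_COPY = 1000;

// Copies needed to serve an API's peak load (at least one).
function copiesNeeded(api: ApiLoad): number {
  return Math.max(
    1,
    Math.ceil(api.peakRequestsPerSecond / REQUESTS_PER_SECOND_PER_COPY)
  );
}

// Shared process: every instance loads every script, and the fleet is
// scaled out for whichever script is under the heaviest load.
function sharedProcessMemoryMb(apis: ApiLoad[]): number {
  if (apis.length === 0) {
    return 0;
  }
  const instances = Math.max(...apis.map(copiesNeeded));
  const memoryPerInstance = apis.reduce(
    (sum, api) => sum + api.memoryMbPerCopy,
    0
  );
  return instances * memoryPerInstance;
}

// Isolated containers: each API is scaled only for its own load.
function isolatedMemoryMb(apis: ApiLoad[]): number {
  return apis.reduce(
    (sum, api) => sum + copiesNeeded(api) * api.memoryMbPerCopy,
    0
  );
}

// Invented example figures: two busy APIs and one seldom-used API.
const exampleApis: ApiLoad[] = [
  { name: "tv", peakRequestsPerSecond: 50000, memoryMbPerCopy: 200 },
  { name: "mobile", peakRequestsPerSecond: 20000, memoryMbPerCopy: 200 },
  { name: "legacy-device", peakRequestsPerSecond: 500, memoryMbPerCopy: 200 },
];

const sharedTotalMb = sharedProcessMemoryMb(exampleApis); // 50 * 600 = 30000
const isolatedTotalMb = isolatedMemoryMb(exampleApis); // 10000 + 4000 + 200 = 14200
```

With these made-up figures the seldom-used API is loaded on all 50 instances in the shared model, but occupies a single container once it is isolated, which is the effect Xiao describes for low-traffic APIs.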
Containers also have the benefit of being a particularly efficient form of virtualization. While virtual machines package together an entire OS and all of its software, containers only bundle the software dependencies exclusive to the application they encapsulate. Multiple containers can run on and share access to the same underlying operating system, as well as common software libraries. Since each container is bundling far less software than a VM, each container takes fewer resources to run.
“I imagine there will be some significant ROI in terms of infrastructure. That, coupled with the fact we’re improving developer productivity,” said Xiao.
That said, Netflix has had to make investments up front in making this shift, primarily in developing its own tools for orchestrating containers on top of AWS.
“All of these technologies are so new, we’ve had to make really major investments in container infrastructure and orchestration efforts,” said Xiao.
“Right now there’s not really any off-the-shelf software that’s available for you to reliably scale and maintain containers inside of a cloud environment.
“So we had to build our own team that did that, that was on our open-source software.”