
Netflix on how to build services that scale beyond millions of users

How Netflix plans to use microservices as it redesigns its cloud-based systems to cope with exponential growth in its userbase.

Millions of people worldwide stream more than two billion hours of video from Netflix each month.

Over the past seven years, Netflix has grown from offering online movies to a few thousand members to the point where its streaming services last year accounted for 34 percent of peak downstream traffic in the US.

For every person watching movies or TV through Netflix, the service generates yet more information describing what is being watched and how. This viewing data details which titles an account has watched, at what point someone stopped viewing a title and what else is being streamed on an account at that moment.

Netflix can handle the billions of instances of viewing data generated each day, but the firm believes it will need to re-architect its systems to keep pace with ever-increasing demand.

"In order to scale to the next order of magnitude, we're rethinking the fundamentals of our architecture," said Philip Fisher-Ogden, director of engineering at Netflix, in a blog post.

Microservices

As part of the re-imagining, Netflix intends to break down elements that handle viewing data into decoupled components known as microservices.

Microservices mark a shift away from building applications as a monolithic unit, for example a server-side application that handles HTTP requests, executes domain logic and fetches and updates data from a database. The downside of building these various responsibilities into a single executable is that any small coding change requires building and deploying a new version of the server-side application.

In contrast, the microservice approach favours building a single application as a suite of small services, each running in its own process and communicating with lightweight mechanisms, often an HTTP resource API.

Building applications in this way has numerous benefits: each service can be matched to well-defined business requirements, and new instances can be deployed automatically as demand increases. Because the services are discrete, they can also be written in different programming languages and rely on different data storage technologies. Debugging and refactoring code can also be simpler when applications are composed of microservices.
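To make the idea concrete, a viewing-history microservice exposing an HTTP resource API can be sketched in a few lines of Python. This is an illustration only, not Netflix's actual code - the member IDs, titles and in-memory store are all hypothetical:

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-memory viewing history; a real service would own
# its data in a database that no other service touches directly.
VIEWING_HISTORY = {"member-42": ["House of Cards", "Orange Is the New Black"]}

class ViewingHistoryHandler(BaseHTTPRequestHandler):
    """Tiny HTTP resource API: GET /<member-id> returns that member's history."""

    def do_GET(self):
        member_id = self.path.lstrip("/")
        body = json.dumps(VIEWING_HISTORY.get(member_id, [])).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the example quiet

# Run the service in its own thread on an ephemeral port, then call it
# over HTTP just as any other service in the suite would.
server = HTTPServer(("127.0.0.1", 0), ViewingHistoryHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f"http://127.0.0.1:{server.server_address[1]}/member-42"
history = json.loads(urllib.request.urlopen(url).read())
print(history)
server.shutdown()
```

Because the service runs in its own process and owns its own data, it can be redeployed, scaled or even rewritten in another language without touching the rest of the application.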

Netflix's plan is to have separate services for handling the collection, processing and provision of viewing data, with decoupled communication between components using signals sent through an event queue.
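The decoupling via an event queue can be sketched as a producer-consumer pair, with Python's standard-library queue standing in for a real message broker. The service names and event fields below are illustrative, not taken from Netflix's systems:

```python
import queue
import threading

# Stand-in for a real event queue (e.g. a message broker).
event_queue = queue.Queue()
processed = []

def collection_service(events):
    """Collects raw viewing events and publishes them to the queue,
    without knowing anything about who consumes them."""
    for event in events:
        event_queue.put(event)
    event_queue.put(None)  # sentinel: no more events

def processing_service():
    """Consumes events from the queue and derives processed viewing data."""
    while True:
        event = event_queue.get()
        if event is None:
            break
        processed.append({"member": event["member"],
                          "title": event["title"],
                          "status": "processed"})

raw_events = [
    {"member": "member-42", "title": "House of Cards"},
    {"member": "member-7", "title": "Narcos"},
]

consumer = threading.Thread(target=processing_service)
consumer.start()
collection_service(raw_events)
consumer.join()
print(len(processed))
```

The collection and processing services share only the queue and the event format, so either side can be redeployed or scaled independently.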

Polyglot persistence

Netflix's other architectural change will be a shift to polyglot persistence. This approach means using different databases or storage technologies to store data, depending on the type of data or how it is used. Data usage can differ in a number of ways, for instance in the volume being stored or the maximum acceptable latency for reading or writing that data to storage. Polyglot persistence allows data to be matched to a complementary storage technology.

Netflix will use the open source, NoSQL distributed database Cassandra for high volume, low-latency writes and the open source, NoSQL database Redis for high volume, low-latency reads.
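A polyglot-persistence split like this can be sketched as a write path into a durable primary store with fan-out to a fast read cache. The in-memory dicts below merely stand in for Cassandra and Redis; the function names and record fields are hypothetical:

```python
# Toy stand-ins: a durable primary store (Cassandra's role) and a
# low-latency read cache (Redis's role). Not real client libraries.
cassandra_like_store = {}
redis_like_cache = {}

def write_viewing_record(member_id, record):
    """Writes land in the durable store first, then fan out to the read cache."""
    cassandra_like_store.setdefault(member_id, []).append(record)
    redis_like_cache[member_id] = list(cassandra_like_store[member_id])

def read_viewing_records(member_id):
    """Reads prefer the low-latency cache, falling back to the primary store."""
    if member_id in redis_like_cache:
        return redis_like_cache[member_id]
    return cassandra_like_store.get(member_id, [])

write_viewing_record("member-42", {"title": "House of Cards", "position_s": 1800})
records = read_viewing_records("member-42")
print(records)
```

The point of the pattern is that each workload - high-volume writes, low-latency reads - talks to the storage technology best suited to it, rather than forcing one database to do both.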

Why change?

Netflix's Fisher-Ogden says the changes are being driven by a desire to tackle the pain points in its current architecture, which runs on the Amazon Web Services (AWS) cloud platform.

At present Netflix accounts can access their latest viewing data from the memory of various "stateful nodes" - memory-optimised R3 instances on AWS' EC2 service.

"The biggest downside of this approach was that failure of a single stateful node would prevent 1/nth of the member base from writing to or reading from their viewing history," Fisher-Ogden said.

After experiencing outages, the Netflix team also implemented a stateless tier that could fetch stale data as a fallback when a stateful node was unreachable.

It is the components that were combined together in the stateful architecture that Fisher-Ogden said "should be separated out into services" as part of the redesign.

"We created the viewing service to encapsulate the domain of viewing data collection, processing, and providing," he said, adding the service had accrued more functionality over time.

"These components would be easier to develop, test, debug, deploy, and operate if they were extracted into their own services."

Additionally, when Netflix users access this stateful tier, it splits demand between machines in a way that can create hot spots - where particular machines end up responsible for storing and serving a disproportionately high share of data and requests, worsening latency and performance.
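A hot spot can be illustrated with a simple partitioning sketch: viewing data is routed to nodes by hashing the member ID, and one unusually heavy account drags its node's load far above the average. The member IDs, node count and traffic numbers here are all made up for illustration:

```python
import zlib
from collections import Counter

NODES = 4

def node_for(member_id):
    """Partition viewing data by hashing the member id modulo the node count.
    crc32 is used because, unlike Python's hash(), it is stable across runs."""
    return zlib.crc32(member_id.encode()) % NODES

# Illustrative traffic: 40 light members send one event each,
# while a single heavy account streams constantly.
requests = [f"member-{i}" for i in range(40)]
requests += ["heavy-member"] * 200

load = Counter(node_for(m) for m in requests)
hot_node = node_for("heavy-member")
print(dict(load))  # one node carries far more than its fair share
```

With 240 requests spread over four nodes, a fair share would be 60 each; the node that happens to host the heavy account serves more than 200, which is exactly the latency-worsening skew described above.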

[Image: The building blocks of Netflix's next-generation architecture. Source: Netflix]

As Netflix expanded from running in a single AWS region to multiple regions - the locations where AWS services are hosted - the firm also found it had to build a custom mechanism to communicate between stateful tiers in different regions.

"This added significant, undesirable complexity to our overall system," said Ogden-Fisher.

When it comes to storing this viewing data, Netflix currently uses Cassandra as the primary data store for all persistent data, with memcached layered on top of Cassandra as a guaranteed low-latency path for reading possibly stale data.

Switching from memcached to Redis for high-volume, low-latency reads is worthwhile, said Fisher-Ogden, as "Redis' first-class data type support should support writes better than how we did read-modify-writes in memcached".
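The difference Fisher-Ogden is pointing at can be sketched with toy stand-ins for the two stores: appending to a list in memcached means reading a blob, modifying it and writing it back, while Redis understands lists natively, so the append is a single server-side command. The classes below are illustrative fakes, not the real memcached or Redis client libraries:

```python
import json

class FakeMemcached:
    """Stores opaque blobs: appending to a list means read, modify, write back."""
    def __init__(self):
        self._blobs = {}
    def get(self, key):
        return self._blobs.get(key)
    def set(self, key, value):
        self._blobs[key] = value

class FakeRedis:
    """Understands lists natively: an append is one server-side operation."""
    def __init__(self):
        self._lists = {}
    def rpush(self, key, value):
        self._lists.setdefault(key, []).append(value)
    def lrange(self, key):
        return self._lists.get(key, [])

# memcached pattern: three steps, and another writer could interleave between them.
mc = FakeMemcached()
blob = mc.get("history:42") or json.dumps([])
titles = json.loads(blob)                  # read
titles.append("House of Cards")            # modify
mc.set("history:42", json.dumps(titles))   # write back

# Redis pattern: one command against a first-class list type.
r = FakeRedis()
r.rpush("history:42", "House of Cards")

print(json.loads(mc.get("history:42")), r.lrange("history:42"))
```

Collapsing the read-modify-write into a single list operation removes both a round trip and a window in which concurrent writers can clobber each other's updates.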

"We created the stateful tier because we wanted the benefit of memory speed for our highest volume read/write use cases," said Ogden-Fisher, adding that Cassandra in its pre-1.0 versions wasn't running on solid-state drives, which can read and write data far faster than spinning disc hard drives, on AWS.

"We thought we could design a simple but robust distributed stateful system exactly suited to our needs, but ended up with a complex solution that was less robust than mature open source technologies."

About Nick Heath

Nick Heath is chief reporter for TechRepublic. He writes about the technology that IT decision makers need to know about, and the latest happenings in the European tech scene.
