4 March 2021

Netflix Cosmos migration is a lesson in culture as well as technology

Netflix has reminded everyone why it is infinitely more than a mainstream SVoD platform. Revealing a new microservices-esque ecosystem called Cosmos, with glue dubbed Plato and a portal coined Nirvana, Netflix is in the process of executing its largest ever workload migration effort – with some stark lessons for technology and business cultures.

There is, however, one glaring omission from this week’s Netflix Technology Blog post – in the form of long-term Netflix cloud partner AWS. It suggests Cosmos, a cloud platform combining microservices with asynchronous workflows and serverless functions, signals Netflix coming of age by developing architecture independently of the cloud computing behemoth.

Netflix has long used AWS for almost all its cloud processing and storage requirements, covering databases, analytics, recommendation engines, video transcoding and more, as well as its own form of cloud security around Dynamic AWS Keys and the Cryptex secure storage system. Just about the only thing not handled by AWS is the Netflix CDN infrastructure, which is a strictly in-house development called Open Connect.

Before we dive into what comprises a Cosmos service and why Netflix has made this leap, we implore any technology organization, in any industry and of any size, to take a leaf from Netflix’s engineering culture. Netflix is only where it is today after doing away with hierarchical control, and its commitment to Cosmos is an example of how fusing infrastructure and media algorithm developer teams together can realize a vision that would not be possible in your typical top-down engineering environment.

Ever since its 2007 streaming breakthrough which saw the first-generation media processing and encoding platform go live, Netflix has developed a reputation for fierce in-house technology developments, snubbing best of breed technology suppliers to build both platform and application infrastructure from the ground up.

Years of R&D by Netflix’s Media Cloud Engineering and Encoding Technologies teams eventually produced Reloaded some seven years ago. This third-generation platform finally achieved the stability and scalability at massive scale needed to take Netflix to the next level as a technology powerhouse, after the second-gen system proved “extremely difficult to operate,” in Netflix’s own words.

A lot has changed in seven years, most visibly growth from around 57 million global subscribers to over 200 million today. With the company’s scale increasing ten-fold in this time, its developer workforce has more than tripled, leaving Netflix’s back office architecture looking and feeling like a lumbering giant rather than a nimble streaming pioneer. Its once successful centralized data model had become a monolithic burden that slowed down the process of rolling out new features to users, while production issues became an expensive liability for developers due to overlaps between infrastructure code and application code.

Drastic action was taken to create a workflow-driven, media-centric microservices architecture that eventually resulted in Cosmos, although the company maintains that a Cosmos service is not a true microservice, despite the similarities. A Cosmos service retains the strong contracts and segregated data/dependencies of a microservice, but adds multi-step workflows and computationally intensive asynchronous serverless functions.

The Netflix Technology Blog describes a programming model of microservices triggering workflows that orchestrate serverless functions. Netflix says this works well for most use cases, although for simpler applications the added complexity is not worth the benefits.

Cosmos was conceived on four pillars – observability (via logging, tracing, monitoring, alerting and error classification), modularity (a framework for structuring a service), productivity (development tools including test runners, code generators, and a command line interface), and delivery (a fully-managed continuous-delivery system of pipelines).

As Netflix shifts the majority of workloads from Reloaded to Cosmos this year, it plans to update the model with new use cases, while making the technology faster, more efficient, and easier to use.

A typical Cosmos service.

Take a video encoding service as an example. It is built from scale-agnostic components, including an API, workflows and functions, which have no special knowledge about the scale at which they are run. These components are built on top of three scale-aware Cosmos subsystems that handle the details of distributed work: Optimus, an API layer mapping external requests to internal business models; Plato, a workflow layer for business rule modeling; and Stratum, a serverless layer for running computationally intensive functions.

Netflix explains that these subsystems all communicate asynchronously and can be deployed independently through a purpose-built managed Continuous Delivery process. The idea is that separating them out makes it easier to write, test and operate Cosmos services.
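To make the layering more concrete, here is a minimal Python sketch of three independent stages that only talk to each other through asynchronous queues. It is our own illustration, not Netflix code: the class and function names (EncodeRequest, EncodeJob, api_layer, workflow_layer, function_layer, handle) are hypothetical, and in-process asyncio queues merely stand in for whatever messaging Cosmos actually uses between Optimus, Plato and Stratum.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class EncodeRequest:
    """Hypothetical shape of an external API request."""
    source: str
    recipe: str


@dataclass
class EncodeJob:
    """Hypothetical internal business model derived from a request."""
    job_id: int
    source: str
    recipe: str


async def api_layer(requests, workflow_queue):
    """Optimus-like role: map external requests to internal models."""
    for job_id, req in enumerate(requests, start=1):
        await workflow_queue.put(EncodeJob(job_id, req.source, req.recipe))
    await workflow_queue.put(None)  # sentinel: no more work


async def workflow_layer(workflow_queue, function_queue):
    """Plato-like role: apply business rules and decide what to run."""
    while (job := await workflow_queue.get()) is not None:
        # Assumed rule for illustration: every job needs one "encode" step.
        await function_queue.put(("encode", job))
    await function_queue.put(None)  # pass the sentinel downstream


async def function_layer(function_queue, results):
    """Stratum-like role: run the compute-heavy function."""
    while (item := await function_queue.get()) is not None:
        name, job = item
        await asyncio.sleep(0)  # stand-in for real computation
        results.append(f"{name}:{job.job_id}:{job.source}:{job.recipe}")


async def run_service(requests):
    workflow_queue = asyncio.Queue()
    function_queue = asyncio.Queue()
    results = []
    # The three layers run concurrently and share nothing but the queues.
    await asyncio.gather(
        api_layer(requests, workflow_queue),
        workflow_layer(workflow_queue, function_queue),
        function_layer(function_queue, results),
    )
    return results


def handle(requests):
    """Synchronous entry point for the sketch."""
    return asyncio.run(run_service(requests))
```

Because each stage depends only on the messages it receives, any one of them could in principle be rewritten, tested or redeployed without touching the other two, which is the property Netflix is describing.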

A snapshot of the observability portal, dubbed Nirvana, shows a typical service request in Cosmos, in this case to a video encoder service. Put simply, there is one API call to encode, which includes the video source and a recipe. The video is split into 31 chunks, and the 31 encoding functions run in parallel. The assemble and index functions are each invoked once, and the workflow is complete after 8 minutes.
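The shape of that request, a fan-out to parallel chunk encodes followed by a single assemble step and a single index step, can be sketched in a few lines of Python. This is an assumption-laden illustration rather than anything from the Netflix blog: the function names (split, encode_chunk, assemble, index, encode_workflow) are hypothetical, the "encoding" is just string formatting, and a local thread pool stands in for serverless functions.

```python
from concurrent.futures import ThreadPoolExecutor


def split(source, n_chunks):
    """Hypothetical splitter: name each chunk of the source."""
    return [f"{source}#chunk{i}" for i in range(n_chunks)]


def encode_chunk(chunk, recipe):
    """Stand-in for a compute-heavy encode of one chunk."""
    return f"{chunk}@{recipe}"


def assemble(encoded_chunks):
    """Invoked once: stitch the encoded chunks into a single asset."""
    return {"chunks": list(encoded_chunks)}


def index(asset):
    """Invoked once: attach a simple positional index to the asset."""
    return {**asset, "index": list(range(len(asset["chunks"])))}


def encode_workflow(source, recipe, n_chunks=31):
    """One request in, one indexed asset out."""
    chunks = split(source, n_chunks)
    # Fan out: one encode call per chunk, run in parallel.
    # pool.map returns results in input order, so chunks stay in sequence.
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        encoded = list(pool.map(lambda c: encode_chunk(c, recipe), chunks))
    # Fan in: assemble and index each run exactly once.
    return index(assemble(encoded))
```

Calling encode_workflow("title.mov", "h264-1080p") mirrors the single API call described above, with the source and recipe as its only inputs. The chunk count of 31 is taken from the example and would vary with the video.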

While the blog post does not elaborate on delivery beyond production to the CDN, we suspect the Open Connect team will have some ongoing involvement in the migration from Reloaded to Cosmos.

Back when Open Connect was founded in 2011, the possibilities of an elastic CDN were unrealized. Now, with cloud-native technologies and microservices, the private CDN has been unleashed into a container-based ecosystem with orchestration. It is capable of scaling instantly and caching instinctively depending on traffic, anticipating peaks with the help of central monitoring and analytics systems. It can even offload work onto public cloud infrastructure if private cloud capacity is not enough, with the orchestrator launching additional instances.

Netflix Open Connect designed a directed caching system with efficiency gains over standard on-demand-driven CDNs, to reduce overall demand on upstream network capacity by several orders of magnitude.

While Open Connect is not strictly open source software, it is the product of years of collaborative work. We wonder, then, whether Netflix would ever consider open sourcing elements of Cosmos.