
2 October 2025

Netflix answers our call for live challenges, kind of – FREE TO READ

It’s finally here. Part 2 of Netflix’s neuron-massaging technical blog series, ‘Behind the Streams’, has landed a little over two months after the first installment—which is a fairly speedy sequel for a lumbering giant like Netflix.

Part 1, we noted, was mostly self-congratulatory fluff, with Netflix researchers patting Netflix managers on the back for allowing them to publish anything at all. What Faultline wanted was a little more honesty; some real war stories straight from the trenches of live streaming. So, does Part 2 finally offer a peek behind the curtain at the many migraines associated with scaling live streaming?

Kind of.

What Netflix’s researchers have done (under the watchful eye of marketing) is offer a rare glimpse at some of the challenges Netflix faced while getting its nascent live streaming strategy off the ground around 2022, when it started to pull the pieces together. Many of these barriers will be familiar to anyone versed in broadcast-to-streaming workflows. And for the many video services out there dreaming of scale, invaluable lessons lie beneath the verbose prose.

Allow us to translate Netflix’s latest blog post into something digestible.

Cloud first, chaos later:

Netflix chose to build its feed acquisition and processing straight in the cloud (AWS), because in 2023, when it hosted its inaugural live comedy special, it had no broadcast infrastructure.

Simply put, two dedicated internet access (DIA) circuits were established to deliver UHD feeds at 50 Mbps using HEVC directly to cloud encoders (pictured). This worked, for a while, though the Chris Rock special was not exactly a serious stress test.

As Netflix tried more live events globally, cracks appeared. Scalability limits reared their head fast, and the company pivoted to a “hub-and-spoke” model—with production feeds aggregated at a broadcast facility before cloud ingestion. As already covered in part 1, AWS Elemental MediaConnect was a critical part of this infrastructure for video transport and processing—integrated with Netflix’s own cloud encoding services.

With failure not an option, redundancy became something of an obsession for Netflix. Each feed traveled as four independent streams, protected by SMPTE 2022-7 seamless switching, selecting the best packets in real time.
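For the technically curious, the principle behind that selection is simple enough to sketch. The snippet below is purely illustrative, assuming RTP-style sequence numbers, and is not Netflix’s implementation: the receiver forwards whichever copy of each packet arrives first and quietly drops the duplicates.

# Illustrative sketch of the packet-selection idea behind SMPTE 2022-7
# style "seamless" protection: identical streams arrive over diverse
# paths, and the receiver keeps whichever copy of each packet lands first.
# Class and field names are invented for illustration.

class SeamlessMerger:
    def __init__(self, window=4096):
        self.window = window          # how many recent sequence numbers to remember
        self.seen = set()             # sequence numbers already forwarded
        self.order = []               # FIFO used to age out old sequence numbers

    def accept(self, path_id, seq, payload):
        """Return the payload if this packet is new, else None (duplicate)."""
        if seq in self.seen:
            return None               # already delivered by another path
        self.seen.add(seq)
        self.order.append(seq)
        if len(self.order) > self.window:
            self.seen.discard(self.order.pop(0))
        return payload

# Feed packets from all four redundant streams into one merger;
# downstream sees a single, de-duplicated packet flow.
merger = SeamlessMerger()
for path_id, seq, payload in [(0, 100, b"A"), (1, 100, b"A"), (1, 101, b"B")]:
    if merger.accept(path_id, seq, payload) is not None:
        print(f"forwarding seq {seq} from path {path_id}")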

One useful nugget from the post is that, for high-motion content like WWE, Netflix allocated higher bitrates, similar to what the streamer was already doing with its SVoD encodes.
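The post gives no numbers, so treat the following as a toy illustration of content-aware ladder selection, with invented bitrates, thresholds, and a hypothetical motion score, rather than anything Netflix has disclosed.

# Toy illustration of content-aware bitrate allocation: pick a higher top
# rung for high-motion content (sports, wrestling) than for static content
# (a stand-up special). All values are invented for illustration.

LADDERS = {
    "high_motion": [1_800, 3_500, 6_000, 9_000, 14_000],   # kbps, UHD top rung
    "low_motion":  [1_200, 2_400, 4_300, 6_500, 10_000],
}

def pick_ladder(motion_score: float) -> list[int]:
    """motion_score is a hypothetical 0-1 complexity estimate from content analysis."""
    return LADDERS["high_motion"] if motion_score > 0.6 else LADDERS["low_motion"]

print(pick_ladder(0.85))   # WWE-style content -> higher bitrates
print(pick_ladder(0.30))   # talking-heads special -> standard ladder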

Similarly, Netflix’s “battle-hardened” SVoD packager was reused for the live packager simply because it was the “most viable” option, as it assured quick compatibility with Netflix-supported devices without repeating laborious field tests.

Something that was brushed over in part 1 was DRM. In part 2, Netflix says its SVoD packager is fully integrated with the Netflix security backend for encryption key generation, segment encryption, and packager security authentication. The in-house live packager was able to easily use these existing DRM components of the SVoD packager to expedite time to market.
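Netflix does not describe the scheme in any detail, so the sketch below only shows the general shape of a packager-side DRM flow: fetch a content key from a key service, encrypt each segment, and keep the key ID for license requests. The AES-CTR cipher is standard CENC fare; the key service call, identifiers, and everything else here are assumptions, not Netflix’s backend.

# Minimal sketch of a packager-side DRM flow, simplified (real CENC
# encrypts subsamples, not whole blobs). Hypothetical names throughout.

import os
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

def get_content_key(title_id: str) -> tuple[bytes, bytes]:
    """Stand-in for a call to a security backend: returns (key_id, key)."""
    return os.urandom(16), os.urandom(16)

def encrypt_segment(segment: bytes, key: bytes, iv: bytes) -> bytes:
    """Encrypt a segment payload with AES-128-CTR and a per-segment IV."""
    encryptor = Cipher(algorithms.AES(key), modes.CTR(iv)).encryptor()
    return encryptor.update(segment) + encryptor.finalize()

key_id, key = get_content_key("live-event-123")
iv = os.urandom(16)
protected = encrypt_segment(b"...fMP4 media data...", key, iv)
# key_id is published alongside the content so players know which license to request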

Smarter origins, less hair pulling:

One of Netflix’s few admissions of existing headaches was the static bucket-based origin that it initially relied on (pictured).

Despite establishing dual AWS buckets per region and cross-publishing segments to all of them, Netflix engineers still couldn’t fix latency spikes or throttling during periods of high concurrency.
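Mechanically, that first approach boils down to something like the sketch below, with hypothetical bucket names and regions, and nothing more exotic than a boto3 put_object per copy.

# Sketch of the original "static bucket" origin: every segment is
# cross-published to dual buckets in each region so any edge can pull
# from either copy. Bucket names and key paths are invented.

import boto3

REGION_BUCKETS = {
    "us-east-1": ["live-origin-use1-a", "live-origin-use1-b"],
    "us-west-2": ["live-origin-usw2-a", "live-origin-usw2-b"],
}

def cross_publish(segment_key: str, payload: bytes) -> None:
    for region, buckets in REGION_BUCKETS.items():
        s3 = boto3.client("s3", region_name=region)
        for bucket in buckets:
            s3.put_object(Bucket=bucket, Key=segment_key, Body=payload)

# cross_publish("event123/video/2160p/seg_000123.m4s", segment_bytes)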

A rare admission from Netflix that it did encounter latency problems in the early days of live streaming, though it does not specify which events were impacted before a solution was implemented.

This latency failure prompted a more intelligent live origin service. Netflix added media-aware functionality to choose optimal segment candidates from multiple pipelines. Advanced Time to Live (TTL) cache control and efficient propagation of metadata improved CDN efficiency, and the live origin now integrates with Netflix’s data platform to handle the throughput required for large-scale live events.
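Again, Netflix shares no specifics, but the two ideas are easy to illustrate: score the candidate copies of a segment coming out of redundant pipelines and serve the best one, and hand CDNs short TTLs at the live edge but long TTLs once a segment is immutable. The scoring and TTL values below are invented for illustration.

# Rough sketch of two ideas attributed to the smarter live origin:
# media-aware candidate selection and age-dependent cache TTLs.

from dataclasses import dataclass

@dataclass
class Candidate:
    pipeline: str
    complete: bool        # fully written and verified
    arrival_ms: int       # lateness relative to the segment's deadline

def pick_candidate(candidates: list[Candidate]) -> Candidate:
    """Prefer complete segments, then the one that arrived earliest."""
    usable = [c for c in candidates if c.complete]
    return min(usable or candidates, key=lambda c: c.arrival_ms)

def cache_headers(segment_age_s: int) -> dict:
    """Short TTLs near the live edge, long TTLs once a segment is immutable."""
    ttl = 2 if segment_age_s < 30 else 86400
    return {"Cache-Control": f"public, max-age={ttl}"}

best = pick_candidate([Candidate("pipeline-a", True, 120),
                       Candidate("pipeline-b", True, 45)])
print(best.pipeline, cache_headers(segment_age_s=10))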

Orchestration was also developed in-house. Since the first live comedy special, a cloud pipeline has been automated to configure encoding based on content and production needs, spanning ingest, encoding, packagers, origin, and monitoring. Off-the-shelf tools supposedly “couldn’t scale” for Netflix events, so the company built its own Control Room dashboard, putting orchestration directly in operators’ hands.
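Netflix has not published what its orchestration inputs look like, but a declarative event spec of the following, entirely hypothetical, shape gives a sense of how a single record can fan out into ingest, encoding, packaging, origin, and monitoring configuration.

# Hypothetical event spec an orchestration layer might consume; every
# field name is invented, since Netflix does not publish its schema.

EVENT_SPEC = {
    "title_id": "live-event-123",
    "content_profile": "high_motion",        # drives the encoding ladder
    "regions": ["us-east-1", "us-west-2"],   # dual-region pipelines
    "ingest": {"protocol": "rtp", "redundant_paths": 4, "smpte_2022_7": True},
    "packaging": {"drm": True, "dvr_window_minutes": 120},
    "monitoring": {"dashboard": "control-room", "alert_channel": "live-ops"},
}

def provision(spec: dict) -> list[str]:
    """Expand the spec into the per-region components that need to exist."""
    steps = []
    for region in spec["regions"]:
        steps += [f"{region}: ingest ({spec['ingest']['redundant_paths']} paths)",
                  f"{region}: encoders ({spec['content_profile']} ladder)",
                  f"{region}: packager (drm={spec['packaging']['drm']})",
                  f"{region}: origin + monitoring"]
    return steps

for step in provision(EVENT_SPEC):
    print(step)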

Control Room:

Lastly, the Control Room is where Netflix tried to borrow from traditional TV. Live feeds from venue encoders and cloud encoders populate the dashboard, allowing operators to monitor critical metrics.

Here, manual failovers can redirect traffic between regions in seconds. Lifecycle management of live titles—start, run, end—is also orchestrated in the Control Room.

Netflix notes that since live events don’t have predetermined end times, the end sequence must be triggered manually, exiting viewers from the player and correctly marking DVR windows for later on-demand viewing.
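Using plain HLS mechanics as a stand-in (Netflix does not say what its packaging output actually looks like), part of that end sequence amounts to finalizing the media playlist so players stop polling for new segments and the recorded window becomes an on-demand asset.

# Illustration only, using standard HLS tags rather than anything
# Netflix-specific: mark the playlist as ended so the live title can be
# replayed as VOD.

def finalize_playlist(live_playlist: str) -> str:
    """Mark an HLS media playlist as ended (VOD), per the HLS spec."""
    out = []
    for line in live_playlist.splitlines():
        if not line.strip():
            continue
        if line.startswith("#EXT-X-PLAYLIST-TYPE"):
            out.append("#EXT-X-PLAYLIST-TYPE:VOD")   # no longer a live/event playlist
        else:
            out.append(line)
    if "#EXT-X-ENDLIST" not in out:
        out.append("#EXT-X-ENDLIST")                 # tells players no more segments are coming
    return "\n".join(out) + "\n"

live = "#EXTM3U\n#EXT-X-VERSION:7\n#EXT-X-PLAYLIST-TYPE:EVENT\n#EXTINF:4.0,\nseg_0001.m4s\n"
print(finalize_playlist(live))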

In other words, Netflix has created a bespoke, dual-region live broadcast operation—call it TV in the cloud.

 

The net result of Part 2 in the ‘Behind the Streams’ series is two concrete examples of challenges Netflix faced: static bucket origins and latency handling.

While these are barely a dent in the realities of live streaming chaos, it’s still two more than Part 1.