Low latency streaming loomed large at last year’s IBC but will be even more prominent this time round. This is because there are even more contenders after Apple threw a new hat into the ring with the surprise launch of Low-Latency HLS at its Worldwide Developer Conference (WWDC) in June 2019. There will be various sessions and demonstrations of low latency streaming at IBC featuring various combinations of standards and technologies but it is a fair bet many delegates will end up even more confused than when they arrived. One problem is that all parties including standards bodies tell different stories from their own point of view, while analysts have often been guilty of failing to paint the whole picture on a single canvas.
A common failing is forgetting that no single technology solves the whole end-to-end latency challenge on its own, because there are three distinct components that are largely independent of each other, remembering that we are talking here about live streaming. These are sometimes called intrinsic latency, network latency and player latency.
Intrinsic latency is really a misnomer because it suggests there is nothing that can be done about it, which is incorrect. It is more that it is upstream at the production level and so may be beyond the reach of an operator, including contribution from location to studio, graphic compositing, commentary and other contributors to this upfront delay. This form of delay is the most resistant to reduction and sticks stubbornly at around 5 seconds with improvements involving complexity and cost in the production workflow. Indeed, given progress in the other two categories, intrinsic latency is likely to become the major cause of end to end delay.
Network latency is the category most often talked about and addressed by methods such as WebRTC, SRT (Secure Reliable Transport) and RIST (Reliable Internet Stream Transport), or before them RTMP (Real Time Messaging Protocol). This in turn has its own intrinsic component determined by the laws of physics, that is the end to end signal propagation time governed in turn by the speed of wave propagation in copper wires, fiber optic cables or over the air, added to the total switching and forwarding time in network components such as routers.
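To give a feel for the size of this physical floor, here is a minimal sketch of the sum described above: propagation time plus per-hop forwarding time. The speed figures are round approximations (light travels at roughly 300,000 km/s in air and about two thirds of that in glass fiber), and the function names, hop count and per-hop delay are illustrative assumptions rather than measurements from any real network.

```python
# Approximate signal speeds in km/s; round illustrative figures only.
SPEED_KM_PER_S = {
    "air": 300_000.0,            # roughly the speed of light
    "optical_fiber": 200_000.0,  # roughly two thirds of that in glass
}


def propagation_delay_ms(distance_km: float, medium: str = "optical_fiber") -> float:
    """One-way propagation delay in milliseconds over the given medium."""
    return distance_km / SPEED_KM_PER_S[medium] * 1000.0


def path_latency_ms(distance_km: float, hops: int, per_hop_ms: float,
                    medium: str = "optical_fiber") -> float:
    """Propagation delay plus an assumed fixed forwarding delay at each hop."""
    return propagation_delay_ms(distance_km, medium) + hops * per_hop_ms


if __name__ == "__main__":
    # Hypothetical 6,000 km fiber route crossing 12 routers at 0.5 ms each.
    print(round(path_latency_ms(6000, hops=12, per_hop_ms=0.5), 1), "ms one way")
```

On these assumptions the irreducible part of network latency is a few tens of milliseconds, small beside the seconds added elsewhere in the chain.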
But network latency also includes time induced by error correction and that is where these low latency protocols kick in. Given that video is distributed in compressed form, error correction is essential to keep lost IP packets within bounds that have no discernible impact on video quality. The impact of dropped IP packets is amplified with compression, especially if they are associated with I-frames because these provide the reference for the frames following in the GOP (Group of Pictures).
Finally, player latency, also known as forward buffer latency, is the delay imposed by the user’s viewing device to buffer against slowdown in packet arrival in the event of network congestion or temporary fault. The more packets held ahead of playback in the forward buffer, the less likely the user is to experience a glitch, usually re-buffering.
While intrinsic latency stands on its own, clearly there is a relationship between network and player latency in so far that if the user’s device receives totally error free video there is no need for it to buffer any. That of course is a theoretical situation and in practice the two are independent given that the user rarely has any influence over the network. Nonetheless, a trade-off could be made within a service if an operator manipulated network latency and buffer length for given users. Network latency could be trimmed at the expense of higher errors which would then in turn increase buffer latency to cater for these in the player.
Apple’s Low Latency HLS to some extent addresses network latency and player latency together but does not operate at the level of transmission protocols such as TCP. However, it has already drawn a lot of flak on both counts, making it all the more surprising Apple failed to follow the line that appeared to have been traced through an earlier agreement with Microsoft over streaming.
We recall that until a few years ago, in the earlier days of streaming, content owners wanting to reach both Apple and Microsoft devices had to encode and store the same audio and video twice. They had to do it once for Apple’s HTTP Live Streaming (HLS) protocol using the .ts format and once for MPEG DASH as it was then called using .mp4 containers.
But then Microsoft and Apple came together with joint support for a new standard called Common Media Application Format (CMAF), which would use fragmented .mp4 containers that could be referenced by both HLS and DASH, signaling a convergence between the two.
But it seems the “not invented here syndrome” still afflicts Apple and so it came out with Low Latency HLS, continuing to plough its own furrow, surprising some streaming commentators and drawing disapproval from some. The main criticism is that it does little to justify the name since its operation would seem to increase both network and player latency, or at any rate maintain the disadvantages of HLS. When discovering new segments for streaming, HLS polls a server, which incurs a time cost, especially if it goes via a CDN, when that request is usually costly. These client/server polls mount up and can add several seconds to the network latency budget.
Then on the player side, clients are supposed to start their playback at least three segments behind the last available segment. So if the segment duration is 6 seconds, this means the forward buffer is 18 seconds, which is clearly an unacceptably high latency. It is not clear at this stage what Apple is doing to rectify this under Low Latency HLS.
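The arithmetic is simple enough to write down. The sketch below just multiplies segment duration by the number of segments the player holds back; the function name is made up for illustration, and the three-segment default is the figure quoted above.

```python
def hls_holdback_s(segment_duration_s: float, segments_behind: int = 3) -> float:
    """Delay behind the live edge imposed by starting N segments back."""
    if segment_duration_s <= 0 or segments_behind < 0:
        raise ValueError("segment duration must be positive and count non-negative")
    return segment_duration_s * segments_behind


if __name__ == "__main__":
    for duration in (6, 4, 2):
        print(f"{duration} s segments: {hls_holdback_s(duration)} s behind live")
```

Shortening segments is the obvious lever, since 2 second segments cut the figure to 6 seconds, but each segment still has to be discovered and requested, so the polling cost described earlier grows as segments shrink.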
To some extent this is a red herring as the real focus of attention is on the error correction under the network latency category, because that is where there is most scope for innovation and improvement. There are only two ways of catering for dropped, lost or corrupted IP packets during transmission: to incorporate redundancy through the addition of extra bits, in the hope that there is enough information to recreate the packet at the receiving end, or to retransmit the packet.
The former has been widely deployed as Forward Error Correction (FEC) in video services and is similar in principle to RAID (Redundant Array of Inexpensive Disks), used to protect against hard disk drive errors. Extra bits are sent, so that imposes a bandwidth overhead and a correspondingly slightly greater delay. But the attraction was that it avoided the need for any packet retransmissions, which at first sight add a lot more latency because it is necessary to wait for a retransmission request and then redelivery, adding a whole round-trip delay. The downside however is that FEC only works well with low packet loss rates, otherwise the amount of extra information that has to be added for redundancy becomes prohibitive. FEC has therefore not been widely adopted for streaming over the internet, where packet loss rates are much higher than, say, over a managed IPTV service.
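The RAID comparison can be made concrete with the simplest possible scheme, a single XOR parity packet per group of equal-length packets. This is a minimal sketch for illustration only; it is not the FEC scheme of any particular streaming standard, and the function names are invented here.

```python
from functools import reduce
from typing import List, Optional


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length packets."""
    return bytes(x ^ y for x, y in zip(a, b))


def make_parity(packets: List[bytes]) -> bytes:
    """XOR every packet in the group together to form one parity packet."""
    return reduce(xor_bytes, packets)


def recover(received: List[Optional[bytes]], parity: bytes) -> List[bytes]:
    """Rebuild at most one lost packet (marked None) from the parity packet."""
    missing = [i for i, p in enumerate(received) if p is None]
    if len(missing) > 1:
        raise ValueError("single parity cannot repair more than one loss per group")
    repaired = list(received)
    if missing:
        present = [p for p in received if p is not None]
        # XOR of the parity with all surviving packets yields the lost one.
        repaired[missing[0]] = reduce(xor_bytes, present, parity)
    return repaired


if __name__ == "__main__":
    group = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]
    parity = make_parity(group)
    damaged = [group[0], None, group[2], group[3]]
    print(recover(damaged, parity) == group)
```

With four media packets per parity packet the overhead is 25 percent, yet the group survives only one loss. Protecting against two or more losses per group needs more redundancy still, which is why the cost climbs so quickly as loss rates rise.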
Again, note we are talking about live, because for on-demand streaming the TCP transport protocol with packet retransmission is perfectly suited to streaming and has been widely adopted by, among others, Netflix and YouTube. Indeed, nearly all of today’s web traffic still enjoys the protection against packet loss incorporated in the TCP/IP protocol stack. It works well when latency is not an issue, because video streaming is achieved with pre-fetching of packets and use of buffering to smooth out video playout. All IP packets are delivered, although under extreme conditions not all get played because some would miss the play-out deadline if they had to be resent multiple times.
But TCP is too inefficient for live streaming, because all packets have to be acknowledged by the receiver so that the sender can retransmit any that have been lost. This scales appallingly for live transmission, because the network becomes flooded with receipt acknowledgements, reducing bandwidth efficiency. The buffering established between every sender and receiver in the workflow, in every router for example, also introduces enormous transmission delays, imposing unacceptable latency.
For this reason, TCP-based protocols such as RTMP are being superseded for live video streaming. There are a number of candidates, foremost being WebRTC, SRT and RIST, all of which we have dissected before. The main point to make here is that all either support FEC or have promised to do so for those situations where that is required, but essentially are based on some form of Automatic Repeat reQuest (ARQ) for packet retransmission. This is because there are no live streaming use cases where FEC is actually superior to ARQ and it only achieves parity for latency and quality when packet loss is low, as indicated on the following chart, courtesy of Haivision.
As is well known, packet retransmission is always better for on-demand video, but is also far superior for live when packet loss is high, or even when moderate, where FEC imposes grotesque bandwidth inefficiencies and significant latency.
The challenge then, and where the protocols compete, lies in making ARQ as much better than TCP as possible while achieving essentially the same goal of insulating the receiver against lost IP packets. There are some differences, notably between RIST and SRT, which increasingly look like emerging as the two principal variants operators will need to support, just as they had to adopt DASH and HLS in many cases.
One difference is that RIST is more wedded to legacy in order to ensure maximum interoperability and uses the established Real Time Transport Control Protocol (RTCP) in association with RTP (Real Time Transport Protocol). This allows ranges of packets to be resent in the event of losses being detected. A critical point here is that to keep the protocol stable, the receiver must be able to differentiate between original packets and retransmitted ones, to ensure that the flow is maintained. If, for example, a retransmitted packet arrives too late, after the packets ahead of it have already emptied from the receive buffer, it must be ignored.
RIST uses the standard SSRC (synchronization source) field in the RTP header to make this differentiation, as recommended by RFC 4588, but with the difference that the least significant bit of that field is set to 0 in original packets and 1 in retransmitted ones. This simple innovation allows the maximum compatibility with non-RIST receivers, which can decide whether to ignore or take account of retransmitted packets after noting the value of that bit.
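A receiver-side check of that bit is short. The sketch below assumes a plain RTP packet with the 12-byte fixed header defined in RFC 3550, where the SSRC occupies bytes 8 to 11 in network byte order. The function names and the `build_rtp` helper are hypothetical, written here only to exercise the check, and are not taken from any RIST implementation.

```python
import struct

RTP_FIXED_HEADER_LEN = 12  # bytes in the fixed RTP header (RFC 3550)


def rtp_ssrc(packet: bytes) -> int:
    """Extract the 32-bit SSRC from bytes 8..11 of the fixed RTP header."""
    if len(packet) < RTP_FIXED_HEADER_LEN:
        raise ValueError("packet shorter than the fixed RTP header")
    (ssrc,) = struct.unpack_from("!I", packet, 8)
    return ssrc


def is_retransmission(packet: bytes) -> bool:
    """Per the convention described above: SSRC LSB 0 = original, 1 = resent."""
    return bool(rtp_ssrc(packet) & 1)


def build_rtp(seq: int, timestamp: int, ssrc: int, payload: bytes = b"") -> bytes:
    """Hypothetical helper: minimal RTP packet (version 2, payload type 33)."""
    return struct.pack("!BBHII", 0x80, 33, seq, timestamp, ssrc) + payload


if __name__ == "__main__":
    original = build_rtp(seq=100, timestamp=90_000, ssrc=0x12345678)
    resent = build_rtp(seq=100, timestamp=90_000, ssrc=0x12345679)
    print(is_retransmission(original), is_retransmission(resent))
```

A receiver that knows nothing of the convention simply sees a second SSRC and can discard it, which is the compatibility benefit described above.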
The other innovation in RIST worth mentioning is support for bonding, allowing a high-bandwidth stream to be sent over multiple low-bandwidth connections and reassembled at the destination. This has two benefits, firstly to increase flexibility and possibly reduce cost of transmission. Secondly it allows error free streaming over poor quality networks by sending duplicate streams over separate links. This can also reduce latency, firstly by avoiding the need to wait for a high bandwidth stream to become available, and secondly by reducing the need for retransmissions, since the redundancy is instead enabled primarily by using duplicate paths.
However, we should acknowledge that SRT embodies more advanced mechanisms for stripping out latency that are not available under RTP or RTCP. These include dynamic retransmissions, where packets are only resent if it can be determined they would arrive within their latency budget, along with the ability to reconstruct the order and timing of packets at the destination to minimize additional processing delay.
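The idea behind deadline-aware retransmission can be sketched in a few lines. This is not SRT's actual algorithm or API; the function names, the safety margin and the millisecond figures are assumptions chosen to illustrate the principle that a resend is only worth requesting if a full round trip still fits before the packet is due for playout.

```python
def should_request_retransmit(now_ms: float, play_deadline_ms: float,
                              rtt_ms: float, margin_ms: float = 0.0) -> bool:
    """True if a resent packet could still arrive before its playout deadline."""
    return now_ms + rtt_ms + margin_ms <= play_deadline_ms


def attempts_that_fit(now_ms: float, play_deadline_ms: float, rtt_ms: float) -> int:
    """How many back-to-back retransmission round trips fit before the deadline."""
    if rtt_ms <= 0:
        raise ValueError("round-trip time must be positive")
    return max(0, int((play_deadline_ms - now_ms) // rtt_ms))


if __name__ == "__main__":
    # Hypothetical link: 40 ms round trip, packet due for playout at t = 120 ms.
    print(should_request_retransmit(now_ms=0, play_deadline_ms=120, rtt_ms=40))
    print(should_request_retransmit(now_ms=100, play_deadline_ms=120, rtt_ms=40))
    print(attempts_that_fit(now_ms=0, play_deadline_ms=120, rtt_ms=40))
```

Skipping resends that cannot arrive in time saves bandwidth exactly when the link is already struggling, and it makes the trade-off explicit: a larger latency budget buys more recovery attempts.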
As Ulrik Rohne, VP Research & Development at Swedish media transport firm Net Insight, pointed out when we compared SRT and RIST in an earlier piece, RIST is optimal for high-bandwidth applications and is better for remote production since all traffic can be converged in the same protected tunnel, minimizing operational complexity. However, SRT works best in many existing scenarios on the distribution side, because a lot of vendors support it, including Microsoft Azure.
As we have said before, we would like to see these two movements converging on a single standard that preserves interoperability while incorporating some of the advanced features of both. In practice though, given the momentum building behind them with development of separate large ecosystems, streamers are likely to have to support both to maximize reach in the foreseeable future.