Workflow problems could hold back object audio

When immersive video and ultra HD are discussed, audio is sometimes almost an afterthought. This perhaps reflects priorities on the consumer electronics side, especially in TV sets, where audio quality actually got worse as screens became thinner and left less room for a decent speaker system. This drove the evolution of sound bars, although increasingly users have needed external speaker arrangements or headphones to enjoy an audio experience in keeping with the higher-quality video.

Even then, audio quality has been constrained by a lack of innovation on that front, until the industry belatedly added audio into the ultra HD equation alongside High Dynamic Range, High Frame Rate, Wide Color Gamut and higher resolution for the video. The first improvement, enshrined for example in the Ultra HD Forum’s Phase Guidelines, was channel-based audio, which does improve the experience, but only really for the small minority of households with multi-speaker arrangements.

Channel-based audio allows a content creator to partition the sound across multiple channels, but the channel count is fixed in advance. With 5.1 sound, for example, the content is distributed with just the metadata needed for playback across five speakers plus a subwoofer. Little additional benefit is gained from having more than five speakers, while with fewer than five the experience degrades. Furthermore, the speakers have to be carefully positioned to get the most from the channel separation.

Object-based systems such as Dolby Atmos were devised to overcome these limitations and to render sound so that any receiving arrangement, whether a sound bar, headphones or any configuration of multiple speakers, benefits as far as possible within its limitations.

Firstly, the separation of sound is much more compelling because it can be matched to the video, so that, for example, a sound can be steered between speakers from left to right as an associated on-screen object, such as a car, travels in that direction across the screen. There is also scope for tailoring the sound to whatever speaker configuration the user has. Furthermore, object-based audio (OBA) opens up new possibilities for personalizing the sound to individual users, with scope, for example, for converting subtitles or even sign language to speech on demand.
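As a concrete illustration of that first point, a renderer can steer an object between two speakers with a pan law driven by the object's on-screen position. The sketch below uses the standard equal-power (sine/cosine) pan law; the function name and the -1..+1 position convention are invented for illustration, not taken from any particular OBA system:

```python
import math

def equal_power_pan(sample, position):
    """Pan a mono sample across a stereo speaker pair.

    position: -1.0 (hard left) .. +1.0 (hard right).
    The equal-power (sine/cosine) law keeps perceived loudness
    roughly constant as the object moves: L**2 + R**2 stays 1.
    """
    angle = (position + 1.0) * math.pi / 4.0  # maps to 0 .. pi/2
    return sample * math.cos(angle), sample * math.sin(angle)

# An object (e.g. the car) crossing the screen left to right:
for pos in (-1.0, 0.0, 1.0):
    left, right = equal_power_pan(1.0, pos)
    print(f"pos={pos:+.1f}  L={left:.3f}  R={right:.3f}")
```

In a channel-based mix this trajectory would be baked into the five (or two) delivered channels; in an OBA delivery only the position metadata travels with the object, and each device applies its own pan law.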

Superficially at least, OBA can also simplify life for broadcasters by separating production from consumption within the audio chain. Because there is no need to cater for specific speaker configurations, as with channel-based audio, OBA content can be produced just once for all platforms.

But this ignores the workflow complexity that personalization could create, potentially entailing large numbers of specific combinations of broadcast streams. There are also the challenges of creating effective object-based audio in the first place, since new metadata is needed to associate positional sound objects with specific locations within a video frame. After all, sound engineers have grown up with long-established recording and mixing techniques for creating perfect sound synchronized with video, but these are optimized only for channel-based audio formats. Capturing sound objects requires new tools and algorithms, and for the most part these do not yet exist. Such tools should be robust and intuitive enough for sound professionals to carry their skills across and adapt to the new audio production processes as easily as possible.
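To give a feel for what such positional metadata might contain, here is a hypothetical per-object record sketched as a Python dataclass. The field names are invented for illustration; real systems standardize comparable parameters (position, gain, timing) in formats such as the ITU-R Audio Definition Model:

```python
from dataclasses import dataclass

@dataclass
class AudioObject:
    """Hypothetical metadata for one sound object in a scene.

    Field names are illustrative only; standardized formats such as
    the Audio Definition Model (ITU-R BS.2076) define their own
    schema for similar position, level and timing parameters.
    """
    object_id: str        # e.g. "car", "dialogue_1"
    azimuth_deg: float    # horizontal angle relative to the listener
    elevation_deg: float  # vertical angle
    distance_m: float     # distance from the reference point
    gain_db: float        # mixing level
    start_s: float        # when the object becomes active
    duration_s: float     # how long it stays active

# A car entering 30 degrees to the left, five metres away:
car = AudioObject("car", azimuth_deg=-30.0, elevation_deg=0.0,
                  distance_m=5.0, gain_db=-3.0,
                  start_s=12.0, duration_s=4.5)
print(car)
```

A receiving renderer would read records like this and map each object onto whatever speakers, sound bar or headphones are actually present.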

At least one or two projects are working on the necessary workflow and production tools, such as Orpheus (Automatic Sound Source Localization For Object-Based Audio Recording), funded by the European Union. This project focuses on the technical problem of acquiring audio-object metadata, which without computer simulation requires microphones to be positioned at specific points within the scene being filmed.

Typically, actors, singers or musical instruments are fitted with spot microphones placed close to them, while a spherical microphone array captures the entire sound scene from a given point of view. The signals recorded by this array then serve as a “bed” comprising the different sound sources, along with ambient effects such as reverberation. This combination gives sound engineers the information they need to mix the signals coherently and produce object-based output.
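The bed-plus-objects idea can be sketched as a trivial renderer: the bed supplies a base signal per output channel, and each object contributes its samples scaled by per-channel gains, which a real renderer would derive from the object's position metadata. All names here are illustrative:

```python
def render(bed, objects):
    """Mix a channel bed with a list of audio objects.

    bed:     list of per-channel sample lists (the ambience/base mix)
    objects: list of (samples, gains) pairs, where gains holds one
             gain per output channel; a real renderer would compute
             these gains from the object's position metadata
    """
    out = [channel[:] for channel in bed]  # start from the bed
    for samples, gains in objects:
        for ch, gain in enumerate(gains):
            for i, sample in enumerate(samples):
                out[ch][i] += gain * sample
    return out

# Two-channel bed of silence plus one object panned mostly right:
bed = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
mixed = render(bed, [([1.0, 0.5, 0.25], (0.2, 0.8))])
print(mixed)
```

Because the mix happens at playback, the same bed and objects can feed a stereo pair, a 5.1 rig or headphones, each with its own gain computation.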

For example, knowing where a source is located relative to the microphone array, and combining that with the propagation delay between the signal recorded by a spot microphone and the one recorded by the array, engineers can tag each sound object with the correct locational metadata. This metadata would comprise the source position and the delay parameters associated with the signals.
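A minimal sketch of the delay part, assuming a brute-force cross-correlation (production tools would use more robust estimators such as FFT-based GCC-PHAT): slide the spot-microphone signal along the array recording, find the lag that maximizes their correlation, then convert that lag to seconds and, via the speed of sound, to a distance:

```python
SPEED_OF_SOUND = 343.0  # m/s in air at roughly 20 °C

def estimate_delay(spot, array, sample_rate):
    """Estimate, in seconds, how much later the array microphone
    hears the source than the spot microphone, by brute-force
    cross-correlation over all candidate lags."""
    best_lag, best_score = 0, float("-inf")
    for lag in range(len(array) - len(spot) + 1):
        score = sum(s * array[lag + i] for i, s in enumerate(spot))
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag / sample_rate

# Toy example: a click at the spot mic, and the same click arriving
# 48 samples later at the array (1 ms at 48 kHz, i.e. about 0.34 m):
rate = 48_000
spot = [0.0, 1.0, 0.5, 0.0]
array = [0.0] * 48 + spot + [0.0] * 10
delay = estimate_delay(spot, array, rate)
print(delay, delay * SPEED_OF_SOUND)  # delay in s, distance in m
```

The recovered delay, multiplied by the speed of sound, gives the extra path length from source to array, which is exactly the kind of parameter the positional metadata would carry.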

While this example shows the possibilities, it also highlights the complexities involved. There is hope, though, that this process of metadata creation can be automated, just as is happening on the video side.