Audio watermarks gain ground for video audience measurement

Audio fingerprinting has been widely employed in content recognition for some years, especially in the mobile advertising sector, but audio watermarking has been much slower to gain traction. Its use cases were less clear, deployment cost more, and it was harder to exploit the technology to derive meaningful insights.

That has been gradually changing as the big audience measurement groups, particularly Nielsen and WPP's Kantar Media, have recognized the technology's longer-term potential and incorporated it into their core platforms. The latest endorsement comes from transcoding vendor Hybrik, which has just announced that its cloud-based media processing now supports embedding Nielsen audio watermarks in on-demand content.

The growing interest in audio watermarking can be explained by its suitability for tracking entertainment and advertising content within multiscreen, multiplatform services. Video watermarking and audio fingerprinting might look like alternatives, but both are fundamentally unsuited to this role. Video watermarking requires more computation to identify marks and is overkill for content recognition; it is better suited to forensic applications that trace the origin of a particular content stream or instance, where the priority is ensuring marks cannot be removed.

Audio fingerprinting would seem more relevant, and it is certainly mature, having been widely used in a range of advertising scenarios involving automatic recognition, smartphone apps, smart TVs and broadcast monitoring. It is also the technique behind the popular song-identification app Shazam.

One attraction of incorporating audio-based recognition of some sort into an app is that it allows advertisers to build anonymous profiles of an individual user's entertainment tastes, whereas established audience measurement systems are generally tied to a household's aggregate entertainment consumption. But the big drawback is that the app is at the mercy of the user and can only listen for audio content while it is open. It is technically possible to let an app listen in background mode, but on Apple iOS this triggers a red warning bar that makes the listening obvious and obtrusive, while even on Android devices users tend to notice the drain on the battery.

For broadcast monitoring, the main application is identifying when ads are airing, as well as enriching household-level viewing information derived from set-top box return path data. But the latter is still a very inexact science, because it requires merging ad airing data with return path data to build profiles of ad exposure. It provides an alternative to the established ratings currencies, but the results are not necessarily any better.

Audio fingerprinting also depends on an Internet connection to a large remote database to identify the content. To keep lookup overhead and latency acceptable, the system compares small snapshots of audio, the fingerprints, rather than long stretches, which brings a risk of false positives because matching is a statistical process. False positives can be reduced, but only at the expense of more false negatives. Above all, though, fingerprinting of any kind cannot identify individual streams or instances, because no identifying data is added to the content.
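To make the false positive/false negative trade-off concrete, here is a minimal Python sketch of threshold-based fingerprint lookup. It assumes 32-bit binary sub-fingerprints compared by Hamming distance, a common textbook scheme rather than the method of any particular vendor; the fingerprint width, noise model, database and all function names are made up for illustration. A query counts as a match when its nearest reference lies within `threshold` differing bits.

```python
import random
from typing import Dict, Optional, Tuple

FP_BITS = 32  # assumed sub-fingerprint width; real systems differ

def hamming(a: int, b: int) -> int:
    """Number of bits that differ between two fingerprints."""
    return bin(a ^ b).count("1")

def lookup(query: int, database: Dict[int, str], threshold: int) -> Optional[str]:
    """Return the content ID of the nearest reference within `threshold` bits, else None."""
    best_id, best_dist = None, threshold + 1
    for ref, content_id in database.items():
        dist = hamming(query, ref)
        if dist < best_dist:
            best_id, best_dist = content_id, dist
    return best_id

def add_noise(fp: int, flip_prob: float, rng: random.Random) -> int:
    """Simulate capture noise (room acoustics, compression) by flipping bits at random."""
    for pos in range(FP_BITS):
        if rng.random() < flip_prob:
            fp ^= 1 << pos
    return fp

def error_rates(threshold: int, trials: int = 1000, flip_prob: float = 0.15,
                seed: int = 1) -> Tuple[float, float]:
    """Estimate (false negative rate, false positive rate) for a given match threshold."""
    rng = random.Random(seed)
    database = {rng.getrandbits(FP_BITS): f"track-{i}" for i in range(100)}
    references = list(database.items())
    false_neg = false_pos = 0
    for _ in range(trials):
        # genuine query: a noisy capture of audio that IS in the database
        ref, content_id = rng.choice(references)
        if lookup(add_noise(ref, flip_prob, rng), database, threshold) != content_id:
            false_neg += 1  # missed or misidentified
        # impostor query: a random fingerprint stands in for audio NOT in the database
        if lookup(rng.getrandbits(FP_BITS), database, threshold) is not None:
            false_pos += 1
    return false_neg / trials, false_pos / trials

for t in (2, 5, 8):
    fn, fp = error_rates(t)
    print(f"threshold={t}: false negatives {fn:.1%}, false positives {fp:.1%}")
```

Because the same seeded trials are replayed at each threshold, loosening the threshold can only turn misses into matches: noisy genuine captures are found more often (fewer false negatives), while unrelated audio is more often matched to something (more false positives).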

Meanwhile, interest in audio watermarking had been smoldering at a low level for several decades, rooted in frustration among some radio station operators over the inaccuracy of their audience panels, which relied either on handwritten diaries kept by listeners or on wired meters. This led a company called Arbitron (later acquired by Nielsen) to develop its Portable People Meter (PPM), which incorporated an early version of audio watermarking in 1988. The PPM was designed to be worn like a pager and to record what people were listening to by recognizing tones encoded subliminally at the broadcast end. In effect, the term audio watermarking is something of an oxymoron, since the mark is designed to be inaudible.

Originally the PPM's recharging base station was also connected to a telephone line to send back data. This was later replaced by an on-device cellular link, coupled with a motion detector that switched the meter off after 30 minutes without being worn, to preserve battery life. Nielsen eventually acquired Arbitron in September 2013, folding it into its audio division, for the seemingly exorbitant price of $1.3 billion.

However, Nielsen has made hay with the technology since then, creating a software development kit (SDK) that enables census-based measurement of more than 2,500 station streams, as opposed to the PPM's panel-based approach. Nielsen calls the underlying watermarking technology CBET (critical band encoding technology), and it now has an enhanced version that is more robust against factors such as noise from both the transmission channel and the surrounding environment during mark recognition. In fact, making the PPM codes more detectable increased reported radio listening by 15%, confirming what many had long suspected: that the old system had always been underreporting listening levels.

Nielsen now believes that smartphones are a better host for the PPM technology, or at least would be if users tolerate keeping the app running. It is currently testing smartphones as replacements for PPMs in the US and Canada, having put some development effort into minimizing battery drain. Battery drain is of course the bugbear of all apps, but particularly of discretionary ones that give the user no direct benefit. Perhaps Nielsen will have to give away smartphones instead of PPMs, or at least offer some incentive.

Nielsen has also been focusing increasingly on video applications of audio watermarking, as the Hybrik deal shows. It will enable customers of Hybrik's encoding pipeline, hosted on Amazon Web Services (AWS), to track content usage with Nielsen's Total Audience Measurement System.

This CBET watermarking technology is already being used by content owners to track audience engagement via Nielsen. What is new in the Hybrik deployment is that the watermarking is done in the cloud rather than on the content owner's premises. As Hybrik CEO David Trescot noted, merging watermarking into the transcoding process makes sense now that so much video is on demand, often with no broadcaster involved. With content providers delivering straight into the cloud, that is the best place to insert the marks alongside the encoding.

Trescot identified larger content creators such as Disney and Fox, which already spend substantial sums with Nielsen tracking how audiences consume their content, as potential customers. They now want to streamline the embedding of Nielsen watermarks to make the process more efficient.

Audio watermarking is used rather than fingerprinting because accurate measurement and accounting require identifying the individual ads associated with programs, whereas fingerprints can only identify the generic content. In this case a code is embedded as a silent signature in the audio using Nielsen's technology, but the embedding takes place during the transcode on Hybrik's platform. All the information about that code, for example that it identifies a particular show on a specific VoD service, is handled at either end by the content creator and Nielsen.
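For readers unfamiliar with how a code can ride silently on programme audio, the Python sketch below uses generic spread-spectrum watermarking: each payload bit is spread over thousands of samples as a low-level pseudo-random sequence derived from a secret key, and the detector recovers the bit by correlating against the same sequence. This is a textbook technique chosen for illustration, not Nielsen's CBET, whose details are proprietary; the sample rate, block length, fixed embedding strength, 16-bit payload, test signal and function names are all assumptions. A production encoder would shape the mark with a psychoacoustic model so that it stays masked by the programme audio.

```python
import math
import random
from typing import List

SAMPLE_RATE = 8000     # assumed sample rate in Hz
CHIPS_PER_BIT = 8000   # samples each payload bit is spread over (assumed)
STRENGTH = 0.02        # fixed mark amplitude; real encoders use psychoacoustic masking
CODE_BITS = 16         # hypothetical payload width

def pn_sequence(key: int, length: int) -> List[float]:
    """Pseudo-random +/-1 chip sequence; embedder and detector share the key."""
    rng = random.Random(key)
    return [1.0 if rng.random() < 0.5 else -1.0 for _ in range(length)]

def embed(audio: List[float], code: int, key: int) -> List[float]:
    """Add one bit per block: +PN for a 1, -PN for a 0, scaled to STRENGTH."""
    pn = pn_sequence(key, CHIPS_PER_BIT)
    marked = list(audio)
    for i in range(CODE_BITS):
        sign = 1.0 if (code >> i) & 1 else -1.0
        start = i * CHIPS_PER_BIT
        for j in range(CHIPS_PER_BIT):
            marked[start + j] += STRENGTH * sign * pn[j]
    return marked

def detect(audio: List[float], key: int) -> int:
    """Recover each bit from the sign of the block's correlation with the PN sequence."""
    pn = pn_sequence(key, CHIPS_PER_BIT)
    code = 0
    for i in range(CODE_BITS):
        start = i * CHIPS_PER_BIT
        corr = sum(audio[start + j] * pn[j] for j in range(CHIPS_PER_BIT))
        if corr > 0:
            code |= 1 << i
    return code

def programme_audio(n: int, seed: int = 7) -> List[float]:
    """Stand-in host signal: a 440 Hz tone plus background noise."""
    rng = random.Random(seed)
    return [0.5 * math.sin(2 * math.pi * 440 * t / SAMPLE_RATE) + rng.gauss(0, 0.05)
            for t in range(n)]

host = programme_audio(CODE_BITS * CHIPS_PER_BIT)
marked = embed(host, code=0xBEEF, key=1234)
# simulate noise picked up between the loudspeaker and the meter's microphone
channel = random.Random(99)
received = [s + channel.gauss(0, 0.05) for s in marked]
print(hex(detect(received, key=1234)))
```

Because the host audio looks like random noise to the detector, spreading each bit over many samples is what lets a mark well below the programme level survive added noise. The trade-off is payload rate: here just 16 bits per 16 seconds of audio at the assumed 8 kHz sample rate.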

So if a brand has one ad that it inserts into 10 different programs during the day for targeted delivery, each of those instances of the ad would carry a different audio watermark. When a Nielsen panel family then plays a program carrying that ad, Nielsen can tell which of the 10 programs the ad came from.
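The per-instance logic can be sketched as a simple code registry in Python. Everything here is hypothetical: the `Placement` record, the `CodeRegistry` class, sequential code allocation and the sample detections are illustrative assumptions, not Nielsen's actual data model. The point is that the watermark itself carries only an opaque number, while its meaning (which ad, in which program, on which service) lives in a lookup table held at either end.

```python
import itertools
from collections import Counter
from dataclasses import dataclass
from typing import Dict, Iterable, Optional

@dataclass(frozen=True)
class Placement:
    """One instance of an ad: the ad, the program it runs in, and the service."""
    ad_id: str
    program_id: str
    service: str

class CodeRegistry:
    """Hypothetical registry issuing a unique watermark code per placement."""

    def __init__(self, first_code: int = 0x1000) -> None:
        self._codes = itertools.count(first_code)
        self._by_code: Dict[int, Placement] = {}

    def assign(self, placement: Placement) -> int:
        code = next(self._codes)
        self._by_code[code] = placement
        return code

    def resolve(self, code: int) -> Optional[Placement]:
        return self._by_code.get(code)

def exposures_by_program(registry: CodeRegistry, detections: Iterable[int]) -> Counter:
    """Count decoded codes per program, ignoring codes the registry never issued."""
    counts: Counter = Counter()
    for code in detections:
        placement = registry.resolve(code)
        if placement is not None:
            counts[placement.program_id] += 1
    return counts

registry = CodeRegistry()
# one ad inserted into 10 programs: each instance gets its own code at transcode time
codes = [registry.assign(Placement("AD-123", f"program-{n}", "vod-service-a"))
         for n in range(10)]
# made-up codes reported back by panel meters; 0xFFFF was never issued
detections = [codes[0], codes[3], codes[3], codes[7], 0xFFFF]
print(exposures_by_program(registry, detections))
```

A fingerprint of the same ad would look identical in all 10 programs, which is why the distinguishing information has to be added to each instance rather than derived from the audio.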

Nielsen has been on a parallel track with Kantar Media, which also leapt into the fray through acquisition, in this case of Civolution's audio watermarking unit in 2014. The resulting technology has, if anything, gained wider attention in the video business, especially in advertising, where the appeal is audio watermarking's ability to link ads with the programs they run in, as in the Hybrik example. Kantar's technology has now been adopted by the Society of Motion Picture and Television Engineers (SMPTE) for its cross-media workflow standard, which is being defined by the 24TB Open Binding of IDs Drafting Group.

The idea is to embed an open-standard audio watermark as a mechanism for associating ads with content, with scope for faster reporting and audience measurement. Again, fingerprinting would not work, because the requirement is to link specific instances of ads and programming in a targeted environment where not all viewers receive the same ad.

The Kantar technology was recommended to SMPTE by Ad-ID, a joint venture of the Association of National Advertisers (ANA) and the American Association of Advertising Agencies (4As), working with EIDR, an industry content registry association, and the Coalition for Innovative Media Measurement (CIMM), whose members represent content owners, advertisers and ad agencies. This came after Ad-ID adopted Kantar Media's audio watermarking technology as its own ID standard, which should ensure widespread adoption in the US given Ad-ID's broad membership across the country's advertising community.

This can be seen as part of a broader worldwide movement towards standard identifiers embedded in content throughout the media ecosystem. It will make cross-media workflows more efficient for all parties, including content owners, operators and advertisers, with faster reporting for ad verification and audience measurement.

At the technology level, all four variants (audio and video watermarking, audio and video fingerprinting) are now enjoying wider take-up in the use cases for which they are best suited, sometimes with more than one interoperating for content protection, tracking and audience measurement.

Our only issue with all of this is that devices worn by panel members that listen in on their entertainment are only an indirect way of counting audiences. The approach has proved fairly accurate in the past, but whenever there is innovation there are doubters who believe a particular program's uptake is higher than reported. The alternative is analytics inside player apps, which report on the behavior of the media player, i.e. what it is playing, whether it buffered, whether the viewer paused it, and so on.

The difference is that player analytics count every instance of playback within a particular video service; what they do not do is apply the same level of certainty to every video service and so gauge audience numbers across different services. An audio standard might make this possible, but only if those audio codes were collected by the playback device itself, sent back to the cloud and counted by some kind of clearing house such as Nielsen. Nielsen, however, continues to believe that counting a limited panel can scale to national populations, a claim that wins varying degrees of agreement depending on the results.
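The panel-versus-census argument comes down to sampling error. The Python sketch below projects a panel share onto a population using the standard normal-approximation confidence interval for a proportion. It assumes an unweighted simple random sample, whereas real ratings currencies apply weighting and stratification, and the panel size, viewer counts and population used are made-up illustrative figures, not Nielsen's. The function name `project_audience` is likewise hypothetical.

```python
import math
from typing import Tuple

def project_audience(panel_viewers: int, panel_size: int, population: int,
                     z: float = 1.96) -> Tuple[float, float, float]:
    """Project a panel share to a population; return (estimate, low, high).

    Uses the normal approximation to the binomial, so it assumes a simple
    random sample and is unreliable when panel_viewers is very small.
    """
    share = panel_viewers / panel_size
    std_err = math.sqrt(share * (1 - share) / panel_size)
    estimate = share * population
    margin = z * std_err * population
    return estimate, max(0.0, estimate - margin), estimate + margin

POPULATION = 120_000_000  # illustrative number of TV homes
PANEL = 40_000            # illustrative panel size

for label, viewers in (("bigger show", 400), ("niche show", 40)):
    est, low, high = project_audience(viewers, PANEL, POPULATION)
    print(f"{label}: {est:,.0f} homes (95% CI {low:,.0f} to {high:,.0f}, "
          f"+/-{(high - est) / est:.0%})")
```

With these made-up figures the relative margin of error is roughly three times wider for the niche show than for the bigger one. The interval depends on the panel size rather than the population, which is the statistical basis for the claim that a panel can scale to national level, but smaller audiences are always measured less precisely, which leaves room for the doubters.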