Netflix aims to personalize AI bias for recommendations

Bias in Machine Learning and AI has become a hot topic across all industries, but it is particularly relevant for video services because it cuts across just about all aspects of the content lifecycle. Sometimes called algorithmic bias, it has relevance in TV for security, metadata management, QoS monitoring, personalization, recommendation and even content production.

Netflix is best known for pioneering the application of AI to content discovery and personalization, and it has also been among the first major players to address bias. The idea emerged that there was nothing wrong with bias so long as everybody gained something from it: the aim was to make the service biased in favor of every subscriber by adapting exquisitely to their needs.

Not surprisingly, this is an aspiration rather than a realization, and the first step was to recognize that bias really lies not with algorithms themselves but with the data they are fed, the way they are trained and how they interact with their environment or users. Algorithms often amplify biases once they are present, partly because of feedback effects resulting from their scale and speed of operation. Small biases that creep in through the data, or as a result of erroneous training, get picked up and ripple through the system. Algorithms can also be distorted by the biases of the users interacting with them.

However, just as AI and Machine Learning can in principle eliminate human error, they can also detect and react to biases, with the potential to avoid or at least reduce them. That has become a major focus of recent AI research, with applicability to pay TV, for example in security monitoring, where the goal is to bring false positives down to acceptable levels. Ironically, automation of monitoring was in some cases having the opposite of the intended effect, increasing rather than reducing the need for human experts to sift through all the false positives.

Dealing with bias can also hamper the performance of AI and Machine Learning algorithms, through the tradeoff between accuracy and accountability, or intelligibility, which is now gaining recognition. Generally, the more powerful and accurate the algorithm, the less able it is to explain its actions to humans. This is sometimes known as the black box effect, whereby the most effective algorithms come to decisions that cannot be explained in simple terms, even when those decisions are more accurate than humans could manage on the same timescale. But if they are bad decisions, it can be hard to unravel them in time to avoid the consequences.

There is plenty of debate even over what the chief causes of bias are, but they are all associated loosely with data and/or the user interface. It is useful to drill down into a few fairly clear categories. The first is interaction bias, where the algorithm picks up on perhaps unconscious or even random behavior during training, as when Google asked participants to draw a shoe for a Machine Learning trial. The majority of participants drew flat-soled shoes, typically men's styles, and the system then rejected a pair with high heels as not belonging to the category.

This is closely related to the second category, latent bias, where the system “learns” incorrectly to associate certain concepts or categories like “chemical engineer” with men because it hasn’t been given any female examples in what is admittedly a male-dominated profession. Latent bias has resulted in some of the more egregious cases of algorithmic bias involving racial, cultural, or sexual stereotyping. This is really the same as selection bias, even though that is sometimes presented as a separate category, where the training data over-represents a particular group, such as white males.

Then there is the broad field of confirmation bias, which is rather different and is particularly relevant for content recommendation. It is the tendency to seek out, favor or recall information that confirms or supports pre-conceived ideas or prejudices, and it can distort Machine Learning systems that are trained through feedback from users. Of course, to some extent people want to be recommended content that conforms to their stated preferences, but confirmation bias can stifle suggestions that are a little out of the box, preventing users from being led towards content they perhaps did not know they liked. This is really the sweet spot for recommendation and for differentiation between service providers.
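The feedback loop at the heart of confirmation bias can be illustrated with a deliberately crude simulation (all numbers and genre names here are hypothetical, not drawn from any real system): a greedy engine recommends whichever genre has been clicked most, the user obligingly clicks the confirming suggestion, and the recommendations collapse onto a single genre.

```python
def simulate_feedback_loop(rounds=50):
    """Sketch of a confirmation-bias feedback loop: a greedy engine always
    recommends the most-clicked genre, and the user reliably accepts the
    confirming suggestion, so one genre crowds out all the others."""
    genres = ["drama", "comedy", "documentary"]
    clicks = {g: 1 for g in genres}            # start with a uniform prior
    for _ in range(rounds):
        # Pure exploitation: recommend whatever has been clicked most so far.
        rec = max(genres, key=lambda g: clicks[g])
        clicks[rec] += 1                       # user accepts the suggestion
    return clicks

# After 50 rounds the first genre has absorbed every click:
# {'drama': 51, 'comedy': 1, 'documentary': 1}
```

A real engine would mix in exploration, but the sketch shows why training purely on confirming feedback narrows suggestions rather than broadening them.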

There is also correlation bias, which is perhaps the only one that could be pinned on the algorithm design rather than human or user interface factors. This again is highly relevant for content recommendation, because the technique known as collaborative filtering came to be widely implemented in recommendation engines after having been deployed by Amazon and others in ecommerce. It is based on the assumption that people are clustered around shared tastes, so that if two individuals both like apples and one of them also likes oranges, then the other is more likely to enjoy oranges than someone drawn at random.

This assumption is only partially correct and merely yields probabilities of association, which work well enough for lists of products displayed on a web page, where only one or two items need to catch the attention and prompt a purchase. In the case of video recommendation, users are more likely to be turned off by overly predictable lists of suggested titles.
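The apples-and-oranges intuition can be sketched in a few lines. The following is a minimal item-based collaborative filter over toy, hypothetical data, not any vendor's actual implementation: items a user has not yet seen are scored by how much their fan bases overlap with those of items the user already likes.

```python
from collections import defaultdict

# Toy user -> liked-items data (hypothetical, for illustration only).
likes = {
    "ann":   {"apples", "oranges"},
    "bob":   {"apples"},
    "carol": {"apples", "oranges", "pears"},
    "dave":  {"pears"},
}

def jaccard(a, b):
    """Similarity of two items, measured as the overlap of their fan bases."""
    return len(a & b) / len(a | b) if a | b else 0.0

def recommend(user, likes, top_n=2):
    """Item-based collaborative filtering sketch: score each unseen item
    by its similarity to the items the user already likes."""
    fans = defaultdict(set)              # item -> set of users who liked it
    for u, items in likes.items():
        for item in items:
            fans[item].add(u)
    seen = likes[user]
    scores = {
        cand: sum(jaccard(fans[cand], fans[item]) for item in seen)
        for cand in fans if cand not in seen
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# Bob likes apples; ann and carol like both apples and oranges,
# so oranges outscore pears for him.
```

Note that such a filter can only ever surface items that co-occur with existing likes, which is the root of the predictability described above.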

The likes of ThinkAnalytics and Jinni, in particular, promoted their recommendation engines as employing more subtle and advanced techniques than collaborative filtering, arguing that as a result they were better able to make compelling and sometimes unexpected recommendations.

Then there is the black box bias already mentioned, but this is not really a separate category, since it is caused by algorithms being unaccountable and therefore disguising other forms of bias that they may also have amplified. Such opacity is particularly unacceptable in applications like health care, where biases must be identified and rooted out as quickly as possible. But because accuracy has to be traded off against transparency, ironically the chance of misdiagnosis might be increased either way, making it vital to strike the optimum balance between the two. That is one factor holding back AI and Machine Learning in healthcare.

Netflix has recognized one other category of bias, which it calls “genre bias”. This is really just a form of correlation bias, resulting from too narrow a definition of content genres. The interesting point here is that collaborative filtering of a sort seems to have made a comeback, with Netflix having found that, provided the metadata is fine-grained enough and the data sets are big enough, it can provide the foundation for even more effective recommendation.

Netflix observed that, despite being considered a leader in recommendation, it was in danger of losing its edge as others caught up, and it also found that customer expectations were moving ahead of its discovery capabilities. Users were spending too long in discovery mode, perusing 40 or 50 titles on average before deciding what to watch, and therefore not long enough actually viewing.

Netflix invested heavily in further data refinement, conducting research into the paths users took between content of radically different genres, from, say, the relatively dark House of Cards to the lighter Unbreakable Kimmy Schmidt.

To do this, Netflix used human experts rather than algorithms to tag content with almost anything they could think of, highlighting how content metadata creation cannot yet be fully automated. Levels of violence, romance and comedy were represented via more detailed and inter-related matrices, culled from 250 million individual profiles and then distilled into 2,000 distinct taste communities distinguished by geography as well as preferences.
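Netflix has not published the exact method behind its taste communities, but the general idea of distilling tagged profiles into a small number of communities can be sketched with a plain k-means pass. Everything below is hypothetical: the tag scores, the tag names and the choice of k-means itself are illustrative assumptions, not Netflix's pipeline.

```python
import random

# Hypothetical per-profile tag scores (violence, romance, comedy), 0-1 scale.
# Netflix's real tags are far richer and human-curated; this is a sketch.
profiles = [
    (0.90, 0.10, 0.10), (0.80, 0.20, 0.00), (0.85, 0.15, 0.10),  # darker tastes
    (0.10, 0.90, 0.30), (0.20, 0.80, 0.40),                      # romance tastes
    (0.10, 0.20, 0.90), (0.00, 0.30, 0.95), (0.15, 0.25, 0.85),  # comedy tastes
]

def kmeans(points, k, iters=20, seed=0):
    """Minimal k-means: distil tagged profiles into k 'taste communities'."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)        # pick k distinct starting points
    for _ in range(iters):
        # Assign each profile to its nearest centroid (squared distance).
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])))
            clusters[i].append(p)
        # Move each centroid to the mean of its members; keep it if empty.
        centroids = [
            tuple(sum(dim) / len(members) for dim in zip(*members))
            if members else centroids[j]
            for j, members in enumerate(clusters)
        ]
    return centroids, clusters
```

At Netflix's scale the same distillation would run over hundreds of millions of profiles and thousands of tags, but the principle of replacing coarse genres with learned communities is the same.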

Netflix cannot claim to have eliminated genre bias, but it has certainly improved its ability to make effective recommendations without drowning users in long lists. It claims to have increased the number of successful recommendations into genres users have not watched before. For example, it found that people could be coaxed by degrees to watch ever darker or more violent movies, which begins to enter the ethical realm of AI and Machine Learning, but that is another matter.

It seems to have worked for Netflix, whose brand has recovered from any blip and has become ever more popular. Above all, it has demonstrated that it is data and process that matter more than algorithms.