Machine Learning and AI bias needs tackling from the ground up

Machine Learning (ML) and AI are the latest concepts to scale Gartner’s hype curve and start penetrating just about every corner of the video ecosystem, including recommendations, security, QoS monitoring, curation, metadata generation, voice-driven user interfaces (UIs) and programmatic advertising. In all these areas they are presented as panaceas that enable intelligence to be automated and scaled up while avoiding various human biases or preconceptions. Definitions vary, but essentially ML embraces algorithms that improve with continued exposure to a given data set or environment without human intervention, while AI includes other techniques that mimic human intelligence in various ways.

Lately the hype has been reined in a bit by reality, as the limitations of current methods have become clear, along with universal constraints over what can be achieved resulting from irreducible errors in data sets arising from sampling and measurement. It seems that bias is not confined to humans but is shared by machines, albeit in a different context. While human bias could be said to be rooted in the algorithms themselves, as they are corrupted by emotion, prejudice or preconception, in the case of machines the fault lies in the data, as has become clear from several high-profile incidents.

A major problem lies in unconscious biases within the data. A simple example is facial recognition, where some systems have proved to be much better at recognizing people of a given gender or ethnic group simply because members of that group dominated the data the machine had been exposed to. There are also subtler biases associated with the meanings a system derives from certain words, which is an issue for voice-driven personal assistants such as Apple’s Siri or Amazon’s Alexa.

Researchers from Microsoft and Boston University in July 2016 demonstrated an algorithm designed to remove bias from language-processing systems, such as stereotypical analogies associating terms like “engineer” with men and others like “public relations” more with women. Such bias might reflect actual distributions of people in those professions, but should have no place in a lexicon defining the words themselves. Bias prevention will become a feature of AI and ML as they continue to penetrate all aspects of computation in society, and will be most effective if incorporated from the ground up in systems design. The design will also have to be flexible enough to allow for tuning during operation to eliminate any biases that emerge.
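The core idea behind that kind of debiasing can be sketched in a few lines: find a direction in the word-embedding space that captures the bias (here, gender) and remove each occupation word’s component along it. The toy 3-dimensional vectors below are invented for illustration; real embeddings have hundreds of dimensions and the bias subspace is learned from many word pairs.

```python
# Sketch of embedding debiasing by projection (toy, hypothetical vectors).
# Idea: compute a "gender direction" from a word pair, then neutralize an
# occupation word by subtracting its projection onto that direction, leaving
# its other semantic content intact.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def scale(u, k):
    return [a * k for a in u]

def sub(u, v):
    return [a - b for a, b in zip(u, v)]

# Hypothetical 3-d embeddings for illustration only.
he = [1.0, 0.2, 0.0]
she = [-1.0, 0.2, 0.0]
engineer = [0.6, 0.8, 0.3]  # leans toward "he" along the first axis

gender = sub(he, she)  # the bias direction

def neutralize(word_vec, direction):
    # Remove the component of word_vec lying along direction.
    coeff = dot(word_vec, direction) / dot(direction, direction)
    return sub(word_vec, scale(direction, coeff))

debiased = neutralize(engineer, gender)
# After neutralization the word has (near) zero projection on the
# gender direction, so it sits equidistant from "he" and "she" along it.
print(round(dot(debiased, gender), 6))  # → 0.0
```

The non-gender components (the second and third coordinates here) survive untouched, which is the point: the word keeps its meaning while losing the stereotyped association.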

This is already familiar to earlier adopters of ML for content recommendations, where bias can creep in easily. Netflix was one of the first to tackle it, after deciding that ML could give it a competitive edge over rivals that were still confined to collaborative filtering for recommendation.

The latter is based on the assumption that if two people both like A and B, and the first is known to like C, then it is worth recommending C to the second person. There is scope for greater sophistication than that by matching people on the basis of profile groups and incorporating different components such as opinions, as well as factors like time of day and device type. The matching can also be fuzzy, so that people can be given a recommendation even when their profile does not exactly match that of another person who definitely does like that content. Even so, collaborative filtering is a blunt tool and suffers from various problems, such as the cold-start syndrome, whereby recommendations cannot work well until a new user’s profile has been built up. Until then the new user will be issued recommendations almost at random (or derived from social media) to build up a portfolio of preferences.
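The assumption above can be sketched as a minimal user-based collaborative filter: score unseen titles by the similarity of the users who liked them. The viewing data here is made up, and a production system would use far richer similarity measures.

```python
# Minimal user-based collaborative filtering sketch with invented data.
# If Ann and Bob both liked A and B, and Ann also liked C, then C becomes
# a candidate recommendation for Bob.

likes = {
    "ann": {"A", "B", "C"},
    "bob": {"A", "B"},
    "carol": {"D"},
}

def similarity(u, v):
    # Jaccard similarity between two users' liked sets.
    inter = len(likes[u] & likes[v])
    union = len(likes[u] | likes[v])
    return inter / union if union else 0.0

def recommend(user):
    # Score titles the user has not seen by the similarity of their fans.
    scores = {}
    for other in likes:
        if other == user:
            continue
        sim = similarity(user, other)
        for title in likes[other] - likes[user]:
            scores[title] = scores.get(title, 0.0) + sim
    return sorted(scores, key=scores.get, reverse=True)

print(recommend("bob"))  # → ['C', 'D']
```

Note that "carol", who overlaps with no one, contributes a score of zero: a brand-new user would likewise match nobody, which is exactly the cold-start problem described above.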

Netflix decided early on to apply ML instead, at a time when constructing the data sets was more time-consuming than it is now with the help of automation tools. It divided the recommendations problem into three components: users and their preferences, metadata describing content, and the ML algorithms themselves. Netflix now has over 100 million subscribers worldwide, many of them homes with multiple viewers, so there are about 250 million active user profiles. Associated with each profile is data about what people watch, for how long, what they watched before and what they go on to do next. All of this could be catered for by collaborative filtering, but ML allowed Netflix to go to much deeper levels of profiling by assigning weights to large numbers of micro-events, such as a person watching 10 minutes of some content before abandoning it, or binging through a given series on two consecutive nights. ML could build up weighted profiles that allowed users to be assigned to multiple overlapping taste groups, which turned out to be more predictive of future viewing choices than collaborative filtering or any other conventional method. Traditional recommendation companies would argue that their own technologies have moved a mile away from straightforward collaborative filtering as well, but that’s another story.
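The weighting of micro-events into overlapping taste groups can be sketched roughly as follows. The event names and weights here are invented for illustration; Netflix’s actual features and weights are not public.

```python
# Hypothetical sketch of turning weighted micro-events into a taste profile.
# Event types and their weights are invented for illustration.

EVENT_WEIGHTS = {
    "finished_episode": 1.0,
    "binged_series": 3.0,           # strong positive signal
    "abandoned_after_10min": -0.5,  # weak negative signal
}

def profile_scores(events):
    # events: list of (event_type, taste_group) pairs for one viewer.
    # Returns a weighted affinity score per taste group; a viewer can
    # score highly in several overlapping groups at once.
    scores = {}
    for event, group in events:
        scores[group] = scores.get(group, 0.0) + EVENT_WEIGHTS[event]
    return scores

history = [
    ("binged_series", "crime-drama"),
    ("finished_episode", "crime-drama"),
    ("abandoned_after_10min", "rom-com"),
]
print(profile_scores(history))  # → {'crime-drama': 4.0, 'rom-com': -0.5}
```

The point of the sketch is the shape of the data, not the numbers: abandonment counts against a taste group, binging counts strongly for it, and the result is a profile of graded affinities rather than a single cluster assignment.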

At Netflix this was a labor-intensive exercise, because the content itself had to be indexed with metadata to a much greater level of detail than had been done before in order to generate a sufficiently rich content data set for the ML algorithms to feed on. A team of around 100 in-house and freelance staff works full time watching shows or movies and tagging them with data spanning a wide range of attributes, such as genre, nature of cast, individual actors and others involved in production, and even how cerebral the material is deemed to be.

Netflix also invested heavily in the ML algorithms themselves right from the early days of online subscription VoD in 2006, when it offered a $1 million prize to crowdsource an ML- or AI-based movie recommendation algorithm capable of improving prediction accuracy by 10% over the existing system. The prize was finally awarded in 2009, and Netflix then built on that system, called Feature-Weighted Linear Stacking, incorporating other ML-based predictive models that combined data on popularity, predicted rating, correlation with users of similar interests, past ratings, and perceived gender and age to produce final recommendations.
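The distinctive idea in Feature-Weighted Linear Stacking is that the blending weights for the base predictors are not fixed constants, but are themselves linear functions of "meta-features" (for example, how many ratings a user has). A minimal sketch follows; the coefficients and meta-features here are invented for illustration, not the prize-winning values.

```python
# Sketch of Feature-Weighted Linear Stacking (FWLS). Each base model's
# blending weight is a linear combination of meta-features, so the blend
# can, e.g., trust one model more for users with few ratings.
# All numbers below are invented for illustration.

def fwls_predict(base_preds, meta_features, coeffs):
    # base_preds: predictions g_i(x) from the base models
    # meta_features: meta-feature values f_j(x) for this user/item
    # coeffs[i][j]: learned coefficient v_ij
    # Blend: sum_i ( sum_j v_ij * f_j(x) ) * g_i(x)
    total = 0.0
    for i, g in enumerate(base_preds):
        weight = sum(coeffs[i][j] * f for j, f in enumerate(meta_features))
        total += weight * g
    return total

# Two base models; two meta-features: a constant 1.0 and a normalized
# count of how many ratings the user has supplied (hypothetical).
base_preds = [3.8, 4.2]
meta = [1.0, 0.5]
coeffs = [[0.6, -0.2], [0.4, 0.2]]
print(round(fwls_predict(base_preds, meta, coeffs), 2))  # → 4.0
```

With fixed weights this would collapse to ordinary linear stacking; making the weights depend on meta-features is what lets the blend adapt per user and per item.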

This has certainly been successful, given that about 75% of all content viewed on Netflix now comes from recommendations rather than searches or other means. But as the ML model grew in sophistication, Netflix did run increasingly into ML bias resulting from various factors, starting with obvious ones such as certain words describing content, such as “gritty drama”, not translating exactly from, say, English into French.

Netflix still uses essentially the same set of tags to describe content all over the globe, but to counter bias it has refined the model by separating out a sub-set of those tags that are most critical for the user interface and most sensitive to differences between countries, languages and cultures. That sub-set is localized accordingly, although for obvious reasons the system does not deliberately distinguish individual users by ethnicity, even though that itself could be a predictive factor in recommendation.

Generation of content metadata is now being automated to reduce the cost involved, but that can introduce new biases. This process can itself be ML-based, with video technology firm Piksel among those applying it in a system that learns how to parse content to recognize generic objects such as cats, specific people such as individual actors, and also attributes derived from the soundtrack through natural language processing. Object recognition can introduce bias, as Google discovered in an extreme case in 2015, when its newly introduced photo app categorized a black person as a gorilla, simply because during training it had had little exposure to images of either. ML from natural language is well known to be subject to unconscious gender, ethnic or other bias implicit in the way people use words, as a famous study published in the journal Science in April 2017 showed. The study demonstrated how AI and ML algorithms “absorbed stereotyped biases”.
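The kind of measurement used in that study can be sketched simply: compare how close a target word’s embedding sits to one attribute word versus another. The two-dimensional vectors below are invented stand-ins; the study used real word embeddings trained on large web corpora.

```python
# Rough sketch of a word-association test on embeddings, in the spirit of
# the 2017 study: a word "absorbs" bias if its vector sits systematically
# closer to one attribute than another. Vectors here are tiny and invented.

import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(a * a for a in v))
    return dot / (norm_u * norm_v)

vecs = {
    "flower": [0.9, 0.1],
    "insect": [0.1, 0.9],
    "pleasant": [1.0, 0.0],
    "unpleasant": [0.0, 1.0],
}

def association(word):
    # Positive → closer to "pleasant"; negative → closer to "unpleasant".
    return cosine(vecs[word], vecs["pleasant"]) - cosine(vecs[word], vecs["unpleasant"])

print(association("flower") > 0)  # → True
print(association("insect") > 0)  # → False
```

Run over many target and attribute sets, differential scores like this are how stereotyped associations, benign (flowers/insects) and harmful (names/professions), are detected in trained embeddings.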

It is of course essential that such biases be eliminated as far as possible, for social and commercial reasons as well as to establish public confidence and credibility in the methods. But at a mathematical level all biases behave alike and can to some extent be tuned out. First, there is irreducible bias inherent in the data measurement process, which is simply the accuracy of the data itself. The ML implementation can do nothing about that, and the only recourse is to improve the measurement, although this bias can never be reduced to zero. In audience measurement it arises more from incomplete measurement, to the extent that not all of a user’s activities are known, leading to some irreducible bias in the profile.

The system can trade off average bias against variance of prediction. At one extreme a system may make predictions that are very consistent but all wide of the mark. At the other extreme a system may make highly inconsistent predictions whose average is accurate but few of which are individually on the mark. Obviously neither extreme is ideal, so one aspect of ML tuning lies in reaching an optimum compromise where neither variance nor average bias is too great.
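The trade-off can be made concrete with a toy simulation: estimate a true value of 10.0 from noisy samples two ways, once with the raw sample mean (unbiased but noisier) and once with the mean shrunk toward zero (more consistent but systematically off). All numbers below are synthetic.

```python
# Toy illustration of the bias-variance trade-off with synthetic data.
# Shrinking an unbiased estimate toward zero reduces its variance
# (by the square of the shrink factor) at the cost of introducing bias.

import random

random.seed(42)
TRUE_VALUE = 10.0

def sample_mean(n=5):
    # Mean of n noisy measurements of the true value.
    return sum(random.gauss(TRUE_VALUE, 3.0) for _ in range(n)) / n

def run(estimator, trials=2000):
    # Empirical average bias and variance over many repeated estimates.
    preds = [estimator() for _ in range(trials)]
    mean = sum(preds) / trials
    bias = mean - TRUE_VALUE
    var = sum((p - mean) ** 2 for p in preds) / trials
    return bias, var

bias_raw, var_raw = run(sample_mean)
bias_shrunk, var_shrunk = run(lambda: 0.8 * sample_mean())

print(abs(bias_shrunk) > abs(bias_raw))  # shrunk estimator is more biased
print(var_shrunk < var_raw)              # but its predictions vary less
```

Neither estimator dominates: which one yields lower overall error depends on how the squared bias and the variance add up, which is exactly the compromise the tuning described above has to strike.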

ML is also being deployed increasingly for longer-term prediction, looking up to a year ahead, where the aim may be to predict what sort of content to acquire or create. Measurement and analytics group Nielsen has been applying ML to the audience data it collects to predict future TV ratings at different levels, including individual content, hourly slots and whole channels, with varying success in each case. To some extent ML has been used here to help eliminate bias by cross-validating predictions against a separate data set and repeating the process many times, reducing the tendency to “overfit”, that is, to take too much account of outlying data more likely to represent “noise” than actual trends.
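That repeated hold-out-and-validate procedure is k-fold cross-validation, sketched minimally below with a toy "model" that just predicts the training mean; the data and scoring function are invented for illustration.

```python
# Minimal k-fold cross-validation sketch. Each fold is held out in turn,
# the model is fitted on the rest, and scored on the held-out part, so a
# model is never evaluated on data it was trained on.

def k_fold_indices(n, k):
    # Split indices 0..n-1 into k contiguous folds (last fold takes the rest).
    fold_size = n // k
    folds = []
    for i in range(k):
        start = i * fold_size
        end = start + fold_size if i < k - 1 else n
        folds.append(list(range(start, end)))
    return folds

def cross_validate(data, k, fit, score):
    errors = []
    for fold in k_fold_indices(len(data), k):
        held_out = [data[i] for i in fold]
        train = [x for i, x in enumerate(data) if i not in set(fold)]
        model = fit(train)
        errors.append(score(model, held_out))
    return sum(errors) / k  # average held-out error across folds

# Toy model: "fit" returns the training mean; "score" is mean squared error.
data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
fit = lambda train: sum(train) / len(train)
score = lambda m, test: sum((x - m) ** 2 for x in test) / len(test)
print(round(cross_validate(data, 3, fit, score), 3))  # → 6.25
```

Because every prediction is scored on data the model never saw, a model that has merely memorized outliers is penalized, which is how the procedure curbs overfitting.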

Nielsen was also careful in its use of subjective data which can be valuable but also biased, such as the instinctive views of experienced programming executives over what might be popular during a coming season.

It all then comes down to the data, which is likely to remain a major challenge as ML and AI continue their inexorable advance, not just in TV but almost all industrial and commercial sectors.