Google’s machine-learning research team has unveiled MultiModel – a new neural network architecture that can handle multiple domains (image recognition, language translation, speech recognition, and so on) simultaneously. Google describes it as a first step towards the convergence of vision, audio, and language inside a single neural network – a notable achievement for an industry on the brink of transforming computing.
The team notes that in the last decade, the capabilities of Deep Learning have progressed at an astonishing rate. However, they add that “the current state of the field is that the neural network architectures are highly specialized to specific domains of application. An important question remains unanswered: will a convergence between these domains facilitate a unified model capable of performing well across multiple domains?”
MultiModel promises to unite machine-learning techniques inside a single network, accommodating their highly specialized functions as part of a greater whole. This would include the ML function for translating the spoken word to written text, along with the machine-vision system that can identify written text in images, so that a single system could, for example, convert a French road sign into spoken Spanish.
As more neural networks are trained to carry out specialized tasks, they could be brought inside the MultiModel architecture – expanding its capabilities. Under the constraints of current technology, an image recognition neural network can’t simply be repurposed for a speech recognition task – a network trained to do both would get less accurate at each. As such, tackling more complex tasks requires a means of combining multiple highly specialized networks.
There’s a trade-off to be made, according to the results. MultiModel scored well in its tests, but worse than the most specialized tools – scoring around 86% accuracy, compared to the 95% accuracy made possible by single-function neural networks. That 86% is apparently on par with the best algorithms from five years ago, and it is expected that accuracy will improve.
However, MultiModel was able to boost the performance of its sub-networks by training on unrelated tasks – its ability to parse sentence grammar improved, for instance, when it was trained alongside an image database. This is counterintuitive, but it appears to show that the system can use unrelated data to improve its overall performance – which means the system needs less training data to get started.
The paper’s abstract adds that “our model architecture incorporates building blocks from multiple domains. It contains convolutional layers, an attention mechanism, and sparsely-gated layers. Each of these computational blocks is crucial for a subset of the tasks we train on. Interestingly, even if a block is not crucial for a task, we observe that adding it never hurts performance, and in most cases improves it on all tasks. We also show that tasks with less data benefit largely from joint training with other tasks, while performance on large tasks degrades only slightly, if at all.”
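The “sparsely-gated layers” the abstract mentions are a form of mixture-of-experts routing: a gate scores several expert sub-networks and sends each input only to the best-scoring one, so only a fraction of the model runs per example. A minimal toy sketch of that routing idea – everything here is illustrative, not the paper’s implementation:

```python
# Toy sparse mixture-of-experts: two "experts" (stand-ins for sub-networks)
# and a gate that routes each input to exactly one of them (top-1 routing).

experts = [
    lambda x: [v * 2.0 for v in x],   # expert 0: doubles its input
    lambda x: [v + 1.0 for v in x],   # expert 1: shifts its input
]

def gate(x):
    """Toy gate: score each expert, pick the top-1 (sparse routing).
    The scores here are a hypothetical stand-in for learned gate weights."""
    scores = [sum(x), -sum(x)]
    return max(range(len(experts)), key=lambda i: scores[i])

def sparse_moe(x):
    """Run only the expert the gate selects – not the whole model."""
    return experts[gate(x)](x)

assert sparse_moe([1.0, 2.0]) == [2.0, 4.0]    # positive sum → expert 0
assert sparse_moe([-1.0, -2.0]) == [0.0, -1.0]  # negative sum → expert 1
```

The point of the sparsity is cost: adding more experts grows capacity without growing the per-example compute, which is what makes such blocks cheap enough to include even for tasks that don’t strictly need them.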
The Google team noted that this came as a surprise: not only was it possible to achieve good performance while jointly training multiple tasks, but performance actually improved on tasks with limited quantities of data. “This happens even if the tasks come from different domains that would appear to have little in common, e.g. an image recognition task can improve performance on a language task.”
This finding breaks from the industry catchphrase that ‘the best regularizer is more data.’ In the multi-domain approach, data from one sub-network is useful to the others, and it appears this could significantly cut training time – thanks to simultaneous processing, and the need for less total training data, by volume at least.
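The mechanics behind that benefit can be sketched with a toy example: batches from different tasks are interleaved, and every batch updates one shared set of parameters, so a data-poor task still profits from the updates driven by a data-rich one. This is a hypothetical illustration of joint training, not Google’s code:

```python
# Toy joint training: one shared linear model, batches from two tasks
# interleaved at random. Task A is data-rich; task B has only three
# examples. Both are drawn from the same underlying function here,
# to keep the toy honest.
import random

random.seed(0)
shared_w = [0.0, 0.0]   # parameters shared by all tasks
lr = 0.1                # learning rate

def sgd_step(x, target):
    """One gradient step on the shared model (squared error loss)."""
    pred = sum(wi * xi for wi, xi in zip(shared_w, x))
    err = pred - target
    for i in range(len(shared_w)):
        shared_w[i] -= lr * err * x[i]

task_a = [([1.0, 0.0], 2.0), ([0.0, 1.0], 3.0)] * 50   # 100 examples
task_b = [([1.0, 1.0], 5.0)] * 3                       # only 3 examples

batches = task_a + task_b
random.shuffle(batches)          # interleave tasks during training
for x, y in batches:
    sgd_step(x, y)

# The shared parameters fit task B well despite B's tiny dataset,
# because task A's data shaped the same weights.
pred_b = sum(wi * xi for wi, xi in zip(shared_w, [1.0, 1.0]))
assert abs(pred_b - 5.0) < 0.5
```

In MultiModel the shared parameters are a full network body rather than two weights, but the principle is the same: the small task never sees the large task’s data directly, yet benefits from the representation it induces.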
The team says the inspiration behind MultiModel stems from how the human brain processes multiple sensory inputs (sound, vision, taste) and creates a single shared representation that can be expressed through language or actions. To this end, MultiModel uses modality-specific sub-networks for each input type (audio, images, text), which are then united by a shared I/O Mixer module linking the input encoders and decoders.
The encoder/decoder pair transforms data to and from each neural sub-network – each occupied with its own specific modality – and looks to be the special sauce in the mix. In the Google demonstration, the MultiModel system is capable of learning eight simultaneous tasks – “detect objects in images, provide captions, recognize speech, translate between four pairs of languages, and do grammatical constituency parsing.”
It looks like a way to create a model with interchangeable components, so that complex tasks can be performed using a neural network built for purpose – rather than creating a singular AI system, asking it to solve all the world’s problems, and throwing CPU cores at the problem until it goes away.
Google has open-sourced MultiModel as part of its Tensor2Tensor library, which is available on GitHub. The next steps will center on improving its performance, though the team notes that there are a lot of questions left to be answered.
The findings come shortly after Alphabet’s DeepMind subsidiary published its relational reasoning neural network, which has found a way to replicate the intuitive understanding that humans possess – letting DeepMind ask the system complex questions that can’t be answered using current processes, such as “what size is the cylinder that is left of the brown metal thing that is left of the big sphere?”