Microsoft and Intel take on AI platforms with programmable chips

Microsoft has launched Project Brainwave, a new research program aimed at creating a system that can process inputs as soon as it receives them, for ‘real time’ AI tasks. The project sounds like good news for Intel, as Microsoft is relying on Intel’s Stratix 10 FPGA chips to power the platform. If the capabilities can be made to work commercially, Microsoft hopes they will help its Azure cloud operation become a serious cloud-based AI computing provider.

This is important to Microsoft Azure in the enterprise world, but especially the mobile one. AI-driven user experiences and applications, from personal digital assistants to extreme context awareness, are helping to define the next generation of Internet interaction, and increasingly, this takes place primarily on mobile devices. Yet Google and Facebook, with their huge penetration of the current mobile web experience, are in pole position to create the AI-enabled mobile world too.

Microsoft will need some significant differentiation to be noticed outside some particular enterprises where Azure has secured an entrenched position, hence projects like Brainwave.

For Intel, Microsoft’s endorsement is valuable, especially if other cloud computing providers pay attention and follow suit. For years, Intel has been able to count on the continued growth of computing requirements as a growth vehicle for its Xeon CPU sales – but the advent of AI-based applications and their bespoke hardware requirements has severely disrupted that long-term projection.

Suddenly, Intel can’t be so sure of the incessant growth in demand for server CPUs – at least not on the scale it might have been expecting. Due to the advent of tasks that prefer GPUs (graphics processing units) and ASICs to CPUs, Intel might only be selling a few Xeon-powered racks in a cabinet, rather than filling the entire cabinet with Xeons.

The GPU accelerator cards, like Nvidia’s Tesla or AMD’s new Radeon MI25, mean that a server might only need two Xeons and four Teslas to complete a task that would traditionally have required a dozen Xeons. Consequently, Intel has moved to stay ahead of these new approaches, and its FPGA line looks like a promising answer to the question of machine learning compute hardware.

As for the wider IoT, Intel has always counted on being at the heart of the processing of the data gathered by IoT applications and devices at the edge of the network. While processing items like temperatures or meter readings is pretty simple, the emergence of AI-based computation overlaps with the rise of the IoT – and opens doors for new implementations and approaches for processing the vast quantity of data that is going to emerge from our increasingly hyper-connected world.

The Brainwave announcement describes how the need for real time performance is becoming more important as cloud infrastructure processes more live data streams. It notes that Brainwave is going to be using the ‘massive FPGA infrastructure’ that Microsoft has been deploying in the past few years – using the programmability of FPGAs to adapt designs as required to better suit the requirements of the AI function at hand.

This is something that is unique to FPGAs – by writing an algorithm to the hardware itself, Microsoft removes the need for software to act as a middle-man translation layer, which other types of chip require and which slows processing down.

Brainwave consists of three main layers, which are described as “a high performance distributed system architecture; a hardware deep neural network (DNN) engine, synthesized onto FPGAs; and a compiler and runtime for low friction deployment of trained models.”

The second layer is where Intel comes in, with its FPGAs – Stratix models that were originally developed by Altera, before Intel acquired the company for $16.7bn. The chips are being deployed inside Microsoft’s Azure data centers, and act as a resource pool for specific DNNs – which are mapped to the chips, and can then be called by a server without the need for DNN software.

Microsoft says this architecture has very low latency, thanks to the CPU not needing to process incoming requests, with the FPGAs processing requests as fast as they are streamed. The FPGAs are using a ‘soft’ DNN processing unit (DPU) that has been adapted to work on the Stratix 10 FPGA. However, Microsoft notes that a number of companies are working on ‘hardened DPUs’ – dedicated chips for AI tasks.

Microsoft argues that these DPUs are at a disadvantage compared to Brainwave’s reprogrammability. While the DPUs “have high peak performance, they must choose their operators and data types at design time, which limits their flexibility”, it said. So while Microsoft is presenting a cost-saving argument here, it also notes that the FPGAs allow it to incorporate new advances in machine learning software more quickly and more cost-effectively than the DPUs.

In the same vein, the reconfigurability of the approach means that it should be able to support multiple machine learning methodologies. Microsoft says Brainwave’s software stack already supports Google’s TensorFlow, a popular framework, as well as Microsoft’s own Cognitive Toolkit. Google’s Tensor Processing Unit (TPU) is exactly the kind of hardware that Brainwave is taking aim at.

While GPUs are popular now, the industry is likely to embrace a more efficient silicon architecture before long. GPUs are great at adapting to changing requirements, but optimized designs for specific AI tasks will soon emerge – which will lose the general purpose performance of the GPUs, but gain improved performance in very specific tasks.

As hardware accelerators, those optimized chips will be used in conjunction with a coordinating processor – likely an Intel Xeon, given Intel’s dominance in the cloud server market. When used in conjunction with the correct software, and managed by the coordinator, the hardware accelerators will be able to carry out the functions for which they have been specially designed with extreme performance and efficiency.
However, any task deviating from the design’s capabilities will see performance suffer greatly – if the accelerator is even able to carry it out. Consequently, Brainwave is being pitched as a way to better navigate the evolving AI landscape, which is unlikely to settle down any time soon.

As for on-stage benchmarking, Microsoft noted that rival approaches using DNN accelerators often use convolutional neural networks (CNNs) in performance demonstrations – in order to achieve high performance numbers off the back of the CNN’s high compute requirements.

Microsoft argues that this makes the results unrepresentative for more complex models, such as Recurrent Neural Networks (RNNs), including Long Short-Term Memory (LSTM) networks and Gated Recurrent Units (GRUs), which are used for natural language processing.

Microsoft adds that DNN benchmarks often use high levels of data batching in their tests to inflate their apparent performance. Batching is useful for offline training or throughput-based architectures, but is apparently not effective for the real time AI that Brainwave is aimed at. Thanks to its real time design, Brainwave doesn’t need to wait for all the queries in a data batch to complete before moving on to the next task.
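
The latency cost of batching can be illustrated with a toy calculation. The sketch below is not Microsoft’s code: the function names, arrival interval, compute times and batch size are all made-up assumptions, and it models a single worker receiving requests at a fixed interval. It compares the per-request latency when each request is processed on arrival with a design that waits until a batch is full.

```python
def streaming_latencies(n_requests, arrival_gap_ms, compute_ms):
    """Latency of each request when it is processed as soon as it arrives.

    Hypothetical model: one worker; a request waits only if the worker is busy.
    """
    latencies = []
    worker_free_at = 0
    for i in range(n_requests):
        arrival = i * arrival_gap_ms
        start = max(arrival, worker_free_at)
        finish = start + compute_ms
        worker_free_at = finish
        latencies.append(finish - arrival)
    return latencies


def batched_latencies(n_requests, arrival_gap_ms, batch_size, batch_compute_ms):
    """Latency of each request when the worker waits for a full batch.

    Every request in a batch completes only when the whole batch completes.
    """
    latencies = []
    worker_free_at = 0
    for first in range(0, n_requests, batch_size):
        batch = range(first, min(first + batch_size, n_requests))
        last_arrival = batch[-1] * arrival_gap_ms  # batch starts once it is full
        start = max(last_arrival, worker_free_at)
        finish = start + batch_compute_ms
        worker_free_at = finish
        latencies.extend(finish - i * arrival_gap_ms for i in batch)
    return latencies


# Assumed numbers: a request every 10 ms; 2 ms alone, or 4 ms for a batch of 4.
stream = streaming_latencies(n_requests=8, arrival_gap_ms=10, compute_ms=2)
batch = batched_latencies(n_requests=8, arrival_gap_ms=10, batch_size=4,
                          batch_compute_ms=4)
print("streaming:", stream)
print("batched:  ", batch)
```

With these invented numbers the batched design spends less compute time per request (1 ms against 2 ms), which flatters a throughput benchmark, yet the first request in each batch waits 34 ms for its answer while every streamed request is answered in 2 ms.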

This led to an on-stage demo that achieved a sustained throughput of 39.5 teraflops while running a large GRU model – apparently five times the size of ResNet-50, the CNN that Microsoft used back in 2015 to first surpass human performance on the ImageNet dataset, which was a very big deal at the time. The demo didn’t require batching, and used Microsoft’s custom ms-fp8 floating point format – which led to the record-setting results.
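
The bit layout of ms-fp8 is not described here, so the following is only a generic illustration of what squeezing values into an 8-bit floating point format involves. It assumes a hypothetical split of 1 sign bit, 5 exponent bits and 2 mantissa bits with an IEEE-style bias; the function name and every parameter are assumptions, not Microsoft’s format.

```python
import math


def round_to_minifloat(x, exp_bits=5, man_bits=2):
    """Round x to the nearest value an assumed small float format can hold.

    Assumed layout (NOT the real ms-fp8): 1 sign bit, exp_bits exponent bits,
    man_bits mantissa bits, IEEE-style bias and subnormals. Values that are
    too large saturate at the biggest finite value; inf/NaN are not handled.
    """
    if x == 0.0:
        return 0.0
    bias = 2 ** (exp_bits - 1) - 1
    min_exp = 1 - bias                    # exponent of the smallest normal value
    max_exp = 2 ** exp_bits - 2 - bias    # top exponent code reserved, as in IEEE
    sign = -1.0 if x < 0 else 1.0
    mag = abs(x)
    _, e = math.frexp(mag)                # mag = m * 2**e with 0.5 <= m < 1
    e -= 1                                # rewrite as 1.f * 2**e
    e = max(e, min_exp)                   # subnormals share the smallest exponent
    step = 2.0 ** (e - man_bits)          # gap between neighbouring values
    rounded = round(mag / step) * step    # Python rounds ties to even
    max_value = (2 - 2.0 ** -man_bits) * 2.0 ** max_exp
    return sign * min(rounded, max_value)


for value in (1.0, 3.14159, 0.1, -2.6, 1e6):
    print(value, "->", round_to_minifloat(value))
```

Under this assumed layout only four mantissa values exist per power of two, so 3.14159 becomes 3.0 and 0.1 becomes 0.09375. That loss of precision is the price paid for moving and multiplying far fewer bits per value.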

Brainwave is expected to improve significantly over the next few quarters. Microsoft has been investigating FPGAs in AI-based tasks for a few years, publishing a report on its Catapult project back in October 2016, some five years after it began work. Catapult was initially concerned with optimizing the flow of data within a data center, using an FPGA to sit between the network and the servers, to route traffic more intelligently and also add compute resources where needed.