AI silicon creeps onward, hopes for ‘unreasonable effectiveness’

Intel’s recent statement that there will be no one-size-fits-all for AI chips, that there will be room for CPUs, GPUs, FPGAs and ASICs, is really just common sense – and replicates what has happened in the past for industrial, scientific and even general-purpose computing.

The trend towards massive scale in the cloud has driven more computing towards general-purpose, off-the-shelf processors, but at the same time it has stimulated demand for more specialized hardware in edge processors and for the Internet of Things (IoT). Meanwhile, the AI boom is driving development of new generations of dedicated chips, optimized for specific algorithms in neural network-based machine-learning.

Intel itself now has or is developing chips optimized at least to some degree for AI in all four categories – CPU, GPU, FPGA and ASIC. The underlying point here is that chips described as general-purpose still can and do incorporate specialized logic for particular tasks, as they have done almost since the dawn of silicon. As AI becomes increasingly synonymous with general-purpose computing, its algorithms will be supported in general CPUs.

This is a trend that has already been noted and exploited by some companies, such as Boeing’s drone-making subsidiary Insitu, which has for some years been using FPGAs for in-flight computation. However, it recently announced that advances in AI-optimized GPUs from Nvidia and most recently Intel, and even in CPUs, meant it was becoming more cost-effective to adopt these instead. These also dovetailed well with developments in edge computing around the big clouds from Amazon and Microsoft, Insitu added.

At the same time, though, with AI becoming highly competitive, there will be strong demand for dedicated hardware capable of delivering the highest possible performance in a given footprint for processes where speed is essential. For this reason, the big chip makers are investing in ASICs as well as FPGAs, although the distinction between the two is also more blurred than it used to be.

Broadly, ASICs still deliver the greatest computing power in a given silicon footprint, with more onboard memory, because the logic circuitry is totally optimized for a single process. For this reason, Intel snapped up Nervana for around $408 million in August 2016, a lot for a 48-person start-up even in machine-learning. This fed directly into development of Intel’s Neural Network Processor (NNP) range of ASICs, dedicated to machine-learning. The surprise is that the first models are not due for release until 2019. The chip will comprise 12 cores based on Intel’s “Lake Crest” architecture, with 32GB of on-board memory and performance of 40 TFLOPS (at undisclosed precision), and interconnects offering 2.4 terabits per second of bandwidth at a theoretical latency of under 800 nanoseconds.

Lake Crest, then, has three components: that high-bandwidth memory for fast on-board access, those fast interconnects so that multiple chips can be assembled into a virtual super-chip, and, most interestingly, a feature called Flexpoint. This is an Intel development for reducing the bit width needed to conduct high-precision arithmetic in AI and therefore achieve higher compute densities, resulting in reduced power consumption and greater speed. It would take a review paper to describe that (indeed such a paper exists), but the essence is that for specialized tasks it is possible to cut corners and reduce the precision of floating point calculations below 32 bits by exploiting features of the process.
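To make the bit-width idea concrete, here is a minimal Python sketch of one general way to shrink numbers below 32 bits: store a block of values as small integers that all share a single exponent. This illustrates the shared-exponent principle only and is not Intel's actual Flexpoint format; the function names and the 16-bit default are assumptions chosen for the example.

```python
import math

def encode_shared_exponent(values, mantissa_bits=16):
    """Quantize floats to signed integers that all share one power-of-two exponent.

    Hypothetical helper for illustration; not Intel's Flexpoint implementation.
    """
    limit = 2 ** (mantissa_bits - 1) - 1      # largest signed mantissa, 32767 for 16 bits
    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:
        return [0] * len(values), 0
    # Smallest exponent for which the largest value still fits in the mantissa range.
    exponent = math.ceil(math.log2(max_abs / limit))
    scale = 2.0 ** exponent
    # Clamp as a safeguard against floating point edge cases in the log2 above.
    mantissas = [max(-limit, min(limit, round(v / scale))) for v in values]
    return mantissas, exponent

def decode_shared_exponent(mantissas, exponent):
    """Recover approximate floats from the integer mantissas and shared exponent."""
    scale = 2.0 ** exponent
    return [m * scale for m in mantissas]

values = [0.5, -1.25, 3.0, 0.001]
mantissas, exponent = encode_shared_exponent(values)
decoded = decode_shared_exponent(mantissas, exponent)
print(mantissas, exponent, decoded)
```

Each value costs only `mantissa_bits` of storage plus a single exponent for the whole block, and the arithmetic on the mantissas is plain integer arithmetic. The trade-off is that small values sharing a block with large ones lose relative precision.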

The key point here is that for neural network-based training there is a lot of in-built redundancy in the processes of calculation, based on weights and activations of circuits, so that it is possible to reduce precision of each parameter significantly, in some cases right down to binary ones or zeroes, without affecting accuracy of the final inference.

Provided that the results of aggregated calculations are kept at high precision, the rest can be done surprisingly roughly, leading to use of the term “unreasonable effectiveness” in academic circles. This finding was originally unexpected, since it seemed more intuitive that errors would accumulate through computation rather than diminish, but that intuition failed to account for the corrective power of the redundancy. So “training at low precision” is now an active field of research, with further advances likely beyond Intel’s.
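The point about keeping aggregated results at high precision can be illustrated with a small Python sketch. It rounds two vectors to 8-bit integers, multiplies them element by element, and keeps the running sum in a wide accumulator before rescaling once at the end. The vector length, the seed and the function names are assumptions made for the example, and a single dot product is of course far simpler than training a network.

```python
import random

def quantize(values, bits=8):
    """Map floats in [-1, 1] to signed integers of the given bit width."""
    levels = 2 ** (bits - 1) - 1           # 127 for 8 bits
    return [round(v * levels) for v in values], levels

def low_precision_dot(xs, ws, bits=8):
    """Dot product of quantized inputs, summed in a wide integer accumulator."""
    qx, levels = quantize(xs, bits)
    qw, _ = quantize(ws, bits)
    acc = 0                                # Python ints are arbitrary precision;
                                           # hardware would use a wider register
    for a, b in zip(qx, qw):
        acc += a * b                       # narrow operands, wide running sum
    return acc / (levels * levels)         # rescale once, at the end

rng = random.Random(0)                     # fixed seed so the run is repeatable
n = 1000
xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
ws = [rng.uniform(-1.0, 1.0) for _ in range(n)]

exact = sum(x * w for x, w in zip(xs, ws))
approx = low_precision_dot(xs, ws)
error = abs(exact - approx)
print(exact, approx, error)
```

Because the individual rounding errors are roughly as likely to be positive as negative, they tend to cancel in the sum rather than pile up, which is a simplified version of the effect described above.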

Earlier, in December 2015, the chip-maker had spent even more heavily, $16.7 billion, on FPGA maker Altera, although that was with the whole IoT field in mind and not just AI. Being bigger, this acquisition is taking even longer to digest. Meanwhile, smaller rivals are making the running with a strong focus on edge computing dedicated to IoT applications and services.

One of these is Lattice Semiconductor, which has just introduced two hardware AI algorithm accelerators for its FPGAs. One, as it happens, is for binarized neural networks (BNNs) of the sort Intel is focusing on, where weights are reduced to 1s and 0s. The other accelerator is for convolutional neural networks (CNNs). Both are aimed at neural networks in consumer and industrial network-edge products and are not designed for network training, which must be done elsewhere.
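Why binarized networks suit small FPGAs can be seen in a short Python sketch. In a common formulation, each weight and activation is either +1 or -1 and is stored as a single bit, so a dot product reduces to an XNOR followed by a count of set bits. This is a generic illustration rather than a description of Lattice's accelerator, and the helper names are hypothetical.

```python
def pack_signs(values):
    """Pack a sequence of +1/-1 values into one integer, one bit each (1 means +1)."""
    bits = 0
    for i, v in enumerate(values):
        if v > 0:
            bits |= 1 << i
    return bits

def binary_dot(a_bits, b_bits, n):
    """Dot product of two packed +1/-1 vectors of length n via XNOR and popcount."""
    mask = (1 << n) - 1
    agree = ~(a_bits ^ b_bits) & mask      # XNOR: bit is 1 where the signs match
    matches = bin(agree).count("1")        # popcount
    # Each match contributes +1 and each mismatch -1 to the dot product.
    return 2 * matches - n

activations = [1, -1, 1, 1, -1, -1, 1, -1]
weights = [1, 1, -1, 1, -1, 1, 1, -1]
result = binary_dot(pack_signs(activations), pack_signs(weights), len(weights))
reference = sum(a * w for a, w in zip(activations, weights))
print(result, reference)
```

No multipliers are needed at all, only bitwise logic and a counter, which is the kind of circuitry that FPGAs provide cheaply and in bulk.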

The CNN accelerator supports not just binary 1-bit data but also 8-bit and 16-bit data for both weights and activations, and is therefore designed for wider applications than pure IoT or data analytics, including video processing.