Qualcomm’s ISP shows AI’s influence in traditional silicon design

Qualcomm’s claim to have launched the world’s first AI-based Image Signal Processor (ISP) must be seen in the context of wider trends in optimizing silicon design for algorithms and, inevitably, of the hype around the AI field. The claim is a bit of a stretch, because what Qualcomm has done is pull compute units out of the CPU, GPU and DSP (Digital Signal Processor) of its latest Snapdragon 855 mobile platform and into the new Spectra 380 ISP.

The move is better seen as part of the overall architectural evolution of Snapdragon, highlighting the importance and computational demand of image processing in smartphones today. It frees up cycles on the CPU, GPU and DSP for non-image tasks, while accelerating computer vision (CV) processing by locating the relevant components close together.

This tighter integration of hardware-accelerated CV capabilities allows the Spectra 380, and therefore the whole Snapdragon mobile platform, to classify and segment visual objects much faster and with better depth sensing, while cutting power consumption by up to 75%, according to the company. The power saving helps meet the ever-present challenge of keeping a smartphone going through a day without a recharge, given that image processing is a major source of power drain.

This can help render special effects such as ‘bokeh’ background blur in video, which is computationally intensive because it essentially fakes the effect of a large camera lens, and it also supports 4K HDR (High Dynamic Range) video capture. The object segmentation will also allow real-time background swaps, which are a prerequisite for augmented reality (AR) and virtual reality (VR).

This integration of CV components into the ISP reflects the trend towards chips designed to run intensive AI and machine learning algorithms. Qualcomm’s Snapdragon 855 also includes an improved DSP, the Hexagon 690, which doubles the number of vector accelerators, moving it closer to being a neural processor for machine learning.

Such trends raise a point which is perhaps missed by many in the AI field, to judge from the recent AI World conference held in Boston, attended by Riot. Among numerous presentations involving over 200 speakers, there was just one panel debate devoted to AI hardware, and that drew only about 20 people. Yet the current renaissance in AI is largely the result of spectacular advances in hardware, and especially in GPU performance. This has improved so much that the bottleneck holding back execution of many AI algorithms, and machine learning algorithms in particular, is now I/O rather than compute.

The latest GPUs, such as Nvidia’s Tesla T4 AI chip, boast impressive specifications, running almost 12 times faster than the preceding P4 chip at the half-precision (FP16) floating-point arithmetic relevant for AI calculations. But the problem can lie in keeping them fed continuously with data, particularly during the training of machine learning algorithms. If data does not arrive as if on a production line, ready to be used at the right moment, the chip ends up idling, and in practice this idling can account for a remarkable proportion of the total training time.
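
The effect of a slow data path on GPU utilization can be illustrated with a simple back-of-envelope model. The sketch below is not drawn from Nvidia or any vendor: the function name, the two-stage pipeline model and all timings are hypothetical, chosen only to show why overlapping data fetches with computation does little when storage is much slower than the chip.

```python
def gpu_idle_fraction(n_batches, t_load, t_compute, overlap):
    """Return the fraction of wall-clock time the GPU spends waiting for data.

    Hypothetical model: every batch takes t_load seconds to fetch and
    t_compute seconds to process. With overlap=True the next batch is
    fetched while the current one is being processed.
    """
    busy = n_batches * t_compute
    if overlap:
        # The first fetch cannot be hidden; after that each step costs
        # as much as the slower of the two stages.
        total = t_load + (n_batches - 1) * max(t_load, t_compute) + t_compute
    else:
        # Fetch and compute strictly alternate, so the GPU waits for every fetch.
        total = n_batches * (t_load + t_compute)
    return 1.0 - busy / total


if __name__ == "__main__":
    # Illustrative numbers only: data delivery 9x slower than the GPU,
    # then data delivery 2x faster than the GPU.
    for t_load in (9.0, 0.5):
        for overlap in (False, True):
            idle = gpu_idle_fraction(1000, t_load, 1.0, overlap)
            print(f"t_load={t_load} overlap={overlap} idle={idle:.1%}")
```

Under these assumed timings, a data path nine times slower than the GPU leaves the chip idle about 90% of the time, and overlapping fetches with compute only trims that to roughly 89%. Once data delivery is faster than compute, the same overlap brings idle time close to zero, which is why the speed of the storage layer matters so much.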

According to Andy Watson, CTO of WekaIO, which has developed a parallel file system optimized for flash memory, a GPU can be idle for up to 99% of the training time, leaving obvious scope for slashing the duration of training. Many in the field have come to accept training times on the order of weeks without realizing that the GPUs involved are performing very inefficiently. Watson cited the case of a self-driving car system where the training time was cut by more than 80x, from two weeks to four hours, giving much greater scope for experimentation as a result of eliminating excessive wait states.
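
The figures Watson quoted can be cross-checked with a few lines of arithmetic. The sketch below assumes, as a simplification not stated in the source, that the time spent on actual computation is unchanged and that every hour saved was previously idle time. The function names are illustrative only.

```python
HOURS_PER_WEEK = 7 * 24


def speedup(before_hours, after_hours):
    """Ratio of the old training time to the new one."""
    return before_hours / after_hours


def max_speedup_from_idle(idle_fraction):
    """Best-case speed-up if all idle time is removed (compute time unchanged)."""
    return 1.0 / (1.0 - idle_fraction)


def implied_idle_fraction(observed_speedup):
    """Idle fraction that would explain a speed-up under the same assumption."""
    return 1.0 - 1.0 / observed_speedup


if __name__ == "__main__":
    quoted = speedup(2 * HOURS_PER_WEEK, 4)  # two weeks down to four hours
    print(f"speed-up: {quoted:.0f}x")
    print(f"implied idle fraction: {implied_idle_fraction(quoted):.1%}")
    print(f"ceiling at 99% idle: {max_speedup_from_idle(0.99):.0f}x")
```

Two weeks is 336 hours, so the reduction to four hours is a factor of 84, consistent with the roughly 80x quoted. Under the stated assumption that would mean the GPUs were previously idle for about 98.8% of the time, in line with the "up to 99%" figure, which itself caps the possible gain at around 100x.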

WekaIO’s file system, called MatrixFS, aggregates local SSDs (Solid State Drives) inside the servers into one logical pool, which is then presented to host applications as a distributed and massively parallel file system. This has attracted the attention of a few partners involved in high-performance computing, including the San Diego Supercomputer Center, used for applications that are part of the grand challenges of science outlined by the US National Science Foundation (NSF). Fast I/O of this kind is still a work in progress, but WekaIO can at least point to impressive utilization gains achieved by streamlining delivery of data from such fast flash layers to GPUs. We anticipate this being a growing field spanning AI and high-performance computing research.