AMD nets one-rack petaflop, in serious challenge to Intel, Nvidia

AMD has unveiled Project 47, a rack housing 20 servers that is capable of a petaflop of compute power – an unprecedented use of space. If it lives up to the claims, the system threatens to encroach on Intel’s dominance in the cloud, and potentially spoil Nvidia’s party for its GPU accelerator cards. AMD is trying to sew up the machine-learning market, which is going to be put to work processing data pulled from IoT devices and systems.

For the IoT, and the vast amount of data that will be generated at the network edge and will require processing inside centralized cloud applications, advances in these markets are highly relevant. In addition, new AI-based computing applications and machine-learning techniques will be vital to doing something useful with all that data – sorting it into those delicious ‘actionable insights’ that help improve margins and efficiency.

The Project 47 system poses a real threat to Intel in the emerging AI and machine-learning markets, as well as in more traditional applications – the latter of which were more prevalent in the launch. Intel has been battling with Nvidia’s GPU accelerator cards in machine-learning systems, because the cards can remove the need for dozens of Intel Xeon CPUs.

With the accelerator cards, an application might only require a fraction of the CPUs that the workload would previously have required – something that has greatly upset Intel’s future growth in this nascent market. Intel is still absolutely dominant in its core server markets, but this new high-growth market doesn’t look as cozy as it might initially have done for Intel.

AMD will, of course, want to make inroads into that core market still, and while Intel has tried to counter the likes of Nvidia’s Tesla P100 with its Xeon Phi accelerator card, AMD is also gunning for its Xeons – hoping to tempt developers and cloud computing providers to use the new Epyc line, instead of the incumbent Intel Xeons.

We first saw the Instinct MI25 back in December, and since then, AMD has been very busy launching its new line of Ryzen CPUs – its first real challenger to Intel’s dominance in around five years. AMD is similarly poised with its new Vega consumer and professional GPUs, this time looking to counter Nvidia’s dominance.

But in pure machine-learning workloads, custom silicon is likely to eventually surpass GPUs as the prime choice. Currently, Google’s TensorFlow and its Tensor Processing Units (TPUs) offer a glimpse of the direction in which the industry will likely head – one where application-specific silicon wins out thanks to its optimization.

The Project 47 rack houses 20 of AMD’s new Epyc 7601 CPUs, but more importantly, 80 of its Radeon Instinct MI25 GPU cards – along with 10TB of RAM, 20 100-gigabit InfiniBand cards and a dedicated switch from Mellanox, for networking the servers together inside the rack. Samsung is providing the RAM, the NVMe flash storage, and the HBM2 memory in the MI25s. The whole rack uses 33,000 watts.

Crucially, this represents a new level of performance per watt and per rack – a claimed 30 gigaflops per watt in single precision, which AMD puts at around 25% more power efficient than competing systems. HPCwire notes that this claim is potentially a little misleading, “given that the P47 system doesn’t offer much in the way of double-precision arithmetic. Machines from HPE, NEC, Fujitsu, and Dell offer between 28-50 gigaflops per watt,” although those are in much larger housings than Project 47.
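As a rough sanity check on the efficiency claim, the quoted figures can simply be divided. The sketch below assumes a flat one petaflop of single-precision throughput and the 33,000 watts quoted for the whole rack; the function and constant names are illustrative only, and this is peak arithmetic rather than a measured result.

```python
def gigaflops_per_watt(flops: float, watts: float) -> float:
    """Return efficiency in gigaflops per watt."""
    if watts <= 0:
        raise ValueError("watts must be positive")
    return flops / watts / 1e9


# Figures quoted for Project 47, treated here as peak single-precision numbers.
RACK_FLOPS = 1e15    # one petaflop
RACK_WATTS = 33_000  # whole-rack power draw in watts

efficiency = gigaflops_per_watt(RACK_FLOPS, RACK_WATTS)
print(f"{efficiency:.1f} gigaflops per watt")
```

On those inputs the result lands at roughly 30 gigaflops per watt, in line with the claim, although it says nothing about double-precision work.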

A single petaflop of compute performance isn’t impressive compared to the most capable supercomputers, but the fact that AMD and Inventec have squeezed it out of just one rack is very impressive. No pricing has been confirmed yet, but AMD claims its per-dollar performance is superior to rival implementations. It is scheduled for release in Q4.

The individual rack-mounted servers are based on Inventec’s P47, a 2U-sized server housing one of the Epyc 7601 CPUs and four of the Radeon Instinct MI25 GPUs. In Intel-based platforms, running four GPUs typically requires at least two CPUs, and more expensive networking to support the required PCIe lanes for communicating with the peripheral hardware.
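The per-server layout multiplies out to the rack totals quoted for Project 47. A minimal sketch, assuming 20 identical servers per rack and the 32 cores per CPU that the article gives for the Epyc 7601; the `ServerConfig` type and `rack_totals` helper are hypothetical names used only for this illustration.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class ServerConfig:
    """Component counts for one 2U server (illustrative type)."""
    cpus: int
    gpus: int
    cores_per_cpu: int


def rack_totals(server: ServerConfig, servers_per_rack: int) -> Dict[str, int]:
    """Multiply one server's component counts up to a full rack."""
    return {
        "cpus": server.cpus * servers_per_rack,
        "gpus": server.gpus * servers_per_rack,
        "cpu_cores": server.cpus * server.cores_per_cpu * servers_per_rack,
    }


# One Epyc 7601 and four Radeon Instinct MI25 cards per server.
P47_SERVER = ServerConfig(cpus=1, gpus=4, cores_per_cpu=32)

totals = rack_totals(P47_SERVER, servers_per_rack=20)
print(totals)
```

This reproduces the 20 CPUs and 80 GPUs listed for the rack, plus 640 CPU cores in total.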

Mentioned onstage at the unveiling as a demonstration of the scale of Moore’s Law, the Roadrunner supercomputer that AMD built with IBM back in 2008 used 6,480 dual-core Opteron 2210 CPUs, 12,960 IBM PowerXCell 8i Cell processors, and about 102TB of RAM, housed in custom blade servers linked by InfiniBand.

It was built for the Los Alamos National Laboratory, for simulating nuclear material decay – something IBM is still involved with, using its TrueNorth neuromorphic chips. Roadrunner was the first hybrid supercomputer, using those Opterons to coordinate the PowerXCell accelerators that were doing the heavy lifting. With Project 47, AMD is providing both pieces of the puzzle.

The first version of Roadrunner was housed in 296 server cabinets, occupying some 6,000 square feet. It first went live in 2006, in phase one tests, but in its first proper outing in 2008, the machine finally passed the petaflop barrier, scoring 1.026 petaflops – eventually hitting a peak of 1.7 petaflops.

It cost some $100m to build, and was relatively inefficient with electricity – using around 2.345 MW (2,345,000 Watts – your average desktop uses around 400W), scoring around half the megaflops-per-watt of comparable rivals. Roadrunner was replaced by Cielo, a machine that cost ‘just’ $54m.

AMD’s new Project 47 system can now apparently achieve that petaflop performance in just one rack – and can hit two petaflops in half-precision (16-bit) floating point. Each Epyc 7601 houses 32 processing cores, meaning that the entire rack uses 640 CPU cores – about 5% of the 12,960 Opteron cores in Roadrunner’s 6,480 dual-core chips. Instead of the IBM Cell processors, Project 47 is relying on those Radeon MI25 GPUs.

Consequently, the system uses about 98% less power than Roadrunner, and only a single rack – apparently just 0.07% of the physical space that Roadrunner needed. In terms of scaling it up, AMD believes that it could hit an exaflop of compute power using a thousand Project 47 racks, but this would require 33.3MW of electricity (and a huge amount of cooling).
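The power and scaling comparison can be reproduced from the numbers quoted in the article. The sketch below assumes one petaflop and 33,000 watts per rack, Roadrunner’s 2,345,000 watts, and perfectly linear scaling to an exaflop, which ignores cooling and interconnect overhead; all names are illustrative.

```python
ROADRUNNER_WATTS = 2_345_000  # quoted Roadrunner power draw
RACK_WATTS = 33_000           # quoted Project 47 rack power draw
RACK_FLOPS = 10**15           # one petaflop per rack (claimed)
EXAFLOP = 10**18


def power_saving_percent(old_watts: float, new_watts: float) -> float:
    """Percentage reduction in power going from old_watts to new_watts."""
    return (1 - new_watts / old_watts) * 100


def racks_needed(target_flops: int, flops_per_rack: int) -> int:
    """Racks required to reach target_flops, rounding up (integer ceiling)."""
    return -(-target_flops // flops_per_rack)


saving = power_saving_percent(ROADRUNNER_WATTS, RACK_WATTS)
racks = racks_needed(EXAFLOP, RACK_FLOPS)
total_megawatts = racks * RACK_WATTS / 1e6

print(f"Power saving vs Roadrunner: {saving:.1f}%")
print(f"Racks for an exaflop: {racks}, drawing {total_megawatts:.1f} MW")
```

With these inputs the saving is a little over 98% and a thousand racks draw 33MW. The small gap from the quoted 33.3MW may simply reflect rounding in the per-rack figure.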