FPGA specialist Tachyum has made an outlandish claim this week, declaring its new Prodigy line of reprogrammable silicon to be the “first universal processor.” With big performance claims to boot, the Prodigy FPGA is being positioned to target all manner of data center workloads, with video being a key use case.
Of course, Faultline is pretty familiar with such grand claims, and we are used to sifting through some very dubious press releases. Tachyum, thankfully, does not fall into that camp. CEO and co-founder Radoslav Danilak talked us through the road that led to the Prodigy.
The concept behind the universal processor is a single chip that can properly handle the work of the three main silicon approaches needed in enterprise workloads – CPU, GPU, and AI-focused designs like Google’s TPU (Tensor Processing Unit).
Danilak stressed that the Prodigy is not simply three technologies in a trench coat, and the key feature is the ability to repurpose one of these chips to match the dynamic workloads seen throughout the day.
On this point, Danilak highlighted the problem that idle servers pose. Data centers, according to Tachyum’s research, represent 4% of total global power consumption, but this is a figure that is growing at 15% per year – and faster during Covid. At this pace, it would reach 40% by 2040, and so something needs to be done.
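As a rough check on that projection, the compound-growth arithmetic can be sketched in a few lines of Python. This is only an illustration: it assumes the 15% annual growth applies directly to the 4% share and holds steady, and the function name `years_to_reach` is ours, not Tachyum's.

```python
import math

def years_to_reach(start_share: float, target_share: float, annual_growth: float) -> float:
    """Years of steady compound growth needed to go from start_share to target_share."""
    return math.log(target_share / start_share) / math.log(1.0 + annual_growth)

# Figures quoted by Tachyum: 4% today, 15% growth per year, 40% as the end point.
years = years_to_reach(0.04, 0.40, 0.15)
print(f"About {years:.1f} years to grow from 4% to 40%")
```

On those assumptions, the tenfold increase takes between 16 and 17 years, which is broadly in line with the 2040 date quoted.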
Worse, much of this power is being spent on servers doing nothing. Facebook recently published data showing that the average utilization of its servers over a 24-hour period is around 30% – mimicking human activity cycles. Due to latency issues, you cannot simply run a global data center, and serve EU customers with a US rack.
The obvious thing to do here is to use the idling servers for something that is productive.
In the Facebook example, Danilak pointed to how these servers should be training the AI models that Facebook increasingly relies on, but here is where the differences in silicon architecture come in – these servers are not physically suited for that sort of job.
And so, the pitch is that a product like Prodigy would allow companies to actually get the most out of their investments – either for internal jobs, or to provide third parties with compute power. Danilak pointed to Microsoft selling surplus Azure AI cycles to the US government, for video surveillance processing.
The scale of the problem seems enormous. Danilak said there is around $300 billion of commissioned IT equipment currently, and at that 30% utilization rate, this effectively means $180 billion has been wasted. Here, Tachyum proclaims that using its chips would allow you to avoid that headache, and save on both cost and power consumption – as idle servers are still consuming rather a lot of electricity.
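The arithmetic behind that figure can be sketched as follows, under the simplifying assumption (ours, not Danilak's) that wasted spend scales linearly with unused capacity. The helper name `idle_value_bn` is purely illustrative.

```python
def idle_value_bn(installed_value_bn: float, utilization: float) -> float:
    """Value of equipment sitting idle, assuming waste is proportional to unused capacity."""
    return installed_value_bn * (1.0 - utilization)

INSTALLED_BN = 300.0      # commissioned IT equipment, per Danilak
QUOTED_WASTE_BN = 180.0   # wasted spend, per Danilak

# Idle value if utilization really is 30%.
idle_at_30_pct = idle_value_bn(INSTALLED_BN, 0.30)

# Utilization that would make the quoted waste figure exact under this linear model.
implied_utilization = 1.0 - QUOTED_WASTE_BN / INSTALLED_BN

print(f"Idle value at 30% utilization: ${idle_at_30_pct:.0f} billion")
print(f"Utilization implied by $180 billion of waste: {implied_utilization:.0%}")
```

Under this simple model, 30% utilization would leave about $210 billion idle, while the quoted $180 billion corresponds to utilization of roughly 40%. Danilak did not spell out how the figure was derived, so the gap may simply reflect a more conservative accounting.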
The key difference between Prodigy and its CPU, GPU, and TPU rivals is that Tachyum spent a lot of time developing a different way to move data around inside the chip – a new instruction set, effectively.
Over the past few decades, there have been immense improvements in semiconductor performance, with the raw speed of transistors improving between 6x and 8x. Moving more elements from the PCB inside the chips themselves has also cut latency significantly, but the industry has been grappling with a slowdown in Moore’s Law simply due to the rules of physics.
As the process size of these chips has decreased, the internal wire size has too. This unfortunately means that the electrical resistance of the wire has increased, which put simply means that a 10x smaller wire ends up being 100x slower.
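The 100x figure follows from the standard formula for the resistance of a conductor, R = ρL/A. The sketch below is an illustration under stated assumptions rather than Tachyum's own model: it takes “10x smaller” to mean the wire's width and height each shrink by 10x while its length stays fixed, and it treats delay as proportional to resistance with capacitance held constant. The dimensions are hypothetical.

```python
def wire_resistance(resistivity: float, length: float, width: float, height: float) -> float:
    """Resistance in ohms of a rectangular wire: R = rho * L / (w * h)."""
    return resistivity * length / (width * height)

RHO_COPPER = 1.68e-8   # ohm-metres, bulk copper at room temperature
LENGTH = 1e-3          # 1 mm of wire (hypothetical)

# Hypothetical cross-sections: 100 nm square versus 10 nm square.
base = wire_resistance(RHO_COPPER, LENGTH, 100e-9, 100e-9)
shrunk = wire_resistance(RHO_COPPER, LENGTH, 10e-9, 10e-9)

ratio = shrunk / base
print(f"Resistance increases by a factor of about {ratio:.0f}")
```

Because the cross-sectional area falls with the square of the linear dimension, a 10x shrink in each direction cuts the area by 100x and raises resistance by the same factor. Real interconnect is worse still at very small sizes, since effects such as surface scattering push resistivity above the bulk value used here.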
Danilak said that “these wire delays are now limiting the performance of functional blocks,” and that “as time has moved on, more processing time is spent in the wire than the transistor.”
This forced the company to create a different approach for the internal operations.
“The big question was if we could break the process of moving the data around and the calculations into two different processes,” said Danilak.
“This allows us to do the next calculations at the same unit that just produced the data, without having to move data around the wires. That’s the simple way to explain it. We are moving data around the 8 positions in the unit in a way that is faster in about 93% of cases. We can’t solve the physics problem, so have to find a way around it,” Danilak continued.
Danilak added that if you are not moving data around, you are also not using electricity to do so. This leads to a claimed 3x to 10x improvement in power consumption. In the variety of performance throughput benchmarks shown to Faultline, the lowest ratio was 3.3x, with the highest being 17.1x.
Danilak stressed that 6x to 8x is a sensible expectation, with a rack-level comparison showing one air-cooled server rack of Prodigy chips scoring 4.8x the performance of rivals. Liquid cooling improves this further, often doubling it, and the best scenario saw one rack of these Prodigies do the same work as 13 Nvidia DGX racks. Liquid cooling also allows for energy recovery projects like district heating, which would allow data centers to offset their carbon emissions further.
Pricing is not finalized, and this was the only time that Danilak did not have a concrete answer for us. Instead of providing a price, he discussed the cost of these chips compared to rivals. Tachyum is using TSMC and Samsung to produce these 5nm chips, and Danilak said that between $3k and $4k of Prodigies should provide the performance of $10k in rivals.
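To put that comparison in price-performance terms, the ratio can be worked out directly. The snippet below is a back-of-the-envelope sketch using only the dollar figures Danilak quoted; the function name `cost_advantage` is ours.

```python
def cost_advantage(rival_cost_usd: float, prodigy_cost_usd: float) -> float:
    """How many times more performance per dollar, for equal delivered performance."""
    return rival_cost_usd / prodigy_cost_usd

RIVAL_COST_USD = 10_000

# Danilak's range: $3k to $4k of Prodigy silicon for $10k worth of rival performance.
low_end = cost_advantage(RIVAL_COST_USD, 4_000)
high_end = cost_advantage(RIVAL_COST_USD, 3_000)

print(f"Claimed price-performance advantage: {low_end:.1f}x to {high_end:.1f}x")
```

That works out to a claimed advantage of roughly 2.5x to 3.3x in performance per dollar, although the real figure will depend on the pricing that has yet to be confirmed.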
Pricing will be confirmed later in the year, once production volumes give a clearer cost expectation. Sampling is scheduled for the second half of the year, and the second generation of Prodigy chips is expected in 2024. Tachyum is working on a total addressable market (TAM) estimate of $3 billion to $4 billion, saying that the AI-based market is around 10% of the $30 billion spent on microprocessors for the cloud. The percentage will increase in time.
Currently, Tachyum is helping to assemble servers, and has its own motherboard reference design to aid in this. Eventually, Tachyum wants to sell just the chips, and once its server partners no longer need Tachyum reference designs, Tachyum will stop making its own servers.
The Prodigy family spans from a 32-core chip to a 128-core model, designed to cater for all data center budgets. The 128-core variant has a monstrous 5.7 GHz clock speed, supports 32 TB of RAM, and includes a 200 Gbps Broadcom Ethernet interface. The architecture also lets customers run native x86, ARM, and RISC-V code.
Tachyum has invested heavily in creating the software ecosystem to support its new instruction set too, noted Danilak. In terms of physical size, the chip is around 500 mm², which is significantly smaller than an Nvidia H100, at about 800 mm², and Intel’s leading Xeons, which measure some 650 mm².
The team includes Steve Furber and Fred Weber, whom Danilak described respectively as the father of the ARM instruction set and the person who developed AMD’s 64-bit x86 instruction set. The latter was so much better than Intel’s attempt that Intel continues to license it to this day. “Having two people responsible for 98% of computing running today in Tachyum is pretty amazing,” said the CEO.