Now comes the hardest part for AMD: software

From the moment the first rumors swirled that AMD was considering acquiring FPGA manufacturer Xilinx, we thought this deal was as much about software as it was about hardware.

We like the weird quantum state between hardware and software that the programmable gates in FPGAs occupy, but that was not the most important thing. Access to a whole new set of embedded customers mattered, too. But the Xilinx deal was really about the software, and about the skills Xilinx has accumulated over decades of creating very precise dataflows and algorithms to solve problems where latency and locality matter.

Among the presentations from AMD’s Financial Analyst Day last month, we have been reflecting on the one given by Victor Peng, former chief executive officer of Xilinx and now president of AMD’s Adaptive and Embedded Computing Group.

This group mixes AMD’s embedded CPUs and GPUs with Xilinx FPGAs and has more than 6,000 customers. It brought in a combined $3.2 billion in revenue in 2021 and is poised to grow around 22 percent this year to around $3.9 billion. Importantly, Xilinx on its own had a total addressable market of around $33 billion for 2025, but with the combination of AMD and Xilinx, the TAM for the Adaptive and Embedded Computing Group expands to $105 billion. Of that amount, $13 billion comes from the data center market that Xilinx chases, $33 billion comes from embedded systems of all kinds (factories, weapons, and so on), $27 billion comes from the automotive sector (lidar, radar, cameras, automated parking, the list goes on), and $32 billion comes from the communications sector (5G base stations being the big workload). Incidentally, this is about a third of the new and improved AMD’s $304 billion TAM for 2025. (You can see how this TAM has exploded over the past five years here. It is remarkable, so we have gone through it in detail.)

But a TAM is not revenue; it is just a giant glacier off in the distance that might, with enough skill, be melted down into revenue.

At the heart of the strategy is AMD’s pursuit of what Peng called “pervasive AI,” and that means using a mix of CPUs, GPUs, and FPGAs to cater to this booming market. It also means leveraging the work AMD has done to design exascale systems in conjunction with Hewlett Packard Enterprise and some of the world’s leading HPC centers to continue to flesh out an HPC stack. AMD will need both if it hopes to compete with Nvidia and keep Intel at bay. CUDA is a great platform, and oneAPI could be if Intel sticks with it.

“When I was at Xilinx, I never said that adaptive computing was the end all, be all of computing,” Peng explained in his keynote. “CPUs are always going to drive a lot of the workloads, and so are GPUs. But I’ve always said that in a changing world, adaptability really is an incredibly valuable attribute. Change is happening everywhere. You hear about it: the architecture of the data center is changing. The car platform is changing completely. Industry is changing. There is change everywhere. And if the hardware is adaptable, that means not only that you can change it after it is made, but that you can change it even when it is deployed in the field.”

Well, the same can be said for software, which of course follows hardware, even if Peng did not say that. People were having fun with Smalltalk in the late 1980s and early 1990s, after it had matured for two decades, because of its object-oriented approach to programming, but the market chose what we would argue was the inferior Java a few years later because of its absolute portability thanks to the Java Virtual Machine. Enterprises not only want the option of many different kinds of hardware, each tailored to specific situations and workloads, but they also want their code to be portable across those scenarios.

That’s why Nvidia needs a CPU that can run CUDA (we know how weird that sounds), and why Intel created oneAPI and anointed Data Parallel C++ with SYCL as its Esperanto across CPUs, GPUs, FPGAs, NNPs, and everything in between.
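
To make the portability point concrete, here is a minimal SYCL sketch – our own illustration, not code from Intel or from Peng’s presentation – in which a single kernel source runs on whatever device the oneAPI runtime selects, be it a CPU, a GPU, or an FPGA.

```cpp
#include <sycl/sycl.hpp>
#include <iostream>

int main() {
    // A default-constructed queue binds to whatever device the runtime
    // picks: a CPU, a GPU, or an FPGA, depending on the installed backends.
    sycl::queue q;
    std::cout << "Running on: "
              << q.get_device().get_info<sycl::info::device::name>() << "\n";

    constexpr size_t n = 1024;
    float *a = sycl::malloc_shared<float>(n, q);
    float *b = sycl::malloc_shared<float>(n, q);
    float *c = sycl::malloc_shared<float>(n, q);
    for (size_t i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    // One kernel body, compiled for whichever device the queue targets.
    q.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> i) {
        c[i] = a[i] + b[i];
    }).wait();

    std::cout << "c[0] = " << c[0] << "\n";
    sycl::free(a, q);
    sycl::free(b, q);
    sycl::free(c, q);
    return 0;
}
```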

This is also why AMD needed Xilinx. AMD has a lot of engineers – well, north of 16,000 of them now – and a lot of them write software. But as Jensen Huang, co-founder and chief executive officer of Nvidia, explained to us last November, three-quarters of Nvidia’s 22,500 employees write software. And it shows in the breadth and depth of the development tools, algorithms, frameworks, and middleware available for CUDA – and in how this variant of GPU acceleration has become the de facto standard for thousands of applications. If AMD was going to have the algorithmic and industry expertise to port applications to a combined ROCm and Vitis stack, and to do it in less time than it took Nvidia, it had to buy that expertise.

That’s why Xilinx cost AMD $49 billion. And that is also why AMD is going to have to invest a lot more in software developers than it has in the past, and why the Heterogeneous Interface for Portability, or HIP, API – a CUDA-like API that lets code target a variety of processors as well as Nvidia and AMD GPUs – is a key part of ROCm. HIP is what lets AMD support CUDA applications on its GPU hardware much more quickly.
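
To show roughly why HIP eases that support, here is a minimal HIP vector-add – again our own sketch, not AMD sample code – whose calls track the CUDA runtime API almost one for one, which is exactly the point.

```cpp
#include <hip/hip_runtime.h>
#include <cstdio>
#include <vector>

// A CUDA-style kernel; the same HIP source builds for AMD GPUs with hipcc,
// or for Nvidia GPUs when HIP is layered over the CUDA toolkit.
__global__ void vector_add(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1024;
    const size_t bytes = n * sizeof(float);
    std::vector<float> ha(n, 1.0f), hb(n, 2.0f), hc(n, 0.0f);

    float *da, *db, *dc;
    hipMalloc(&da, bytes);
    hipMalloc(&db, bytes);
    hipMalloc(&dc, bytes);
    hipMemcpy(da, ha.data(), bytes, hipMemcpyHostToDevice);
    hipMemcpy(db, hb.data(), bytes, hipMemcpyHostToDevice);

    // The launch mirrors CUDA's <<<grid, block>>> syntax via a portable call.
    hipLaunchKernelGGL(vector_add, dim3((n + 255) / 256), dim3(256), 0, 0,
                       da, db, dc, n);

    hipMemcpy(hc.data(), dc, bytes, hipMemcpyDeviceToHost);
    printf("c[0] = %f\n", hc[0]);

    hipFree(da);
    hipFree(db);
    hipFree(dc);
    return 0;
}
```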

But in the long term, AMD needs to have its own comprehensive stack covering all AI use cases across its many devices:

This stack has evolved, and Peng will lead it from here on out with the help of some of those HPC centers that have leveraged AMD CPUs and GPUs as computational engines in pre-exascale and exascale-class supercomputers.

Peng didn’t talk about HPC simulation and modeling at all in his presentation, and he only touched lightly on the idea that AMD would create an AI training stack on top of the ROCm software created for HPC. Which is logical. But he did show how the AI inference stack at AMD would evolve, and from that we can draw parallels across HPC, AI training, and AI inference.

Here’s what the AI inference software stack for CPUs, GPUs, and FPGAs at AMD looks like today:

With the first iteration of its unified AI inference software – what Peng called the Unified AI Stack 1.0 – the software teams from AMD and the former Xilinx will create a unified inference front end that spans the ML graph compilers for the three different sets of compute engines as well as the popular AI frameworks, and then compiles code down to each of those devices individually.

But over the longer term, with the Unified AI Stack 2.0, the ML graph compilers are unified and a common set of libraries spans all of these devices. In addition, support for some of the AI Engine DSP blocks that are hard-coded into Versal FPGAs will be moved into the Zen Studio AOCC and Vitis AI Engine compilers, which will be mixed together to create runtimes for the Windows and Linux operating systems on APUs that add AI inference engines to Epyc and Ryzen processors.

And that, in software terms, is the easy part. After creating a unified AI inference stack, AMD needs to create a unified HPC and AI training stack on top of ROCm, which is not so bad, and then the hard work begins: getting the nearly 1,000 key open source and closed source applications that run on CPUs and GPUs ported so they can run on any combination of hardware that AMD can bring to bear – and probably on its competitors’ hardware, too.
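
For the GPU slice of that porting work, much of the mechanical effort is a source-to-source translation from CUDA to HIP, which tools such as hipify-perl and hipify-clang automate. A hypothetical before-and-after – the kernel and buffer names here are made up for illustration – looks roughly like this:

```cpp
// Illustrative only: the kind of mechanical CUDA-to-HIP rewrite that
// hipify tooling performs on existing application sources.
// "kernel", "d_buf", "h_buf", "grid", and "block" are hypothetical names.

// Original CUDA calls in an application:
//   cudaMalloc(&d_buf, bytes);
//   kernel<<<grid, block>>>(d_buf, n);
//   cudaMemcpy(h_buf, d_buf, bytes, cudaMemcpyDeviceToHost);

// The same calls after hipification, portable across AMD and Nvidia GPUs:
//   hipMalloc(&d_buf, bytes);
//   hipLaunchKernelGGL(kernel, grid, block, 0, 0, d_buf, n);
//   hipMemcpy(h_buf, d_buf, bytes, hipMemcpyDeviceToHost);
```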

It’s the only way to beat Nvidia and knock Intel off balance.
