Advancing AI: AMD Unveils Next-gen AI Solutions for Data Centers, Challenging NVIDIA, Intel

Oct 16, 2024
  • AMD’s new generation of EPYC CPUs (Turin), Instinct GPUs (MI325X and the MI350 series) and Pensando networking DPUs and SmartNICs (Pollara and Salina) cements its position in “full system”-level offerings for AI servers.
  • AMD’s EPYC (Turin) will further boost AMD’s growing traction in the server and data center CPU market, especially in the x86 space against Intel.
  • AMD’s latest portfolio of GPU accelerators expands on the MI300 series, offering hyperscalers and enterprises more choices beyond NVIDIA.
  • AMD’s next-generation networking portfolio and ROCm AI platform round out the end-to-end offering to compete head-on with NVIDIA.

AMD CEO Lisa Su unveiled the company’s latest product portfolio at the Advancing AI event, held in San Francisco recently. Counterpoint analysts attended the event at the invitation of AMD. The cutting-edge end-to-end portfolio is built to advance AI from cloud to edge.

Before unveiling the latest generation of offerings, AMD showcased the progress of its current generation in the data center segment and the extent of partnerships and openness it had created within the ecosystem. On the stage were officials from key hyperscalers (Microsoft, Google Cloud, Meta, IBM, Oracle) to OEM/ODM partners (Databricks, Lenovo, Cisco, Supermicro, HPE, Tensorwave). In the demo zone were QCT, Gigabyte, Vultr, ASRock, Micron, Dell and many more.

Some key insights from partners showcasing AMD’s growing traction:

  • Google Cloud is leveraging AMD EPYC processors for general-purpose, confidential computing and high-performance computing workloads, as well as for its AI Hypercomputer supercomputing architecture. EPYC 9005 series-based VMs will be available on Google Cloud in early 2025.
  • Oracle Cloud has successful AMD CPU instances with major customers such as PayPal and Bank of Brazil, and a growing proportion of AMD Instinct MI300 series GPU instances for GenAI workloads.
  • Databricks is seeing a 50% improvement in workload performance with AMD’s MI300 series accelerator GPUs. AMD’s ROCm AI software has enabled it to migrate workloads from other environments seamlessly and “without any modifications”.
  • Microsoft has a sizable fleet of MI300 accelerators in Azure and is seeing significant performance benefits in training GPT-family GenAI models.
  • Meta has been one of AMD’s biggest customers, having deployed more than 1.5 million AMD EPYC CPUs. Further, Meta has trained, fine-tuned and served live traffic for the 405B-parameter Llama 3.1 model exclusively on AMD’s Instinct MI300 series.
  • Essential AI, Fireworks AI, Luma AI and Reka AI are optimizing their models across AMD hardware seamlessly with AMD’s ROCm software.
  • Dell, HPE, Lenovo and Supermicro are ramping up their portfolios with the latest-generation EPYC CPUs.
  • Leading players such as Lenovo now see more than half of their server system sales to Tier-2 CSPs coming from AMD CPU-based systems. These sales are poised to grow 70% YoY this year. AMD’s footprint within Lenovo’s deployments has tripled in the last three years.
  • AMD has delivered more than 350 server platforms and over 950 cloud instances powered by its solutions.

Building on this success and traction, AMD has launched a new generation of its CPUs, GPUs and DPUs for networking and AI software solutions.


Source: AMD

EPYC DC CPUs

  • First and foremost is AMD’s fifth-generation x86 server CPU, EPYC Turin, which aims to extend the company’s server leadership beyond a third of the market toward its North Star goal of 50% share.

Source: AMD

  • Turin delivers a major performance improvement: built on TSMC 3nm/4nm processes, it scales up to 192 cores with the Zen 5/Zen 5c architectures for both scale-up and scale-out designs, brings a sizable 17% IPC uplift and remains socket-compatible with Genoa.
  • For comparison, AMD’s first-generation EPYC, launched in 2017, sported just 32 cores. That is almost a 6x increase in core count across five generations, translating to almost 11x performance.
  • The latest Turin EPYC shows significant performance improvements across applications, from databases, media processing, supercomputing and HPC (e.g. dense linear solvers and molecular dynamics) to machine learning and end-to-end AI performance when paired with GPUs.
  • AMD has made a striking claim of almost 7:1 consolidation: effectively replacing 1,000 2P Intel Xeon Platinum 8280 servers with just 131 modern 2P AMD EPYC 9965 servers, which means 87% fewer servers, 68% less power, 77% lower TCO and ample space savings.
  • Partners such as Micron Technology and Supermicro showcased CXL 2.0 module-based designs to help scale up memory, lower TCO and boost memory performance by up to 67%.
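The generational and consolidation figures above are easy to sanity-check with quick arithmetic. The sketch below uses only the numbers AMD stated (32 vs. 192 cores, 1,000 vs. 131 servers) and is a back-of-the-envelope check, not AMD's own methodology:

```python
# Back-of-the-envelope check of AMD's stated EPYC figures.
# All inputs come from AMD's public claims; this is not AMD's methodology.

gen1_cores = 32    # 1st-gen EPYC (2017), per the article
turin_cores = 192  # 5th-gen EPYC "Turin" top configuration

core_scaling = turin_cores / gen1_cores
print(f"Core-count increase across five generations: {core_scaling:.0f}x")
# 192 / 32 = 6x, matching the "almost 6x" claim

# Consolidation claim: 1,000 2P Xeon Platinum 8280 servers
# replaced by 131 2P EPYC 9965 servers.
legacy_servers = 1000
turin_servers = 131

ratio = legacy_servers / turin_servers
reduction = 1 - turin_servers / legacy_servers
print(f"Consolidation ratio: {ratio:.1f}:1")
print(f"Server reduction: {reduction:.0%}")
# 1000 / 131 ≈ 7.6:1 and ≈ 87% fewer servers, consistent with
# the "almost 7:1" and "87% fewer servers" claims
```

The stated 7:1 figure is thus slightly conservative; the raw ratio works out closer to 7.6:1.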

AMD EPYC CPU Adoption from Hyperscalers to Enterprises

Source: AMD

AMD also unveiled its latest generation of GPU accelerators to build upon the momentum achieved by its Instinct MI300 series, which has been adopted by all the leading hyperscalers.

Instinct GPUs and ROCm AI software

  • AMD announced the MI325X (TSMC 4nm, HBM3E), based on the CDNA3 GPU architecture, and previewed the next-generation Instinct MI350 series (TSMC 3nm, HBM3E), based on the new CDNA4 architecture with FP4/FP6 support, promising 7.4x AI FLOPS, 1.5x HBM memory and support for 6x larger models.
  • AMD claims almost 35x performance for the latest CDNA4 compared to the CDNA3 architecture, driven by 7x AI compute and 1.5x memory bandwidth.
  • AMD’s GPUs are now on a one-year cadence, compared to a two-year cadence for its CPUs. Next on the roadmap is the MI400 series in 2026, featuring the CDNA Next architecture. Observers will be eager to see whether AMD continues with TSMC’s 3nm process or moves to 2nm for the MI400 series.

Source: AMD

  • AMD has done notable work on its open, modular ROCm software stack, adding newer algorithms, libraries and expanded support for major frameworks and models to empower developers to build and optimize AI models and workloads on its GPUs.

AMD Instinct GPUs Adoption Across Leading GenAI Platforms via OEM Solutions

Source: AMD

  • The new ROCm v6.2 brings more than 2x performance improvements through runtime optimization, kernel fusion, improved communication and more across various foundation models, from Mixtral to Llama.
  • It remains to be seen how developers and CSPs view ROCm compared to CUDA. Expect a key battle for differentiation here.
  • Over 1 million Hugging Face GenAI models are now optimized for MI300X series GPUs with out-of-the-box support.
  • Day-zero support for Meta’s Llama 3.1/3.2 and PyTorch, plus vendor-agnostic compiler support via Triton, ensures deep developer engagement.

AMD also unveiled its most advanced networking solutions, with programmable DPUs and SmartNICs driving front-end and back-end network efficiency and scalability. AMD has caught up with NVIDIA here to complete the end-to-end system offering, pairing more capable networking solutions with its compute hardware and software.

Pensando DC Networking

  • This is a very important development, as the latest third-generation Pensando P4 engines, with 400G PCIe Gen5 back-end and front-end DPUs and SmartNICs, are crucial for end-to-end, system-level AI performance.


Source: AMD

  • AMD’s Pensando Salina 400 DPU is a major upgrade for evolving front-end networks, handling SDN, encryption, security, load balancing, NAT and other functions offloaded from the CPU.
  • AMD’s Pensando Pollara 400 UEC-ready AI NIC takes programmability to the next level, enabling efficient congestion control and RDMA transport for sustained data transfer and higher network utilization.
  • AMD is pushing for Ultra Ethernet Consortium-ready RDMA solutions with 5x faster completion times, driving lower TCO and better scalability compared to InfiniBand solutions.

Key Takeaways

  • AMD has done well to drive its leadership in server CPUs with the EPYC line of compute solutions, which is being well adopted by Tier-1 and Tier-2 CSPs as well as enterprises.
  • AMD believes a crossover point of more than 50% share in the server CPU segment is achievable by 2026; in GPUs, where NVIDIA owns the lion’s share, it will take AMD somewhat longer to reach the 50% level.
  • Building a strong ecosystem of partners and open software solutions has enabled AMD to strengthen its market position to get closer to NVIDIA in terms of adoption and traction.
  • The recent acquisitions of Silo AI and ZT Systems should further help AMD in supporting its ecosystem partners to accelerate their models, applications and solutions from the time-to-market perspective on AMD hardware.



Author

Neil Shah

Neil is a sought-after, frequently quoted industry analyst with a wide spectrum of rich multifunctional experience. He is a knowledgeable, adept, and accomplished strategist. In the last 18 years he has offered expert strategic advice that has been highly regarded across different industries, especially in telecom. Prior to Counterpoint, Neil worked at Strategy Analytics as a Senior Analyst (Telecom). Neil also had an opportunity to work with Philips Electronics in multiple roles. He is also an IEEE Certified Wireless Professional with a Master of Science (Telecommunications & Business) from the University of Maryland, College Park, USA.
