
How Far Has AMD ROCm Come in Catching Up to CUDA?

Ikesan

I read an EE Times interview with AMD’s Anush Elangovan (AI Software VP), published on April 15, 2026.
It covers how close ROCm can get to CUDA, and I’ve been running a ROCm setup on Strix Halo myself.
To be precise, I’ve been “wrestling with an environment that keeps breaking,” so I’ll mix in my real-world experience alongside the article’s content.

Why GPUs for Inference?

GPUs are bad at complex control flow.
They lack the sophisticated branch prediction of CPUs and can’t flexibly switch between different kinds of tasks.
So why are they used for AI inference?

The answer is simple: AI inference computation happens to be shaped exactly the way GPUs like.

CPUs have a small number of high-performance cores (a few to a few dozen) that execute complex instructions sequentially.
They pack in branch prediction, out-of-order execution, and large caches to speed up general-purpose work.
GPUs, by contrast, have thousands to tens of thousands of tiny cores that run the same operation on massive data simultaneously.
Each core is simple, but the degree of parallelism is orders of magnitude higher.

| Property | CPU | GPU |
| --- | --- | --- |
| Core count | A few to dozens | Thousands to tens of thousands |
| Per-core capability | High (complex instructions) | Low (simple arithmetic only) |
| Strengths | Sequential, branch-heavy work | Massively parallel identical ops |
| AI inference fit | Poor (not enough parallelism) | Excellent (parallel matrix ops) |

AI inference is essentially a pile of matrix operations.
Every time a large language model generates a single token, it runs multiply-accumulate operations across billions of weight parameters.
“Run the same simple computation on massive data in parallel” is a workload tailor-made for GPUs.
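To make the “pile of matrix operations” concrete: each layer of a token step boils down to matrix-vector products, i.e. independent rows of multiply-accumulate operations. A toy illustration in plain Python (a real stack dispatches each row, and each partial product, to thousands of GPU cores at once):

```python
# Toy illustration: one "layer" of a token step is a matrix-vector
# product -- rows of multiply-accumulate (MAC) operations.
# Each row is independent of the others, which is exactly the
# parallelism a GPU exploits.

def matvec(weights: list[list[float]], x: list[float]) -> list[float]:
    out = []
    for row in weights:              # each row is independent -> parallelizable
        acc = 0.0
        for w, v in zip(row, x):     # multiply-accumulate
            acc += w * v
        out.append(acc)
    return out

W = [[1.0, 2.0],
     [3.0, 4.0]]
print(matvec(W, [1.0, 1.0]))  # → [3.0, 7.0]
```

An LLM does this with billions of weights per token, which is why the serial inner loop above is hopeless on a CPU but trivial to spread across a GPU.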

But raw GPU hardware alone does nothing.
You need a software layer that converts “a PyTorch model definition” into “parallel kernels running on the GPU.”
Matrix partitioning, memory layout, kernel scheduling: that conversion is handled by compute platforms like CUDA, ROCm, or Metal.

In other words, whether you can extract a GPU’s full performance depends on the quality of the software stack running on top.
No matter how much raw compute the hardware offers, if the software botches the translation, performance tanks or things just don’t run.
ROCm producing “garbage tokens” and Metal being slow on BF16 are software stack problems, not GPU hardware problems.

With that context, it becomes clear that CUDA is dominant and ROCm is struggling not because of a hardware performance gap but because of software ecosystem depth.

Why CUDA Is So Dominant

When GPU compute comes up, people tend to focus on hardware specs, but that misses the point.
CUDA’s dominance comes from the sheer thickness of its software ecosystem.

Since launching in 2006, CUDA has spent roughly 20 years building up the following.

  • Dev tools (Nsight, cuda-gdb, nvprof)
  • Libraries (cuDNN, cuBLAS, NCCL, TensorRT)
  • Documentation and sample code
  • A talent pool of CUDA-experienced engineers
  • Framework-level optimization (PyTorch, TensorFlow, JAX)

AI development runs on CUDA as a de facto standard.
Search GitHub for PyTorch code and you’ll find .cuda() calls hardcoded everywhere.
“Requirements: NVIDIA GPU” in an OSS README is perfectly normal.

This is vendor lock-in, yes, but from the user’s perspective, having “everything just work” is overwhelmingly convenient.
You encounter broken models far less often.
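The lock-in shows up directly in code. Here is a hedged sketch of the device-agnostic pattern that avoids hardcoded `.cuda()` calls; note that ROCm builds of PyTorch report themselves through the `torch.cuda` API (HIP reuses the CUDA API surface), which is also why hardcoded `.cuda()` often still runs on AMD:

```python
# Device-agnostic setup instead of hardcoded .cuda() calls.
# Assumption: PyTorch is optional here. On ROCm builds of PyTorch,
# torch.cuda.is_available() returns True because HIP reuses the
# CUDA API surface.
try:
    import torch
    if torch.cuda.is_available():            # NVIDIA CUDA or AMD ROCm
        device = "cuda"
    elif hasattr(torch.backends, "mps") and torch.backends.mps.is_available():
        device = "mps"                       # Apple Silicon
    else:
        device = "cpu"
except ImportError:
    device = "cpu"                           # no PyTorch installed

print(device)
# model = MyModel().to(device)               # instead of MyModel().cuda()
```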

ROCm’s Official Strategy

Here’s what Elangovan laid out in the interview.

Nod.ai Acquisition for Compiler Infrastructure

AMD acquired Nod.ai, an AI compiler company, in late 2023.
The goal is model optimization and improved portability across hardware, reinforcing the compiler layer that was ROCm’s weakest link.

Elangovan described ROCm as “an ecosystem sustained through ASIC firmware delivery,” indicating a strategy of iterative improvement across AMD GPU generations.

The OneROCm Initiative

“OneROCm” is AMD’s vision for unified acceleration across hardware variants.
The goal is a world where the same ROCm code runs on MI300X, Radeon RX 9070, and Strix Halo.

Porting from NVIDIA GPUs to AMD GPUs remains challenging, though.
The HIPIFY conversion tool exists, but the process is non-trivial for most developers.
In practice, routing through higher-level frameworks like vLLM or SGLang is more practical than direct code conversion.
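Much of what HIPIFY does is mechanical API renaming; the hard part is everything it can’t rename (inline PTX, warp-size assumptions, NVIDIA-library calls). A toy sketch of just the mechanical part, using a hand-picked subset of real CUDA-to-HIP name pairs (this is an illustration, not the tool’s actual implementation):

```python
# Toy illustration of the mechanical renaming HIPIFY performs.
# The name pairs below are real CUDA/HIP API correspondences;
# the translation strategy (plain string replace) is deliberately naive.
CUDA_TO_HIP = {
    "cudaMalloc": "hipMalloc",
    "cudaMemcpy": "hipMemcpy",
    "cudaFree": "hipFree",
    "cudaDeviceSynchronize": "hipDeviceSynchronize",
    "cudaGetLastError": "hipGetLastError",
}

def toy_hipify(src: str) -> str:
    """Naively rewrite CUDA runtime calls to their HIP equivalents."""
    for cuda_name, hip_name in CUDA_TO_HIP.items():
        src = src.replace(cuda_name, hip_name)
    return src

print(toy_hipify("cudaMalloc(&p, n); cudaFree(p);"))
# → hipMalloc(&p, n); hipFree(p);
```

The mechanical part is easy, which is why the residual work (kernels the mapping can’t cover) is what makes porting non-trivial in practice.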

Open Source Strategy

ROCm leans toward open source.
Rather than CUDA-style lock-in, it welcomes community contributions.
Cross-platform kernel development using OpenAI’s Triton framework is one of the community-driven approaches being explored.

As a strategy, this points in the right direction.
But looking back at what actually happened in my own environment, there’s still a significant gap between strategy and reality.

ROCm Broke Four Times on Strix Halo

From here on, this is my firsthand experience running ROCm on an EVO-X2 (Ryzen AI Max+ 395 / Radeon 8060S).
The “ROCm progress” described in the EE Times article is real, but the actual user experience is much messier.

Act 1: ROCm Won’t Run

When I set up a local LLM environment on the EVO-X2, the first wall was ROCm’s lack of support.
Ollama defaults to the ROCm backend, but Strix Halo (gfx1151) isn’t on the supported GPU list.
It either crashed at load time or fell back to CPU.

I ended up switching to LM Studio’s Vulkan backend to get GPU inference working.
In other words, from day one I was using an AMD GPU while bypassing AMD’s own ROCm via Vulkan.

Act 2: Qwen 3.5 Total Failure

When I tried running Qwen 3.5 with Ollama, things got worse.

| Backend | Result |
| --- | --- |
| ROCm (Ollama) | Loads but infinite garbage token loop |
| Vulkan (Ollama) | Can’t even load, crashes |
| Vulkan (LM Studio) | Loads but crashes on inference start |
| Metal (Mac) | Works fine |

ROCm produced a meaningless stream like “associates传递更多信息ivitprest” on infinite loop.
I initially suspected abliteration had corrupted the weights, but the official model had the same garbage.
The same model ran flawlessly on Mac’s Metal backend.
The model wasn’t broken. ROCm’s gfx1151 kernels simply couldn’t handle Qwen 3.5’s architecture.

Every backend failed, and there was no way to run Qwen 3.5 on Windows.

Act 3: Driver Update Fixes Everything

Updating the driver to 26.2.2 was like entering a different world.
Qwen 3.5 now ran at 34-54 t/s via Vulkan, and the VRAM shared memory priority issue was resolved.
The EVO-X2’s GPU was finally usable. Or so I thought.

Act 4: Driver 26.3.1 Breaks It Again

Updating to AMD Software 26.3.1 broke things again.
The symptoms were varied.

  • Loading a model via Vulkan succeeds but silently falls back to CPU (54 t/s to 9.5 t/s)
  • ErrorOutOfDeviceMemory failures despite 54GB of free VRAM
  • Corrupted output with mmap=true (a scrambled mix of Chinese, Japanese, and English)
  • Repeated Vulkan load failures leak device memory, unrecoverable until PC restart

The fix was changing the BIOS VRAM allocation from 48GB/16GB to 32GB/32GB.
Reducing VRAM paradoxically made Q6_K (26.8GB) loadable.
The driver uses system RAM as a transfer buffer during model loading, so insufficient system-side RAM causes failures even when VRAM has plenty of headroom.
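That failure mode can be sketched as a simple feasibility check. This is a toy model of my reading of the behavior, not the driver’s actual algorithm, and the `staging_factor` knob is an invented parameter:

```python
def can_load(model_gb: float, vram_gb: float, ram_gb: float,
             staging_factor: float = 1.0) -> bool:
    """Toy model: loading succeeds only if the weights fit in VRAM
    *and* there is enough system RAM to stage them during transfer."""
    fits_vram = model_gb <= vram_gb
    fits_staging = model_gb * staging_factor <= ram_gb
    return fits_vram and fits_staging

# Q6_K quant at 26.8 GB on a 64 GB Strix Halo box:
print(can_load(26.8, vram_gb=48, ram_gb=16))  # 48/16 split: no staging room
print(can_load(26.8, vram_gb=32, ram_gb=32))  # 32/32 split: loads
```

Under this toy model, the “paradox” disappears: the 48/16 split fails not for lack of VRAM but because the 16 GB system side can’t stage a 26.8 GB model.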

Fourth round of troubleshooting.
The optimal BIOS VRAM split changes with every driver update on Strix Halo.

The Pattern

Four rounds of fighting revealed a pattern:

  1. Driver update breaks something
  2. Spend hours to days isolating the cause (driver? llama.cpp? model? BIOS config?)
  3. Find a workaround (BIOS change, --no-mmap, backend switch, etc.)
  4. Stable operation until the next driver update
  5. Return to step 1

It’s true that work that takes 5 minutes on CUDA can eat hours here, and isolating whether the fault lies in the driver, the app, or the hardware config is itself difficult.

Apple Silicon Is Stable but Slow

I also use a Mac (M1 Max 64GB) alongside the EVO-X2.
It lives in a different universe from ROCm’s chaos: the Metal backend is reliably stable.
But it was never going to be fast.

Benchmark Data

| Use case | Environment | Result |
| --- | --- | --- |
| Qwen 72B inference | Ollama / Metal | 5.3 t/s |
| Video gen (Wan 2.2, 2s) | ComfyUI / MPS | 82 min |
| Qwen Image Edit (4-step) | ComfyUI / MPS | 80 s, degraded to 10 min after update |
| BF16 matmul | PyTorch / MPS | 2x slower than FP16 (M1–M3 lack BF16 hardware) |

Apple Silicon is “stable but slow.”
An H100 with CUDA finishes video generation in 2 seconds; the Mac takes 82 minutes.
The hardware is aimed in a different direction, so this is expected.
Unified Memory lets you sidestep VRAM limits for local inference, but raw compute performance doesn’t come close to NVIDIA GPUs.

MLX as a Bright Spot

Apple’s dedicated MLX framework is pointed in the right direction.
It bypasses PyTorch MPS issues (slow BF16, FP16 Attention bugs), and in ComfyUI benchmarks, mflux + MLX + Lightning LoRA matched PyTorch MPS speeds without the black-image risk.

That said, MLX’s ecosystem is still limited and supported models are few.
It’s arguably the best option for local inference on Apple hardware, but it won’t replace CUDA.

The GPU Compute Landscape

Laying out the hardware and software layers makes it clear that the competition is won on software stack depth, not hardware specs.

| Layer | NVIDIA | AMD | Apple |
| --- | --- | --- | --- |
| Hardware | H100 / B200 | MI300X / MI350 | M4 Ultra |
| Low-level API | CUDA | ROCm (HIP) | Metal |
| AI-specific library | cuDNN, TensorRT | MIOpen | MLX |
| Framework support | PyTorch (native) | PyTorch (ROCm) | PyTorch (MPS/MLX) |
| Ecosystem maturity | Dominant | Catching up | Limited |
| Inference server | TensorRT-LLM, vLLM | vLLM (ROCm) | mlx-lm |

NVIDIA is self-contained across every layer, while AMD depends on the community from the framework layer up.
Apple is carving out a niche specifically for local inference.

graph TD
    A[AI Application] --> B[Inference Framework<br/>vLLM / SGLang / mlx-lm]
    B --> C[AI Framework<br/>PyTorch / JAX]
    C --> D1[CUDA + cuDNN]
    C --> D2[ROCm + MIOpen]
    C --> D3[Metal + MLX]
    D1 --> E1[NVIDIA GPU<br/>H100 / B200]
    D2 --> E2[AMD GPU<br/>MI300X / Radeon]
    D3 --> E3[Apple Silicon<br/>M4 Ultra]

    style D1 fill:#76b900,color:#fff
    style D2 fill:#ed1c24,color:#fff
    style D3 fill:#555,color:#fff

Practical Decision Guide

When to Choose CUDA (NVIDIA)

  • Running production AI services
  • Large-scale model training
  • You want OSS to just work out of the box
  • You want to minimize troubleshooting time

Nearly all AI code assumes CUDA.
The odds of hitting a bug are drastically lower.
When in doubt, CUDA. That hasn’t changed.

When to Choose ROCm (AMD)

  • Reducing GPU costs (MI300X is often cheaper than H100)
  • Avoiding vendor lock-in
  • You have engineers who can hack on OSS code
  • You can afford the time for troubleshooting

The upside is cost and openness.
But speaking from honest experience: “models that just don’t run” is normal, “driver updates break things” is normal, and “isolating the root cause when things break is painful” is the reality.
In LLM-jp-4 ROCm benchmarks it hit 62.9 t/s, so the raw performance is there.
AMD’s own tools like Lemonade are improving too.
But there’s a real chance that the money saved on GPU costs gets eaten by engineering hours.
On Strix Halo specifically, between VRAM allocation tuning and driver regression workarounds, you’ll spend time on everything except the GPU itself.

When to Choose MLX (Apple Silicon)

  • Local inference (stability over speed)
  • Agent workloads (lightweight models)
  • Development and testing environments

It runs entirely on Mac with efficient KV cache management thanks to Unified Memory.
Training is essentially out of the question, and OSS support is still limited.
Speed is orders of magnitude slower than NVIDIA, but the peace of mind of “it works” is real.
Not having “driver update broke everything” happen is a significant advantage over ROCm.

Decision Flowchart

graph TD
    Q1{Production service?} -->|YES| A1[CUDA, no question]
    Q1 -->|NO| Q2{Large-scale training?}
    Q2 -->|YES| A2[CUDA recommended]
    Q2 -->|NO| Q3{Cost is top priority?}
    Q3 -->|YES| A3[Consider ROCm<br/>Need troubleshooting capability]
    Q3 -->|NO| Q4{Local development?}
    Q4 -->|YES| A4[MLX / ROCm]
    Q4 -->|NO| A5[CUDA]

    style A1 fill:#76b900,color:#fff
    style A2 fill:#76b900,color:#fff
    style A3 fill:#ed1c24,color:#fff
    style A4 fill:#555,color:#fff
    style A5 fill:#76b900,color:#fff
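The flowchart above, rendered as a plain Python function (my own encoding of the chart, nothing normative about it):

```python
def choose_platform(production: bool = False, large_training: bool = False,
                    cost_first: bool = False, local_dev: bool = False) -> str:
    """Walk the decision flowchart top to bottom: first matching
    question wins, CUDA is the fallback."""
    if production:
        return "CUDA, no question"
    if large_training:
        return "CUDA recommended"
    if cost_first:
        return "Consider ROCm (need troubleshooting capability)"
    if local_dev:
        return "MLX / ROCm"
    return "CUDA"

print(choose_platform(production=True))   # → CUDA, no question
print(choose_platform(cost_first=True))   # → Consider ROCm (need troubleshooting capability)
print(choose_platform())                  # → CUDA
```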

Where Is the “Usable” Line for ROCm?

Drawing on my own Strix Halo experience with ROCm and Vulkan, here’s where the realistic boundaries are.

Usable

  • Operations within PyTorch’s officially supported scope (training and inference)
  • Major HuggingFace models (Llama family, Qwen family, Mistral family)
  • llama.cpp’s ROCm backend (hipBLAS), but only when the GPU target (gfxXXXX) is already supported
  • vLLM with ROCm support (MI300X or higher is realistic)
  • llama.cpp’s Vulkan backend, the practical escape route when ROCm doesn’t work

“Usable” and “reliably usable” are different things, though.
In Lemonade testing, the gap between Vulkan’s shared memory leaks and ROCm’s stability was apparent.
Things run, but whether they run with the same reliability as CUDA is another question entirely.

Still Difficult

  • OSS projects with custom CUDA kernels (manual porting required)
  • Inference pipelines dependent on TensorRT
  • Distributed multi-GPU training (environments without NVLink-equivalent interconnect)
  • Debug tooling equivalent to CUDA ecosystem tools (Nsight, etc.)
  • Stable operation on newer GPU targets (RDNA 3.5+ like gfx1151)

That last point is my experience exactly.
gfx1151 has immature ROCm kernels and Vulkan drivers; both Qwen 3.5 failing on all backends and Vulkan breaking after a driver update were consequences of being on a new GPU target.

Realistic Conditions for Migrating from CUDA

  1. The migration target code stays within PyTorch’s standard APIs
  2. No custom CUDA kernels
  3. You have engineers who can investigate ROCm bugs on their own
  4. Your schedule tolerates “it takes time to get things running”
  5. You can find workarounds when driver updates break things

Conditions 3 through 5 are effectively saying the same thing.
“Can you solve ROCm problems on your own?” is the single biggest factor in whether migration is viable.

Looking Ahead

Short-term, CUDA’s dominance won’t budge.
Mid-term, ROCm will gain presence as a “viable alternative.”
If intermediate layers like OpenAI Triton and Apache TVM (ML compilers) mature, multi-backend workflows could become realistic long-term.

The EE Times interview is Part 1; Part 2 may go deeper. I’ll add notes when it’s published.


I keep using both ROCm and MLX on Strix Halo, and work that takes 5 minutes on CUDA still sometimes eats hours here.
From initial setup to VRAM allocation battles, Ollama total failure, driver to the rescue, and driver breaking things again, I’ve lost count of how many times things broke and how many times I fixed them.
Even so, I think AMD is heading in the right direction, and with Lemonade and LLM-jp-4 ROCm benchmarks, ROCm’s practical utility in local environments is steadily improving.
But between AMD’s vision of “the future” and what users experience “today,” there’s still one driver regression’s worth of gap.