Wan 14B & FastWan on Radeon 8060S: ZLUDA fails, TheRock gfx1151 wheel works

I keep redoing local video generation on different machines. On an M1 Max 64GB, Wan 2.2 took 82 minutes for 2 seconds; on an RTX 4060 Laptop 8GB, FramePack F1 took 56 minutes for 5 seconds. Both ran, but slowly, and in both cases the wall was memory, not GPU compute.

This time it’s the EVO-X2. With 48GB assigned to VRAM, unlike the 4060’s 8GB or the M1 Max’s shared 64GB, VRAM shouldn’t be the wall. What I’m unsure of is whether PyTorch runs on an AMD GPU at all. The previous two boxes were CUDA and Metal, where “PyTorch runs” was a given. With ROCm you have to question that first.

Test setup

Item	Value
Machine	GMKtec NucBox EVO-X2
CPU	AMD Ryzen AI Max+ 395 (~3000 MHz)
GPU	AMD Radeon 8060S (RDNA 3.5 / gfx1151)
Memory	64GB UMA (BIOS split: VRAM 48GB / System RAM 16GB)
OS	Windows 11 Pro 26200
PyTorch	2.11.0+rocm7.13.0
diffusers	0.38.0
Python	3.11.15 (Miniconda)

Let me cover UMA first. UMA (Unified Memory Architecture) means the CPU and GPU share the same physical memory; on Strix Halo the BIOS splits the 64GB. Here that’s 48GB to VRAM and 16GB to system RAM. The GPU side is plentiful, but the CPU side has only 16GB. This asymmetry stays a constraint all the way through the second half.

Getting PyTorch to run on an AMD GPU

This took the most time. Up front, here’s how it went: the CUDA compatibility layer failed, PyTorch’s own ROCm wheels don’t exist for Windows, and I ended up on AMD’s Windows ROCm wheel from TheRock. Three attempts, in order.

flowchart TD
    A[Get PyTorch running on ROCm] --> B[Attempt 1: ZLUDA + CUDA PyTorch]
    B --> B1[GPU detected OK<br/>53.9GB VRAM visible]
    B1 --> B2[All ops: named symbol not found<br/>even addition fails]
    B2 --> C[Attempt 2: official ROCm wheel]
    C --> C1[Not shipped for Windows<br/>No matching distribution]
    C1 --> D[Attempt 3: AMD TheRock gfx1151 wheel]
    D --> D1[All ops OK<br/>native ROCm, no ZLUDA]

Attempt 1: operations don’t go through

I tried ZLUDA first. ZLUDA is a compatibility layer that converts CUDA binaries to AMD’s HIP (the ROCm equivalent of the CUDA API), so the plan was to run stock CUDA PyTorch on AMD as-is.

GPU detection passes:

# run via zluda_with.exe
CUDA available: True
Device: AMD Radeon(TM) 8060S Graphics [ZLUDA]
VRAM: 53.9 GB

But it dies the moment a computation starts.

RuntimeError: CUDA error: named symbol not found

Allocating a tensor (just reserving memory) works, but even a + b won’t go through. Falling back to gfx1100 with HSA_OVERRIDE_GFX_VERSION=11.0.0 changed nothing.

The cause isn’t the GPU generation; it’s how PyTorch is distributed versus what ZLUDA needs. What matters here is the difference between PTX and SASS.

Form	What it is	Relation to ZLUDA
PTX	NVIDIA’s intermediate representation. A virtual instruction set (bytecode-like) that the driver compiles to each GPU at runtime	ZLUDA reads this PTX and converts it to HIP. With PTX present, it works
SASS / ELF	Native machine code precompiled for a specific GPU arch	ZLUDA can’t convert it

The official PyTorch CUDA wheels don’t include PTX. They ship only precompiled SASS/ELF kernels. ZLUDA can’t find the source PTX to translate, so it throws “named symbol not found” the instant a kernel is called (ZLUDA issue #626). It’s not a gfx1151-specific problem; this combination simply can’t work.
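
One way to check this on a particular CUDA build is to look at the architecture list the wheel was compiled for. The sketch below assumes the usual naming, where `sm_XX` entries are precompiled SASS and `compute_XX` entries are embedded PTX. The sample list is made up for illustration and is not output from the wheels discussed here; `split_arch_list` is an illustrative helper name.

```python
def split_arch_list(arch_list):
    """Separate precompiled SASS targets from embedded PTX targets.

    Assumes entries look like 'sm_86' (SASS) or 'compute_90' (PTX),
    the format torch.cuda.get_arch_list() returns on CUDA builds.
    """
    sass = [a for a in arch_list if a.startswith("sm_")]
    ptx = [a for a in arch_list if a.startswith("compute_")]
    return sass, ptx


# Hypothetical arch list, for illustration only.
sample = ["sm_80", "sm_86", "sm_90"]
sass, ptx = split_arch_list(sample)
print("SASS targets:", sass)
print("PTX targets: ", ptx or "none, so nothing for a PTX translator to read")
```

On a CUDA build, `split_arch_list(torch.cuda.get_arch_list())` gives the real answer for that wheel.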

Attempt 2: the official ROCm wheel isn’t on Windows

Dropping ZLUDA, I went to install PyTorch’s official ROCm wheel directly.

pip install torch --index-url https://download.pytorch.org/whl/rocm6.2
pip install torch --index-url https://download.pytorch.org/whl/rocm6.3
pip install torch --index-url https://download.pytorch.org/whl/rocm6.4
# → all fail: "No matching distribution found for torch"

PyTorch’s official ROCm wheels are Linux-only; there’s no Windows build. Changing the version doesn’t surface one.
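
pip reports “No matching distribution” when no file on the index carries a platform tag the current interpreter accepts. The sketch below shows that filter applied to wheel filenames. The filenames are hypothetical stand-ins, not a listing of the real index, and the helper names are illustrative.

```python
def platform_tag(wheel_filename):
    """Return the platform tag: the last dash-separated field before '.whl'."""
    stem = wheel_filename[: -len(".whl")]
    return stem.split("-")[-1]


def wheels_for(platform_key, filenames):
    """Keep only wheels whose platform tag contains the given key."""
    return [f for f in filenames if platform_key in platform_tag(f)]


# Hypothetical filenames for illustration; not copied from the index.
index_listing = [
    "torch-0.0.0+rocmX-cp311-cp311-linux_x86_64.whl",
    "torch-0.0.0+rocmX-cp312-cp312-linux_x86_64.whl",
]
print(wheels_for("linux", index_listing))  # both entries survive
print(wheels_for("win", index_listing))    # empty: nothing installable on Windows
```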

Attempt 3: AMD’s TheRock gfx1151 wheel works

What worked was AMD’s TheRock, a project that builds and ships ROCm and PyTorch together and publishes Windows PyTorch wheels for gfx1151. gfx1151 (the LLVM target name for the RDNA 3.5 iGPU in Strix Halo) is the only consumer GPU target marked “Release Ready” on Windows.

pip install --index-url https://repo.amd.com/rocm/whl/gfx1151/ torch torchvision torchaudio
# → torch-2.11.0+rocm7.13.0
# → rocm-sdk-core-7.13.0 / rocm-sdk-libraries-gfx1151-7.13.0
# hipcc, amdclang++ bundled too

Verify:

import torch
print(torch.__version__)               # 2.11.0+rocm7.13.0
print(torch.cuda.is_available())       # True
print(torch.cuda.get_device_name(0))   # AMD Radeon(TM) 8060S Graphics
print(torch.cuda.get_device_capability(0))  # (11, 5)

d = torch.device('cuda')
a = torch.tensor([1.0, 2.0, 3.0], device=d)
b = torch.tensor([4.0, 5.0, 6.0], device=d)
print(a + b)  # tensor([5., 7., 9.], device='cuda:0')
# fp32 / fp16 / bf16 all pass

Addition goes through, and so do fp16 and bf16. Native ROCm with no ZLUDA, and torch.cuda code runs as-is. Only now am I at the starting line.
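
The “fp32 / fp16 / bf16 all pass” check can be scripted roughly as below. This is a sketch rather than the exact script used; `smoke_test` is an illustrative name, and it falls back to the CPU when no GPU is visible so it can run anywhere PyTorch is installed.

```python
import torch


def smoke_test(device=None):
    """Run a tiny add and multiply-sum per dtype; map dtype name -> error or None."""
    if device is None:
        device = "cuda" if torch.cuda.is_available() else "cpu"
    results = {}
    for dtype in (torch.float32, torch.float16, torch.bfloat16):
        try:
            a = torch.tensor([1.0, 2.0, 3.0], device=device, dtype=dtype)
            b = torch.tensor([4.0, 5.0, 6.0], device=device, dtype=dtype)
            total = (a + b).sum().item()  # .item() forces the kernel to actually run
            dot = (a * b).sum().item()
            assert total == 21.0 and dot == 32.0
            results[str(dtype)] = None
        except (RuntimeError, AssertionError) as exc:
            results[str(dtype)] = repr(exc)
    return results


for name, err in smoke_test().items():
    print(name, "ok" if err is None else f"FAILED: {err}")
```

Pulling the result back with `.item()` matters: allocation alone succeeded under ZLUDA, so a check that never executes a kernel would not have caught that failure.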

FastVideo was unusable, so I call diffusers directly

The original target was FastWan, which normally uses the FastVideo framework. But it doesn’t run on Windows ROCm.

Blocker	Detail
Triton dependency	`fastvideo-kernel` requires `triton>=2.0.0`. Triton has no Windows support
torch.distributed missing	The TheRock PyTorch build has no `torch.distributed`; the import chain dies early on a missing `torch._C._distributed_c10d`

The second is the messy one: comment out one spot and the next distributed import dies, whack-a-mole style. The framework as a whole simply wasn’t written with Windows ROCm in mind.
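
A quick preflight makes blockers like these visible before committing to a framework. The sketch below just tries each import and records what breaks. The module list is taken from the two blockers above, not from FastVideo’s actual dependency list, and `probe_imports` is an illustrative name.

```python
import importlib


def probe_imports(module_names):
    """Try importing each module; map name -> None on success, or the error text."""
    report = {}
    for name in module_names:
        try:
            importlib.import_module(name)
            report[name] = None
        except Exception as exc:  # ImportError, or anything raised during module init
            report[name] = f"{type(exc).__name__}: {exc}"
    return report


for name, err in probe_imports(["triton", "torch.distributed"]).items():
    print(f"{name:20s}", "ok" if err is None else err)
```

A clean import is not the whole story for the second entry: `torch.distributed.is_available()` reports whether the build actually includes the distributed backend.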

So I change approach. Drop the FastVideo “framework” and call just the “model” directly from diffusers.

from diffusers import WanPipeline  # this just works

One bit of background. FastWan’s headline speed comes from VSA (Video Sparse Attention, which computes attention sparsely instead of all-frames-against-all), and as I confirmed last time, VSA only works on H100 / A100 / 4090, so it isn’t available on the 8060S. What I can still use is DMD distillation (Distribution Matching Distillation, which distills a many-step diffusion model down to a few steps, 3 here). That is a property of the weights rather than of the framework, so even with the standard WanPipeline class swapped in, the model still runs in 3 steps.

Running FastWan 1.3B (T2V)

Download the model.

from huggingface_hub import snapshot_download
snapshot_download('FastVideo/FastWan2.1-T2V-1.3B-Diffusers')
# 29 files, ~28GB (mostly the UMT5 text encoder). ~5 min download

model_index.json lists WanDMDPipeline as the original class, but diffusers 0.38.0 doesn’t have it. I substitute the standard WanPipeline. A warning shows at startup:

Some weights of the model checkpoint were not used when initializing WanTransformer3DModel:
 ['blocks.*.to_gate_compress.bias', 'blocks.*.to_gate_compress.weight']

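The to_gate_compress layers aren’t in the standard WanTransformer3DModel, so their weights are ignored. Judging by the name they may belong to FastWan’s sparse-attention path rather than to DMD itself, but I haven’t verified that, and I haven’t checked the impact on quality.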

Inference parameters:

Parameter	Value
Model	FastVideo/FastWan2.1-T2V-1.3B-Diffusers
Resolution	480x480
Frames	25 (≈1s @24fps)
Steps	3 (DMD distilled)
guidance_scale	1.0
Precision	fp16
Attention	SDPA (PyTorch’s Scaled Dot-Product Attention)
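
Put together, the run looks roughly like the sketch below. It is a reconstruction from the parameter table, not the exact script: the function name, output filename, and `fps=24` are placeholders, and the heavy imports live inside the function so the file can be loaded without a GPU.

```python
MODEL_ID = "FastVideo/FastWan2.1-T2V-1.3B-Diffusers"

# Values from the parameter table above.
GEN_KWARGS = dict(
    height=480,
    width=480,
    num_frames=25,          # Wan pipelines work with 4k + 1 frames
    num_inference_steps=3,  # DMD-distilled weights
    guidance_scale=1.0,
)


def generate(prompt, out_path="fastwan_t2v.mp4"):
    """Load the checkpoint with the stock WanPipeline and write one clip."""
    import torch
    from diffusers import WanPipeline
    from diffusers.utils import export_to_video

    pipe = WanPipeline.from_pretrained(MODEL_ID, torch_dtype=torch.float16)
    pipe.to("cuda")  # ROCm builds expose the GPU through the torch.cuda API
    frames = pipe(prompt=prompt, **GEN_KWARGS).frames[0]
    export_to_video(frames, out_path, fps=24)  # fps is a placeholder
    return out_path
```

Calling `generate("...")` with the prompt text is all that remains; nothing in the sketch is specific to AMD beyond the wheel PyTorch came from.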

Prompt: a brown-haired anime girl in a red tie throwing a peace sign and winking. Here’s the result.

One second (25 frames) came out in 263 seconds. The breakdown is the problem.

Phase	Time
Model load (CPU)	25.2s
GPU transfer	19.2s
DiT, 3 steps total	35.5s (22.2 / 13.0 / 10.2s, shorter after warmup)
VAE decode + post	~227.7s
Total generation	263.2s (4m23s)

The 3 DiT steps, the diffusion core, finish in 35.5s. VAE decode takes 227.7s of the 263.2s, about 86%. The VAE (the Conv3D-based AutoencoderKLWan that ships with the 1.3B checkpoint) only turns the 25-frame 480×480 latents back into pixels, yet it is what set the speed.
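
The share of each phase follows directly from the timing table. A few lines make the arithmetic explicit, using the numbers from the table and counting generation time only (load and GPU transfer excluded).

```python
# Seconds per phase, from the timing table (load and GPU transfer excluded).
phases = {"DiT (3 steps)": 35.5, "VAE decode + post": 227.7}

total = sum(phases.values())
shares = {name: 100 * secs / total for name, secs in phases.items()}

print(f"total: {total:.1f}s")
for name, pct in shares.items():
    print(f"{name:18s} {pct:5.1f}%")
```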

Memory had plenty of headroom.

Metric	Value
Total VRAM	53.9 GB
VRAM used (after load)	15.7 GB
VRAM peak (generation)	20.9 GB
System RAM used	15.0 / 15.6 GB (near limit)

VRAM peaks at 20.9GB, leaving 33GB free. System RAM, meanwhile, sits near its ceiling at 15.0/15.6GB. It’s the same pattern I pinned down in the 4060 post: RAM is the wall, not VRAM. But because of UMA, once the model is on VRAM the RAM side is freed, so it never spilled to the page file.

A warning shows during generation:

Flash Efficient attention on Current AMD GPU is still experimental.
Enable it with TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1.

AMD’s Flash Attention (the AOTriton implementation) is still experimental, so by default SDPA runs without it and uses its dense fallback. Setting this env var enables Flash Attention. I use it in the I2V part below.

Animating Kana-chan with Wan 2.1 I2V 14B

With T2V working, next is the real target: I2V (image-to-video, animating from a still). I use Kana-chan’s front-facing peace-sign image from the FramePack post as input.

Kana-chan's front-facing peace-sign image used as the Wan I2V input

The model is Wan-AI/Wan2.1-I2V-14B-720P-Diffusers, ~28GB in fp16. It should fit in 48GB VRAM with room to spare, but loading hit a wall three times.

Loading crashes with a Segfault

First I tried a plain CPU load. Even with low_cpu_mem_usage=True, it Segfaults at transformer shard 3/14 (around 6GB). The crash site is _local_scalar_dense_cpu. 16GB of system RAM can’t hold a 28GB model while it’s staged on the CPU, so it dies partway.

Next I tried device_map="balanced" to load straight to the GPU. This skips CPU RAM and assigns shards directly to the GPU. All 14 shards land on the GPU at 46.6GB used. It fits, but only 7.3GB of VRAM is left.

Generating at 832×480×33 frames then fails with an attention OOM. Dense attention (SDPA without Flash Attention) asks for a 29.38GB buffer, which won’t fit in 7.3GB.
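
A back-of-the-envelope estimate lands close to that figure. The sketch below counts the tokens the transformer attends over and sizes one full score matrix. The architecture constants (8× spatial and 4× temporal VAE compression, 2×2 spatial patches, 40 attention heads) are assumed values for Wan 2.1 14B rather than something measured here, and the 4 bytes per score is a guess at how the dense path stores them.

```python
def attention_scores_gib(width, height, frames, heads=40, bytes_per_score=4):
    """Estimate the size of one dense attention score matrix, in GiB.

    Assumed constants: the VAE compresses 8x spatially and 4x temporally,
    and the transformer patchifies latents 2x2 spatially.
    """
    latent_frames = (frames - 1) // 4 + 1
    tokens = latent_frames * (height // 8 // 2) * (width // 8 // 2)
    size_bytes = heads * tokens * tokens * bytes_per_score
    return tokens, size_bytes / 2**30


for w, h, f in [(832, 480, 33), (480, 480, 33)]:
    tokens, gib = attention_scores_gib(w, h, f)
    print(f"{w}x{h}x{f}f: {tokens} tokens, ~{gib:.1f} GiB of scores")
```

Under these assumptions the 832×480 case comes out around 29 GiB, in line with the failed allocation. Even 480×480 would need close to 10 GiB if the matrix were materialized in full, which is more than the VRAM left free.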

The settings that got it through

Here are the settings that finally got 480×480×33 frames (about 2 seconds) through.

import os

# Set these before importing torch, to be safe about when they are read.
os.environ["TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL"] = "1"  # enable Flash Attention
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch
from diffusers import WanImageToVideoPipeline

model_id = "Wan-AI/Wan2.1-I2V-14B-720P-Diffusers"

pipe = WanImageToVideoPipeline.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="balanced",                    # load shards straight onto the GPU
    max_memory={0: "48GiB", "cpu": "12GiB"},  # leave 12GiB of room on the CPU side
)
pipe.vae.enable_tiling()    # tile the VAE to cut memory
pipe.vae.enable_slicing()   # decode one batch item at a time

Three things got it through.

Fix	Effect
`device_map="balanced"`	Loads straight to GPU, bypassing CPU RAM. Avoids the Segfault during CPU expansion
`max_memory` reserving 12GiB on CPU	Spreads part of the text encoder onto the CPU side, lowering the GPU-load path’s memory pressure and clearing the Segfault
Flash Attention (AOTriton) enabled	Avoids materializing the full attention matrix, so 480×480×33f fits in the 7.3GB of free VRAM

The second was the deciding one. With just device_map="balanced", once the transformer (28GB) is on the GPU, loading always Segfaults at weight 127/242 of the text encoder’s (UMT5, ~10GB) safetensors. Giving the CPU side 12GiB of room with max_memory lowers the memory pressure on the GPU-load path, and the load goes through. Flash Attention is what makes generation possible afterwards: turn it off and attention reverts to the dense SDPA path, where the 29GB attention matrix for 832×480×33f can’t fit in the 7.3GB that’s free.

Result

Parameter	Value
Model	Wan-AI/Wan2.1-I2V-14B-720P-Diffusers
Input image	frame_000.png (608x640)
Output resolution	480x480
Frames	33 (≈2s @16fps)
Steps	20
guidance_scale	5.0
Precision	fp16

Phase	Time
Model load (direct to GPU)	60.8s
DiT, 20 steps	~768s (38.4s/step)
VAE decode + post	~47s
Total generation	815.1s (13.6 min)

Metric	Value
VRAM after model load	44.6 GB
VRAM peak	45.4 GB
System RAM used	6.2 / 15.6 GB

The 14B fits in 48GB VRAM. But the model eats 44.6GB, leaving only about 9GB for inference buffers. Flash Attention fit 480×480×33f into that.

Here the fixed BIOS split becomes a liability. Splitting 48GB/16GB leaves only 16GB on the CPU side, so enable_model_cpu_offload (a low-VRAM technique that keeps the model on the CPU and sends only what’s needed to the GPU) can’t be used — CPU offload Segfaults the moment it expands the model on the CPU. Re-splitting to 32GB/32GB would let CPU offload use the full GPU for compute, but that’s a separate test. Unlike T2V, this one was tight on both VRAM and RAM.

Compared with the earlier runs

Lining up the two previous machines with this run:

Setup	Model	Mode	Resolution	Frames	Steps	Time
EVO-X2 8060S (now, T2V)	FastWan 1.3B	T2V	480x480	25	3	263s (4.4 min)
EVO-X2 8060S (now, I2V)	Wan 14B	I2V	480x480	33	20	815s (13.6 min)
M1 Max	Wan 14B GGUF	I2V	832x480	33	20	4965s (82 min)
4060 Laptop	FramePack F1 13B	I2V	608x640	145	-	3363s (56 min)

For Wan 14B I2V at 33 frames, it’s the EVO-X2’s 815s against the M1 Max’s 4965s, roughly 6× faster. The resolution dropped from 832×480 to 480×480, so it’s not a clean comparison, but the gap is large. It reflects the difference between the M1 Max, juggling memory with CPU offload, and the EVO-X2, which can load the whole model into 48GB of VRAM.

For reference, FastWan’s official VSA-on benchmark is ~5s for 81 frames on an H200 and ~21s on a 4090. The EVO-X2’s DiT-only 35.5s (3 steps) is about 1.7× the 4090’s ~21s, which is not bad given the GPU gap, though my run is 25 frames against the benchmark’s 81, so the numbers aren’t directly comparable. What’s slow in FastWan here is the VAE decode, not the DiT.
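
Since the four runs differ in frame count, it helps to normalize. The snippet below divides wall time by frames and computes the M1 Max to EVO-X2 ratio and the DiT-versus-4090 ratio, using only numbers from the table and the paragraph above. Resolution and step count still differ between rows, so this is a rough yardstick.

```python
# (label, total seconds, frames) from the comparison table.
runs = [
    ("EVO-X2 FastWan 1.3B T2V", 263, 25),
    ("EVO-X2 Wan 14B I2V", 815, 33),
    ("M1 Max Wan 14B I2V", 4965, 33),
    ("4060 Laptop FramePack F1", 3363, 145),
]

for label, secs, frames in runs:
    print(f"{label:26s} {secs / frames:6.1f} s/frame")

m1_vs_evo = 4965 / 815   # same frame count on both machines
dit_vs_4090 = 35.5 / 21  # EVO-X2 DiT-only time vs the 4090 benchmark figure
print(f"M1 Max / EVO-X2: {m1_vs_evo:.1f}x, EVO-X2 DiT / 4090: {dit_vs_4090:.1f}x")
```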

What decides the speed

Across all three machines, the bottleneck wasn’t the model or GPU-core speed but something upstream of it.

Machine	Bottleneck
4060	The 26GB model doesn’t fit in 32GB RAM and spills to the page file; DynamicSwap re-reads from disk every step. GPU utilization stays in single digits
M1 Max	Runs with CPU offload on 64GB shared memory, so the transfer overhead piles on
EVO-X2	Full-loads into 48GB VRAM and the GPU runs normally. In exchange, FastWan’s VAE decode (Conv3D is slow on ROCm) takes 86% of the time

EVO-X2 has a clear edge over the previous two on one point: the model fits entirely in VRAM. The 48GB of headroom matters. On the other hand, VAE decode and attention on AMD are still immature: the Conv3D-heavy VAE is slow, and Flash Attention sits behind an experimental flag. “It runs now” is where things stand.

And the fixed UMA split cuts both ways. Put 48GB into VRAM and big models load, but the CPU side shrinks to 16GB and brings Segfaults and no CPU offload. In the end you just change the split in the BIOS per workload.

References

AMD TheRock — project that builds and ships ROCm + PyTorch
gfx1151 PyTorch wheel: pip install --index-url https://repo.amd.com/rocm/whl/gfx1151/ torch
FastVideo / FastWan — the upstream FastWan framework
ZLUDA issue #626 — the PTX incompatibility in official PyTorch wheels