Wan 14B & FastWan on Radeon 8060S: ZLUDA fails, TheRock gfx1151 wheel works
I keep redoing local video generation on different machines. On an M1 Max 64GB, Wan 2.2 took 82 minutes for 2 seconds; on an RTX 4060 Laptop 8GB, FramePack F1 took 56 minutes for 5 seconds. Both “run but slowly,” and the wall was on the memory side, not the GPU.
This time it’s the EVO-X2. With 48GB assigned as VRAM, unlike the 4060’s 8GB or the M1 Max’s shared 64GB, VRAM shouldn’t be the wall. What I’m unsure of is whether PyTorch runs on an AMD GPU at all. The previous two boxes were CUDA and Metal, where “PyTorch runs” was a given. With ROCm you have to question that first.
Test setup
| Item | Value |
|---|---|
| Machine | GMKtec NucBox EVO-X2 |
| CPU | AMD Ryzen AI Max+ 395 (~3000 MHz) |
| GPU | AMD Radeon 8060S (RDNA 3.5 / gfx1151) |
| Memory | 64GB UMA (BIOS split: VRAM 48GB / System RAM 16GB) |
| OS | Windows 11 Pro 26200 |
| PyTorch | 2.11.0+rocm7.13.0 |
| diffusers | 0.38.0 |
| Python | 3.11.15 (Miniconda) |
Let me cover UMA first. UMA (Unified Memory Architecture) means the CPU and GPU share the same physical memory; on Strix Halo the BIOS splits the 64GB. Here that’s 48GB to VRAM and 16GB to system RAM. The GPU side is plentiful, but the CPU side has only 16GB. This asymmetry stays a constraint all the way through the second half.
Getting PyTorch to run on an AMD GPU
This took the most time. To lay out what I did up front: every CUDA-compat layer failed, and I ended up on AMD’s official Windows ROCm wheel. Three attempts, in order.
flowchart TD
A[Get PyTorch running on ROCm] --> B[Attempt 1: ZLUDA + CUDA PyTorch]
B --> B1[GPU detected OK<br/>53.9GB VRAM visible]
B1 --> B2[All ops: named symbol not found<br/>even addition fails]
B2 --> C[Attempt 2: official ROCm wheel]
C --> C1[Not shipped for Windows<br/>No matching distribution]
C1 --> D[Attempt 3: AMD TheRock gfx1151 wheel]
D --> D1[All ops OK<br/>native ROCm, no ZLUDA]
Attempt 1: operations don’t go through
I tried ZLUDA first. ZLUDA is a compatibility layer that converts CUDA binaries to AMD’s HIP (the ROCm equivalent of the CUDA API), so the plan was to run stock CUDA PyTorch on AMD as-is.
GPU detection passes:
# run via zluda_with.exe
CUDA available: True
Device: AMD Radeon(TM) 8060S Graphics [ZLUDA]
VRAM: 53.9 GB
But it dies the moment a computation starts.
RuntimeError: CUDA error: named symbol not found
Allocating a tensor (just reserving memory) works, but even a + b won’t go through. Falling back to gfx1100 with HSA_OVERRIDE_GFX_VERSION=11.0.0 changed nothing.
The cause isn’t the GPU generation; it’s how PyTorch is distributed versus what ZLUDA needs. What matters here is the difference between PTX and SASS.
| Form | What it is | Relation to ZLUDA |
|---|---|---|
| PTX | NVIDIA’s intermediate representation. A virtual instruction set (bytecode-like) that the driver compiles to each GPU at runtime | ZLUDA reads this PTX and converts it to HIP. With PTX present, it works |
| SASS / ELF | Native machine code precompiled for a specific GPU arch | ZLUDA can’t convert it |
The official PyTorch CUDA wheels don’t include PTX. They ship only precompiled SASS/ELF kernels. ZLUDA can’t find the source PTX to translate, so it throws “named symbol not found” the instant a kernel is called (ZLUDA issue #626). It’s not a gfx1151-specific problem; this combination simply can’t work.
Attempt 2: the official ROCm wheel isn’t on Windows
Dropping ZLUDA, I went to install PyTorch’s official ROCm wheel directly.
pip install torch --index-url https://download.pytorch.org/whl/rocm6.2
pip install torch --index-url https://download.pytorch.org/whl/rocm6.3
pip install torch --index-url https://download.pytorch.org/whl/rocm6.4
# → all fail: "No matching distribution found for torch"
PyTorch’s official ROCm wheels are Linux-only; there’s no Windows build. Changing the version doesn’t surface one.
Attempt 3: AMD’s TheRock gfx1151 wheel works
What worked was AMD’s TheRock — a project that builds and ships ROCm and PyTorch together, and it was publishing Windows PyTorch wheels for gfx1151. gfx1151 (the LLVM target name for the iGPU of RDNA 3.5 / Strix Halo) is the only consumer GPU marked “Release Ready” on Windows.
pip install --index-url https://repo.amd.com/rocm/whl/gfx1151/ torch torchvision torchaudio
# → torch-2.11.0+rocm7.13.0
# → rocm-sdk-core-7.13.0 / rocm-sdk-libraries-gfx1151-7.13.0
# hipcc, amdclang++ bundled too
Verify:
import torch
print(torch.__version__) # 2.11.0+rocm7.13.0
print(torch.cuda.is_available()) # True
print(torch.cuda.get_device_name(0)) # AMD Radeon(TM) 8060S Graphics
print(torch.cuda.get_device_capability(0)) # (11, 5)
d = torch.device('cuda')
a = torch.tensor([1.0, 2.0, 3.0], device=d)
b = torch.tensor([4.0, 5.0, 6.0], device=d)
print(a + b) # tensor([5., 7., 9.], device='cuda:0')
# fp32 / fp16 / bf16 all pass
Addition goes through, and so do fp16 and bf16. Native ROCm with no ZLUDA, and torch.cuda code runs as-is. Only now am I at the starting line.
FastVideo was unusable, so I call diffusers directly
The original target was FastWan, which normally uses the FastVideo framework. But it doesn’t run on Windows ROCm.
| Blocker | Detail |
|---|---|
| Triton dependency | fastvideo-kernel requires triton>=2.0.0. Triton has no Windows support |
| torch.distributed missing | The TheRock PyTorch build has no torch.distributed; the import chain dies early on a missing torch._C._distributed_c10d |
The second is the messy one: comment out one spot and the next distributed import dies again — whack-a-mole. The whole framework doesn’t assume Windows ROCm.
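Both blockers can be probed before installing FastVideo at all. The sketch below is one way to do it, not something from the original run: `preflight` is a hypothetical helper name, and it uses only `importlib.util.find_spec` from the standard library plus `torch.distributed.is_available()`. Whether `import torch.distributed` itself succeeds on the TheRock wheel is an assumption, so the import is wrapped in a `try`.

```python
import importlib.util


def preflight() -> dict[str, bool]:
    """Report whether the pieces FastVideo depends on are importable."""
    report = {
        "triton": importlib.util.find_spec("triton") is not None,
        "torch": importlib.util.find_spec("torch") is not None,
        "torch.distributed": False,
    }
    if report["torch"]:
        try:
            import torch.distributed as dist

            # is_available() is False when PyTorch was built without distributed support
            report["torch.distributed"] = bool(dist.is_available())
        except ImportError:
            # assumption: some builds may fail at import time instead
            pass
    return report


if __name__ == "__main__":
    for name, ok in preflight().items():
        print(f"{name:18} {'ok' if ok else 'MISSING'}")
```

If either `triton` or `torch.distributed` comes back missing, patching FastVideo import by import is unlikely to pay off, which is the case for going straight to diffusers.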
So I change approach. Drop the FastVideo “framework” and call just the “model” directly from diffusers.
from diffusers import WanPipeline # this just works
One bit of background. FastWan’s speed comes from VSA (Video Sparse Attention, which computes attention sparsely instead of all-frames-against-all), and as I confirmed last time, it only works on H100 / A100 / 4090. VSA isn’t available on the 8060S. This time I use only the “weight-side property” of DMD distillation (Distribution Matching Distillation, which distills a many-step diffusion model down to a few steps — 3 here), not FastWan’s “speed.” So even swapping in the standard WanPipeline class, it still runs in 3 steps.
Running FastWan 1.3B (T2V)
Download the model.
from huggingface_hub import snapshot_download
snapshot_download('FastVideo/FastWan2.1-T2V-1.3B-Diffusers')
# 29 files, ~28GB (mostly the UMT5 text encoder). ~5 min download
model_index.json lists WanDMDPipeline as the original class, but diffusers 0.38.0 doesn’t have it. I substitute the standard WanPipeline. A warning shows at startup:
Some weights of the model checkpoint were not used when initializing WanTransformer3DModel:
['blocks.*.to_gate_compress.bias', 'blocks.*.to_gate_compress.weight']
The DMD-specific to_gate_compress layer isn’t in the standard WanTransformer3DModel, so it’s ignored. I haven’t checked the impact on quality.
Inference parameters:
| Parameter | Value |
|---|---|
| Model | FastVideo/FastWan2.1-T2V-1.3B-Diffusers |
| Resolution | 480x480 |
| Frames | 25 (≈1s @24fps) |
| Steps | 3 (DMD distilled) |
| guidance_scale | 1.0 |
| Precision | fp16 |
| Attention | SDPA (PyTorch’s Scaled Dot-Product Attention) |
Prompt: a brown-haired anime girl in a red tie throwing a peace sign and winking. Here’s the result.
One second (25 frames) came out in 263 seconds. The breakdown is the problem.
| Phase | Time |
|---|---|
| Model load (CPU) | 25.2s |
| GPU transfer | 19.2s |
| DiT, 3 steps total | 35.5s (22.2 / 13.0 / 10.2s, shorter after warmup) |
| VAE decode + post | ~227.7s |
| Total generation | 263.2s (4m23s) |
The 3 DiT steps, the diffusion core, finish in 35.5s. VAE decode takes 227.7s, a full 86% of the 263s. All the VAE does (AutoencoderKLWan from the 1.3B checkpoint, Conv3D-based) is turn the 25-frame 480×480 latents back into pixels, yet that step is what set the overall speed.
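The share of each phase follows directly from the two generation-time rows of the table. A minimal sketch of that arithmetic, using only the numbers reported above (load and transfer time are excluded, as they are in the "Total generation" row):

```python
# Timings in seconds, copied from the generation phases in the table.
phases = {
    "DiT, 3 steps": 35.5,
    "VAE decode + post": 227.7,
}

total = sum(phases.values())
shares = {name: secs / total for name, secs in phases.items()}

for name, share in shares.items():
    print(f"{name:20} {phases[name]:6.1f}s  {share:6.1%}")
print(f"{'Total generation':20} {total:6.1f}s")
```

The same pattern works for any run where phase timings are logged: if one phase dominates like this, optimizing the others barely moves the total.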
Memory had plenty of headroom.
| Metric | Value |
|---|---|
| Total VRAM | 53.9 GB |
| VRAM used (after load) | 15.7 GB |
| VRAM peak (generation) | 20.9 GB |
| System RAM used | 15.0 / 15.6 GB (near limit) |
VRAM peaks at 20.9GB with 33GB to spare. System RAM, meanwhile, is near the ceiling at 15.0/15.6GB. The same shape I pinned down in the 4060 post — “RAM is the wall, not VRAM” — shows up here too. But because of UMA, once the model is on VRAM the RAM side is freed, so it never spilled to the page file.
A warning shows during generation:
Flash Efficient attention on Current AMD GPU is still experimental.
Enable it with TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1.
AMD’s Flash Attention (the AOTriton implementation) is still experimental and falls back to SDPA by default. Setting this env var enables Flash Attention. I use it in the I2V part below.
Animating Kana-chan with Wan 2.1 I2V 14B
With T2V working, next is the real target: I2V (image-to-video, animating from a still). I use Kana-chan’s front-facing peace-sign image from the FramePack post as input.

The model is Wan-AI/Wan2.1-I2V-14B-720P-Diffusers, ~28GB in fp16. It should fit in 48GB VRAM with room to spare, but loading hit a wall three times.
Loading crashes with a Segfault
First I tried a plain CPU load. Even with low_cpu_mem_usage=True, it Segfaults at transformer shard 3/14 (around 6GB). The crash site is _local_scalar_dense_cpu. 16GB of system RAM can’t expand a 28GB model on the CPU, so it dies partway.
Next, device_map="balanced" to load straight to GPU. This skips CPU RAM and assigns shards directly to the GPU. All 14 shards land on the GPU at 46.6GB used — it fits, but only 7.3GB of VRAM is left.
Then trying to generate at 832×480×33 frames, it’s an attention OOM. Dense attention (SDPA) asks for a 29.38GB buffer, which won’t fit in 7.3GB.
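A buffer of that size is roughly what a fully materialized attention score matrix would need. The estimate below is a back-of-envelope sketch, not taken from the run logs, and every architectural constant in it is an assumption: 8× spatial and 4× temporal VAE compression, 2×2 spatial patches in the transformer, 40 attention heads for the 14B model, and 4 bytes per score. The function names are hypothetical.

```python
def wan_token_count(width: int, height: int, frames: int,
                    spatial: int = 8, temporal: int = 4, patch: int = 2) -> int:
    """Sequence length seen by the transformer (assumed compression factors)."""
    latent_frames = (frames - 1) // temporal + 1
    tokens_per_frame = (height // spatial // patch) * (width // spatial // patch)
    return latent_frames * tokens_per_frame


def dense_scores_gib(tokens: int, heads: int = 40, bytes_per_score: int = 4) -> float:
    """Size of a heads x tokens x tokens score matrix, in GiB."""
    return heads * tokens * tokens * bytes_per_score / 2**30


for width, height in [(832, 480), (480, 480)]:
    n = wan_token_count(width, height, frames=33)
    print(f"{width}x{height}x33f: {n} tokens, {dense_scores_gib(n):.2f} GiB of scores")
```

Under these assumptions 832×480×33f comes to 14,040 tokens and about 29.4 GiB of scores, close to the reported request. The same estimate for 480×480×33f is about 9.8 GiB, which still would not fit in 7.3GB, so lowering the resolution alone would not be enough with dense attention.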
The settings that got it through
Here are the settings that finally got 480×480×33 frames (about 2 seconds) through.
import os
# set both before torch is imported
os.environ["TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL"] = "1"  # enable Flash Attention
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
import torch
from diffusers import WanImageToVideoPipeline
model_id = "Wan-AI/Wan2.1-I2V-14B-720P-Diffusers"
pipe = WanImageToVideoPipeline.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="balanced",
    max_memory={0: "48GiB", "cpu": "12GiB"},  # put some on the CPU side
)
pipe.vae.enable_tiling()  # tile the VAE to cut memory
pipe.vae.enable_slicing()
Three things got it through.
| Fix | Effect |
|---|---|
| device_map="balanced" | Loads straight to GPU, bypassing CPU RAM. Avoids the Segfault during CPU expansion |
| max_memory reserving 12GiB on CPU | Spreads part of the text encoder onto the CPU side, lowering the GPU-load path’s memory pressure and clearing the Segfault |
| Flash Attention (AOTriton) enabled | Compresses the attention buffer so 480×480×33f fits the 7.3GB free space |
The second was the deciding one. With just device_map="balanced", once the transformer (28GB) is on the GPU, loading the text encoder’s (UMT5, ~10GB) safetensors always Segfaults at weight 127/242. Giving the CPU side 12GiB of room with max_memory lowers the memory pressure of the GPU-load path, and the load goes through. Flash Attention matters just as much at generation time: turn it off and attention reverts to dense SDPA, whose 29GB buffer for 832×480×33f can’t fit in the 7.3GB that is free.
Result
| Parameter | Value |
|---|---|
| Model | Wan-AI/Wan2.1-I2V-14B-720P-Diffusers |
| Input image | frame_000.png (608x640) |
| Output resolution | 480x480 |
| Frames | 33 (≈2s @16fps) |
| Steps | 20 |
| guidance_scale | 5.0 |
| Precision | fp16 |
| Phase | Time |
|---|---|
| Model load (direct to GPU) | 60.8s |
| DiT, 20 steps | ~768s (38.4s/step) |
| VAE decode + post | ~47s |
| Total generation | 815.1s (13.6 min) |
| Metric | Value |
|---|---|
| VRAM after model load | 44.6 GB |
| VRAM peak | 45.4 GB |
| System RAM used | 6.2 / 15.6 GB |
The 14B fits in 48GB VRAM. But the model eats 44.6GB, leaving only about 9GB for inference buffers. Flash Attention fit 480×480×33f into that.
Here the fixed BIOS split becomes a liability. Splitting 48GB/16GB leaves only 16GB on the CPU side, so enable_model_cpu_offload (a low-VRAM technique that keeps the model on the CPU and sends only what’s needed to the GPU) can’t be used — CPU offload Segfaults the moment it expands the model on the CPU. Re-splitting to 32GB/32GB would let CPU offload use the full GPU for compute, but that’s a separate test. Unlike T2V, this one was tight on both VRAM and RAM.
Compared with the earlier runs
Lining up the two previous machines with this run:
| Setup | Model | Mode | Resolution | Frames | Steps | Time |
|---|---|---|---|---|---|---|
| EVO-X2 8060S (now, T2V) | FastWan 1.3B | T2V | 480x480 | 25 | 3 | 263s (4.4 min) |
| EVO-X2 8060S (now, I2V) | Wan 14B | I2V | 480x480 | 33 | 20 | 815s (13.6 min) |
| M1 Max | Wan 14B GGUF | I2V | 832x480 | 33 | 20 | 4965s (82 min) |
| 4060 Laptop | FramePack F1 13B | I2V | 608x640 | 145 | - | 3363s (56 min) |
For the same Wan 14B I2V at 33 frames, it’s EVO-X2’s 815s against the M1 Max’s 4965s. The resolution dropped from 832×480 to 480×480, so it’s not a clean comparison, but the gap is roughly 6× (4965 / 815 ≈ 6.1). That is the difference between the M1 Max, juggling VRAM with CPU offload, and the EVO-X2, which can full-load into 48GB.
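To put a number on how unclean the comparison is, the wall-clock ratio can be scaled by the amount of output each run produced. This is a crude sketch: it assumes cost grows linearly with pixels × frames, which understates the larger run because attention grows faster than linearly with token count. The dictionary layout and `pixel_frames` are illustrative names, and the figures come from the comparison table.

```python
# Wall-clock seconds and output shape for the two Wan 14B I2V runs.
evo_x2 = {"seconds": 815, "width": 480, "height": 480, "frames": 33}
m1_max = {"seconds": 4965, "width": 832, "height": 480, "frames": 33}


def pixel_frames(run: dict) -> int:
    """Total output pixels across all frames."""
    return run["width"] * run["height"] * run["frames"]


raw_speedup = m1_max["seconds"] / evo_x2["seconds"]
work_ratio = pixel_frames(m1_max) / pixel_frames(evo_x2)
normalized_speedup = raw_speedup / work_ratio

print(f"raw speedup:        {raw_speedup:.2f}x")
print(f"work ratio (M1/EVO): {work_ratio:.2f}x")
print(f"normalized speedup: {normalized_speedup:.2f}x")
```

The raw ratio is about 6.1×, and about 3.5× once the M1 Max’s larger frame is accounted for under this linear assumption. Treat the normalized figure as a rough bound rather than a measurement.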
For reference, FastWan’s official VSA-on benchmark is ~5s for 81 frames on an H200 and ~21s on a 4090. EVO-X2’s DiT-only 35.5s (3 steps) is about 1.7× the 4090’s ~21s — not bad given the GPU gap. What’s slow in FastWan is the VAE decode, not the DiT.
What decides the speed
Across all three machines, the bottleneck wasn’t the model or GPU-core speed but something upstream of it.
| Machine | Bottleneck |
|---|---|
| 4060 | The 26GB model doesn’t fit in 32GB RAM and spills to the page file; DynamicSwap re-reads from disk every step. GPU utilization stays in single digits |
| M1 Max | Runs with CPU offload on 64GB shared memory, so the transfer overhead piles on |
| EVO-X2 | Full-loads into 48GB VRAM and the GPU runs normally. In exchange, FastWan’s VAE decode (Conv3D is slow on ROCm) takes 86% of the time |
EVO-X2 has a clear edge over the previous two on one point: the model fits entirely in VRAM. The 48GB of headroom matters. On the other hand, AMD’s VAE-decode and attention optimization is still immature: Conv3D decode is slow, and Flash Attention (AOTriton) is experimental and sits behind an env flag. “It runs now” is where things stand.
And the fixed UMA split cuts both ways. Put 48GB into VRAM and big models load, but the CPU side shrinks to 16GB and brings Segfaults and no CPU offload. In the end you just change the split in the BIOS per workload.
References
- AMD TheRock: project that builds and ships ROCm + PyTorch
- gfx1151 PyTorch wheel: pip install --index-url https://repo.amd.com/rocm/whl/gfx1151/ torch
- FastVideo / FastWan: the upstream FastWan framework
- ZLUDA issue #626: the PTX incompatibility in official PyTorch wheels