The Reason Qwen 3.5 Failed on Radeon 8060S Was an Outdated AMD Driver
If Qwen 3.5 isn’t working on EVO-X2 (Strix Halo / Radeon 8060S), update your AMD driver. The pre-installed AMD Software has no auto-update. Download the latest from AMD’s official site and get to Adrenalin 26.2.2 or later — Vulkan GPU inference will work correctly. What follows is the investigation that led to this conclusion.
In the previous article, Qwen 3.5 completely failed on Radeon 8060S (ROCm / gfx1151). ROCm produced garbage tokens, Vulkan crashed, LM Studio crashed too. Mac (Metal) worked fine, so the conclusion was a backend problem, not the model itself.
This time, isolating via CPU inference and llama-server (native llama.cpp) eventually led to an AMD driver update resolving everything.
Environment
| Item | Spec |
|---|---|
| PC | GMKtec EVO-X2 |
| CPU | AMD Ryzen AI Max+ 395 (Zen 5 / 16C 32T) |
| GPU | Radeon 8060S |
| Memory | 64GB (unified memory) |
| OS | Windows 11 |
| llama.cpp | b8183 (2026-03-01) |
| LM Studio | 0.4.6 |
The VRAM/system memory split is configured in the BIOS. Physically it is the same DRAM in unified memory, but the OS treats it as separate system memory and VRAM. Both 48GB/16GB and 32GB/32GB splits were tested during this investigation.
Setting Up llama-server
Ollama and LM Studio both run llama.cpp internally, but with their own wrappers and forks. Using llama-server, llama.cpp’s direct HTTP server, lets you isolate backend behavior.
Downloaded pre-built binaries from llama.cpp GitHub Releases. Used b8183 (released 2026-03-01).
| Binary | Purpose |
|---|---|
| llama-b8183-bin-win-cpu-x64.zip | CPU inference |
| llama-b8183-bin-win-hip-radeon-x64.zip | ROCm (HIP) GPU inference |
| llama-b8183-bin-win-vulkan-x64.zip | Vulkan GPU inference |
Model: Qwen3.5-35B-A3B-UD-IQ2_XXS.gguf (9.76GB / 2.25 BPW) from unsloth/Qwen3.5-35B-A3B-GGUF. Chose the lightest quantization because it needed to fit in 16GB system memory.
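The relationship between parameter count and file size is simple arithmetic. A quick sanity-check sketch, assuming ~35e9 total parameters for Qwen3.5-35B-A3B and ignoring quantization metadata overhead, so the figure is approximate:

```python
# Rough GGUF file-size estimate from parameter count and bits-per-weight (BPW).
# Assumption: ~35e9 total parameters; metadata/overhead ignored.

def gguf_size_gb(n_params: float, bpw: float) -> float:
    """Approximate model file size in decimal GB for a given BPW."""
    return n_params * bpw / 8 / 1e9

iq2_xxs = gguf_size_gb(35e9, 2.25)  # the lightest Unsloth dynamic quant
print(f"IQ2_XXS estimate: {iq2_xxs:.2f} GB")  # in the ballpark of the 9.76GB file
```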
CPU Inference: Works Fine
C:\llama\llama-server.exe -m "Qwen3.5-35B-A3B-UD-IQ2_XXS.gguf" --port 8080 --ctx-size 512 --reasoning-budget 0
Confirmed backend is ggml-cpu-zen4.dll in startup log. --ctx-size 512 minimizes KV Cache memory. --reasoning-budget 0 disables thinking.
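llama-server also exposes an OpenAI-compatible HTTP API alongside its WebUI, so the same smoke test can be scripted. A minimal client sketch; the `/v1/chat/completions` path is llama-server's standard OpenAI-compatible endpoint, and the URL assumes the `--port 8080` used above:

```python
# Build a chat request for llama-server's OpenAI-compatible endpoint.
import json
import urllib.request

def build_request(prompt: str, base_url: str = "http://localhost:8080"):
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 64,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

req = build_request("こんちわ")
# Sending it requires the server to be running:
# with urllib.request.urlopen(req) as r:
#     print(json.load(r)["choices"][0]["message"]["content"])
print(req.full_url)
```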
Sent “こんちわ” (“hi”) from the WebUI at http://localhost:8080:
こんにちは！何かお手伝できることはありますか？ (“Hello! Is there anything I can help you with?”)
Worked normally. 11 tokens / 0.8 seconds / 14.38 t/s. Zero garbage tokens.
| Metric | Value |
|---|---|
| Eval | 14.38 t/s |
| CPU usage | ~50% (16 threads on 16C/32T) |
| GPU usage | 0% |
| Memory | 14GB system memory |
MoE with only 3B active parameters means the actual computation is equivalent to a 3B model. Zen 5’s AVX-512 (VBMI / VNNI / BF16) fires on all cylinders, and MoE plays well with CPU inference.
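The "computes like a 3B model" claim follows from a rough rule of thumb: per-token decode cost scales with *active* parameters, at roughly 2 FLOPs per active parameter per token. A back-of-envelope sketch (illustrative, not measured):

```python
# Rough per-token decode cost: ~2 FLOPs per ACTIVE parameter.
# For a MoE model, only the routed experts' parameters count per token.

def flops_per_token(active_params: float) -> float:
    return 2 * active_params

dense_35b = flops_per_token(35e9)  # hypothetical dense 35B
moe_a3b = flops_per_token(3e9)     # ~3B active of 35B total
print(f"compute ratio: {dense_35b / moe_a3b:.1f}x cheaper per token")
```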
Qwen 3.5’s Architecture Turned Out to Be Surprisingly Unusual
Startup log metadata reveals Qwen 3.5’s internal structure. The previous article guessed “GQA and RoPE can’t be processed by ROCm,” but the reality is more complex.
| Parameter | Value |
|---|---|
| Architecture | qwen35moe (Attention + SSM hybrid) |
| Expert count | 256 (8 active) |
| full_attention_interval | 4 (full Attention only every 4 layers) |
| SSM | d_conv=4, d_state=128, d_inner=4096 |
| RoPE | type=40 (mrope), sections=[11, 11, 10, 0] |
| KV buffer | 10MB (for Attention, 10 layers) |
| RS buffer | 251MB (SSM recurrent state, 40 layers) |
Only 10 of 40 layers are Attention; the remaining 30 are SSM (Mamba-type). The recurrent state is 25x larger than the KV Cache. RoPE is mrope (multi-dimensional RoPE) rather than the standard kind, on top of a 256-expert MoE. Qwen 2.5 runs fine on GPU in the same environment, so at this point the unusual architecture looked like the cause of the GPU backend incompatibility.
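The hybrid layout also changes how memory scales: KV cache grows linearly with context length, while an SSM layer's recurrent state is fixed-size. A sketch under stated assumptions: `d_state`/`d_inner`/`d_conv` come from the startup log above, but the KV head count and head dim are guesses chosen to reproduce the reported 10MB KV buffer, and the reported 251MB RS buffer is larger than this minimal estimate (likely extra precision and bookkeeping), so only the scaling behavior is the point:

```python
BYTES = 2  # assume f16 cache entries

def kv_bytes(ctx: int, n_layers: int, n_kv_heads: int = 4, head_dim: int = 128) -> int:
    # K and V per attention layer; grows linearly with context length
    return 2 * ctx * n_kv_heads * head_dim * BYTES * n_layers

def ssm_state_bytes(n_layers: int, d_inner: int = 4096, d_state: int = 128,
                    d_conv: int = 4) -> int:
    # recurrent state + conv state per SSM layer; independent of context
    per_layer = (d_inner * d_state + d_inner * (d_conv - 1)) * BYTES
    return per_layer * n_layers

print(kv_bytes(512, 10) / 2**20)     # MiB at ctx 512 (matches the 10MB log line)
print(kv_bytes(32768, 10) / 2**20)   # 64x larger at ctx 32768
print(ssm_state_bytes(30) / 2**20)   # constant, whatever the context length
```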
HIP Build: Doesn’t Recognize APU’s GPU
Tried GPU inference with the HIP build (ROCm) — to determine whether the garbage tokens in Ollama were Ollama fork-specific or a llama.cpp ROCm kernel issue.
load_backend: loaded CPU backend from C:\llama-hip\ggml-cpu-zen4.dll
llama_params_fit_impl: no devices with dedicated memory found
HIP/ROCm backend doesn’t load. no devices with dedicated memory found — Ryzen AI Max+ 395 is an APU with unified memory, so it’s not recognized as “a device with dedicated memory.” Native llama.cpp’s HIP build doesn’t account for APU unified memory.
Result: CPU fallback. 13.73 t/s, essentially the same as the CPU version. No GPU inference testing was possible.
Ollama can recognize gfx1151 and load it onto GPU via ROCm (startup log shows library=ROCm compute=gfx1151), so APU support is an Ollama-specific implementation.
Vulkan Build: GPU Recognized, But Crashes at Inference
Found a report of successful Vulkan GPU inference with llama.cpp on the same EVO-X2, so I tried the Vulkan build too.
ggml_vulkan: 0 = AMD Radeon(TM) 8060S Graphics (AMD proprietary driver) | uma: 1
load_tensors: Vulkan0 model buffer size = 8959.58 MiB
Vulkan correctly recognized Radeon 8060S as uma: 1 (unified memory architecture). Most of the model loaded onto Vulkan0 (GPU).
Sending “こんちわ” showed prompt processing succeeding, then an immediate crash at token generation:
slot update_slots: prompt processing done, n_tokens = 14
(process exits, no timing log)
Prompt eval (encoding the input) works, but the process dies the moment eval (autoregressive token generation) begins. No error log.
LM Studio 0.4.6: Same Wall
Updated LM Studio to 0.4.6, also tried with Q4_K_M (19.71GB).
Offloading all 41 layers to GPU caused a load failure with exit code 18446744072635810000 (a negative error code displayed as unsigned 64-bit). The default 20-layer setting gets through loading and prompt processing, but token generation still dies. Exactly the same pattern as llama-server Vulkan.
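A quick way to make sense of an exit code that large: Windows renders negative process exit codes as unsigned integers, so reinterpreting the value as signed 64-bit recovers the underlying error code. The logged number looks rounded, so the low bits are not reliable; a minimal decoding sketch:

```python
# Reinterpret an unsigned 64-bit exit code as the signed value it encodes.

def to_signed64(u: int) -> int:
    return u - (1 << 64) if u >= (1 << 63) else u

code = to_signed64(18446744072635810000)  # value as logged (apparently rounded)
# The 0xC... prefix is consistent with an NTSTATUS-style failure code,
# though the rounded input makes the low bits unreliable.
print(code, hex(code & 0xFFFFFFFF))
```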
Changed BIOS Memory Split (32GB/32GB)
In a previous article, I confirmed that Strix Halo’s Vulkan driver preferentially uses shared memory (system memory side) rather than VRAM. With VRAM 48GB/system 16GB, Vulkan puts things on system memory which then overflows.
Changed BIOS to VRAM 32GB/system memory 32GB and retested.
load_tensors: offloaded 41/41 layers to GPU
load_tensors: Vulkan0 model buffer size = 19905.15 MiB
All 41 layers loaded onto GPU. Memory issue resolved. But token generation still crashes. The problem is definitively Vulkan compute kernels, not memory.
The AMD Driver Was Outdated
Found an X post showing LM Studio + Qwen 3.5 with Vulkan at 47 t/s on the same Radeon 8060S (128GB model). With the same settings, it still wouldn’t work here.
Checking the driver: the AMD Software pre-installed on EVO-X2 had no auto-update feature. Downloaded the latest AMD Software (with auto-update) from AMD’s official site and updated to Adrenalin 26.2.2 (released 2026-02-17).
こんにちは！元気ですか？今日はどんな一日でしたか？ (“Hello! How are you? How was your day?”)
It worked. Vulkan GPU inference functioning normally from just a driver update.
| Metric | Value |
|---|---|
| Eval | 34.99 t/s |
| VRAM usage | 22.1GB |
| System memory | 9.2GB (OS only) |
| Shared memory | 0.6GB |
From CPU inference’s 14 t/s to 35 t/s — 2.5x improvement.
Even more important: memory management was also fixed. With the old driver, Vulkan was preferentially using shared memory (system memory side), filling system memory while VRAM sat mostly empty. With the new driver, the model correctly sits in VRAM, with only 9.2GB of system memory for OS. Shared memory is down to 0.6GB.
The previous article’s finding that “reducing VRAM worked better” was probably just a workaround for the old driver’s shared-memory priority bug.
Trying Q6_K
Reset BIOS to VRAM 48GB/system 16GB, retested with Q6_K (lmstudio-community/Qwen3.5-35B-A3B-GGUF):
| Metric | Q4_K_M | Q6_K |
|---|---|---|
| Eval | 34.99 t/s | 41.22 t/s |
| VRAM usage | 22.1GB | 22.2GB |
Q6_K is faster than Q4_K_M. Q6_K’s simpler dequantization may have better compatibility with Vulkan compute shaders. 48GB VRAM with only ~22GB used leaves plenty of room.
Q8_0 was also tried, but the transfer to VRAM overflowed shared memory and the model couldn’t load. Q6_K is the practical upper limit for this environment (64GB / VRAM 48GB).
For llama-server in the 48GB/16GB config, loading Q6_K overflows system memory, so I retested with the BIOS at 32GB/32GB: 35.15 t/s. After the load completes, 20GB+ of system memory is free, so the KV Cache can be placed in system memory for large context lengths.
Abliterated Version Also Works
Tested the abliterated version (mradermacher/Huihui-Qwen3.5-35B-A3B-abliterated-GGUF, Q4_K_M) that completely failed in the previous article, via llama-server Vulkan:
C:\llama-vulkan\llama-server.exe -m "Huihui-Qwen3.5-35B-A3B-abliterated.Q4_K_M.gguf" --port 8080 --ctx-size 4096 --reasoning-budget 0 --n-gpu-layers 99
こんにちは！元気にお過ごしですか？
何かお手伝いできることがありましたら、いつでもお気軽にお声がけください。 (“Hello! How have you been? If there’s anything I can help with, please feel free to ask anytime.”)
47.57 t/s. The abliterated + Radeon 8060S combination that completely failed before now works fully after just a driver update. All 41 layers loaded onto Vulkan0, with a clean `graph splits = 2` configuration.
Also tested Q6_K (mradermacher/Huihui-Qwen3.5-35B-A3B-abliterated-GGUF) with --no-mmap. VRAM 27.9GB, system memory 9GB, shared 0.5GB — fully in VRAM, 53.83 t/s. Well above the official Q6_K’s 41 t/s (LM Studio).
--no-mmap Is Essential for Unified Memory APUs
The driver update fixed the Vulkan compute kernels, but one problem remained with mmap enabled (the default): system memory comes under pressure. In the VRAM 48GB/system 16GB config, even with the model (Q4_K_M / ~20GB) in VRAM, system memory runs nearly 100%.
The cause is mmap double-mapping. The GGUF file is memory-mapped into system memory (~20GB), then Vulkan copies it to VRAM. Physically the same DRAM in unified memory, but it’s double-allocated in OS address space.
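The double-allocation is easy to see with peak-memory arithmetic. A sketch with illustrative sizes (Q4_K_M ≈ 20GB, overheads ignored), assuming the whole file stays mapped while the VRAM copy exists:

```python
# Peak DRAM usage when loading a GGUF on a unified-memory APU.
# With mmap (default), the file is mapped into system memory AND copied
# into the VRAM buffer; with --no-mmap it is read straight into VRAM.

def peak_usage_gb(model_gb: float, use_mmap: bool) -> dict:
    return {
        "system_mapped": model_gb if use_mmap else 0.0,
        "vram": model_gb,
        # unified memory: both allocations live on the same DRAM chips
        "total_dram": (2 * model_gb) if use_mmap else model_gb,
    }

print(peak_usage_gb(20.0, use_mmap=True))   # ~40GB touched for a 20GB model
print(peak_usage_gb(20.0, use_mmap=False))  # ~20GB, VRAM only
```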
With --no-mmap, the system memory mapping is removed and the model loads directly into the VRAM buffer.
C:\llama-vulkan\llama-server.exe -m "model.gguf" --port 8080 --ctx-size 4096 --reasoning-budget 0 --n-gpu-layers 99 --no-mmap
| Metric | mmap enabled (default) | --no-mmap |
|---|---|---|
| System memory | 28GB (pressured) | 8.8GB |
| VRAM | 21GB | 21GB |
| Shared memory | 5.6GB | 0.4GB |
| Eval speed | 47.57 t/s | 49.18 t/s |
--no-mmap drops system memory to 8.8GB (OS only). Speed also nudges up from 47 to 49 t/s. Shared memory down to 0.4GB.
Even with VRAM 48GB/system 16GB BIOS settings, --no-mmap leaves 7GB+ of the 16GB free. Always use --no-mmap with llama-server on unified memory APUs.
Load-Time Transfer Buffer Remains with --no-mmap
--no-mmap prevents double-mapping during inference, but loading a model still routes through system memory as a transfer buffer. With VRAM 48GB/system 16GB config, loading Q8_0 (35GB) or Q6_K (28GB) in llama-server causes ErrorOutOfDeviceMemory because the transfer buffer exceeds the 16GB system memory limit.
Switching to VRAM 32GB/system 32GB gives enough system memory margin for Q6_K to load, but Q8_0 won’t fit in 32GB VRAM so it can’t load in any config with llama-server.
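The constraints observed above reduce to two checks: the model must fit in VRAM, and loading (per the llama-server behavior seen here) must stage the full file through system memory. A feasibility sketch; the "full file as transfer buffer" rule and the ignored OS overhead are assumptions drawn from these specific tests, not a general law:

```python
# Can a given quant load under a given BIOS memory split (llama-server,
# --no-mmap)? Two observed constraints: fits in VRAM, and the load-time
# transfer buffer (~model size) fits in system memory. OS overhead ignored.

def can_load(model_gb: float, vram_gb: float, sysmem_gb: float) -> bool:
    fits_vram = model_gb <= vram_gb
    fits_staging = model_gb <= sysmem_gb
    return fits_vram and fits_staging

print(can_load(35.0, 48.0, 16.0))  # Q8_0 at 48/16: staging overflows
print(can_load(28.0, 32.0, 32.0))  # Q6_K at 32/32: loads
print(can_load(35.0, 32.0, 32.0))  # Q8_0 at 32/32: no VRAM fit
```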
LM Studio can load Q6_K in the 48GB/16GB config using the same Vulkan backend, suggesting its buffer management is more efficient than llama-server’s. For large models, LM Studio has the advantage.
Correcting Previous Article’s Conclusions
The previous article concluded:
- Qwen 3.5 doesn’t work on Radeon 8060S (ROCm / gfx1151)
- Abliteration is innocent (verified on Mac)
- llama.cpp’s ROCm backend can’t handle Qwen 3.5’s architecture on gfx1151
The third point was inaccurate. More precisely:
- Radeon 8060S had an outdated AMD driver; Vulkan compute shaders couldn’t handle qwen35moe token generation
- Fixed in driver 26.2.2 (released 2026-02-17)
- Old driver also had broken memory management, placing data in shared memory instead of VRAM
The garbage tokens from Ollama’s ROCm may also be driver-related, but ROCm wasn’t re-tested after the driver update, so it’s unconfirmed.
For Anyone Who Bought a GMKtec EVO-X2
The AMD Software pre-installed on EVO-X2 has no auto-update feature. Without manually updating the driver, you’re stuck with the outdated version from the factory.
Download the latest AMD Software from AMD’s official site and install it — you’ll switch to a version with auto-update. RDNA 3.5 (gfx1151) is a new architecture with driver maturity still improving, so staying on the latest driver is strongly recommended.
Local LLM with a specific model not working, crashes on Vulkan, memory management behaving oddly — first check the driver.
Test Summary
| Backend | Old driver | Driver 26.2.2 |
|---|---|---|
| Vulkan (LM Studio / Q4_K_M) | Eval crash | Normal (35 t/s) |
| Vulkan (LM Studio / Q6_K) | Eval crash | Normal (41 t/s) |
| Vulkan (llama-server / Q6_K) | Eval crash | Normal (35 t/s) |
| Vulkan (llama-server / abliterated Q4_K_M) | Eval crash | Normal (49 t/s) |
| Vulkan (llama-server / abliterated Q6_K) | Eval crash | Normal (54 t/s) |
| ROCm (Ollama) | Garbage tokens | Untested |
| CPU (llama-server) | Normal (14 t/s) | Normal (14 t/s) |
To belabor the point: update your driver.