The Reason Qwen 3.5 Failed on Radeon 8060S Was an Outdated AMD Driver
If Qwen 3.5 isn’t working on EVO-X2 (Strix Halo / Radeon 8060S), update your AMD driver. The pre-installed AMD Software has no auto-update. Download the latest from AMD’s official site and get to Adrenalin 26.2.2 or later — Vulkan GPU inference will work correctly. What follows is the investigation that led to this conclusion.
In the previous article, Qwen 3.5 completely failed on Radeon 8060S (ROCm / gfx1151). ROCm produced garbage tokens, Vulkan crashed, LM Studio crashed too. Mac (Metal) worked fine, so the conclusion was a backend problem, not the model itself.
This time, isolating via CPU inference and llama-server (native llama.cpp) eventually led to an AMD driver update resolving everything.
Environment
| Item | Spec |
|---|---|
| PC | GMKtec EVO-X2 |
| CPU | AMD Ryzen AI Max+ 395 (Zen 5 / 16C 32T) |
| GPU | Radeon 8060S |
| Memory | 64GB (unified memory) |
| OS | Windows 11 |
| llama.cpp | b8183 (2026-03-01) |
| LM Studio | 0.4.6 |
The VRAM/system memory split is configured in the BIOS. Physically it is the same DRAM in unified memory, but the OS treats it as separate system memory and VRAM. Both 48GB/16GB and 32GB/32GB splits were tested during this investigation.
Setting Up llama-server
Ollama and LM Studio both run llama.cpp internally, but with their own wrappers and forks. Using llama-server, llama.cpp’s direct HTTP server, lets you isolate backend behavior.
Downloaded pre-built binaries from llama.cpp GitHub Releases. Used b8183 (released 2026-03-01).
| Binary | Purpose |
|---|---|
| llama-b8183-bin-win-cpu-x64.zip | CPU inference |
| llama-b8183-bin-win-hip-radeon-x64.zip | ROCm (HIP) GPU inference |
| llama-b8183-bin-win-vulkan-x64.zip | Vulkan GPU inference |
Model: Qwen3.5-35B-A3B-UD-IQ2_XXS.gguf (9.76GB / 2.25 BPW) from unsloth/Qwen3.5-35B-A3B-GGUF. Chose the lightest quantization because it needed to fit in 16GB system memory.
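The relationship between parameter count and file size is simple arithmetic. A quick sanity-check sketch, assuming ~35e9 total parameters for Qwen3.5-35B-A3B and ignoring quantization metadata overhead, so the figure is approximate:

```python
# Rough GGUF file-size estimate from parameter count and bits-per-weight (BPW).
# Assumption: ~35e9 total parameters; metadata/overhead ignored.

def gguf_size_gb(n_params: float, bpw: float) -> float:
    """Approximate model file size in decimal GB for a given BPW."""
    return n_params * bpw / 8 / 1e9

iq2_xxs = gguf_size_gb(35e9, 2.25)  # the lightest Unsloth dynamic quant
print(f"IQ2_XXS estimate: {iq2_xxs:.2f} GB")  # in the ballpark of the 9.76GB file
```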
CPU Inference: Works Fine
C:\llama\llama-server.exe -m "Qwen3.5-35B-A3B-UD-IQ2_XXS.gguf" --port 8080 --ctx-size 512 --reasoning-budget 0
Confirmed backend is ggml-cpu-zen4.dll in startup log. --ctx-size 512 minimizes KV Cache memory. --reasoning-budget 0 disables thinking.
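llama-server also exposes an OpenAI-compatible HTTP API alongside its WebUI, so the same smoke test can be scripted. A minimal client sketch; the `/v1/chat/completions` path is llama-server's standard OpenAI-compatible endpoint, and the URL assumes the `--port 8080` used above:

```python
# Build a chat request for llama-server's OpenAI-compatible endpoint.
import json
import urllib.request

def build_request(prompt: str, base_url: str = "http://localhost:8080"):
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 64,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

req = build_request("こんちわ")
# Sending it requires the server to be running:
# with urllib.request.urlopen(req) as r:
#     print(json.load(r)["choices"][0]["message"]["content"])
print(req.full_url)
```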
Sent “こんちわ” (“hi”) from the WebUI at http://localhost:8080:
こんにちは！何かお手伝できることはありますか？ (“Hello! Is there anything I can help you with?”)
Worked normally. 11 tokens / 0.8 seconds / 14.38 t/s. Zero garbage tokens.
| Metric | Value |
|---|---|
| Eval | 14.38 t/s |
| CPU usage | ~50% (16 threads on 16C/32T) |
| GPU usage | 0% |
| Memory | 14GB system memory |
MoE with only 3B active parameters means the actual computation is equivalent to a 3B model. Zen 5’s AVX-512 (VBMI / VNNI / BF16) fires on all cylinders, and MoE plays well with CPU inference.
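The "computes like a 3B model" claim follows from a rough rule of thumb: per-token decode cost scales with *active* parameters, at roughly 2 FLOPs per active parameter per token. A back-of-envelope sketch (illustrative, not measured):

```python
# Rough per-token decode cost: ~2 FLOPs per ACTIVE parameter.
# For a MoE model, only the routed experts' parameters count per token.

def flops_per_token(active_params: float) -> float:
    return 2 * active_params

dense_35b = flops_per_token(35e9)  # hypothetical dense 35B
moe_a3b = flops_per_token(3e9)     # ~3B active of 35B total
print(f"compute ratio: {dense_35b / moe_a3b:.1f}x cheaper per token")
```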
Qwen 3.5’s Architecture Turned Out to Be Surprisingly Unusual
Startup log metadata reveals Qwen 3.5’s internal structure. The previous article guessed “GQA and RoPE can’t be processed by ROCm,” but the reality is more complex.
| Parameter | Value |
|---|---|
| Architecture | qwen35moe (Attention + SSM hybrid) |
| Expert count | 256 (8 active) |
| full_attention_interval | 4 (full Attention only every 4 layers) |
| SSM | d_conv=4, d_state=128, d_inner=4096 |
| RoPE | type=40 (mrope), sections=[11, 11, 10, 0] |
| KV buffer | 10MB (for Attention, 10 layers) |
| RS buffer | 251MB (SSM recurrent state, 40 layers) |
Only 10 of 40 layers are Attention; the remaining 30 are SSM (Mamba-type). The recurrent state is 25x larger than the KV Cache. RoPE is mrope (multi-dimensional RoPE) rather than the standard kind, on top of a 256-expert MoE. Qwen 2.5 runs fine on GPU in the same environment, so at this point the unusual architecture looked like the cause of the GPU backend incompatibility.
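The hybrid layout also changes how memory scales: KV cache grows linearly with context length, while an SSM layer's recurrent state is fixed-size. A sketch under stated assumptions: `d_state`/`d_inner`/`d_conv` come from the startup log above, but the KV head count and head dim are guesses chosen to reproduce the reported 10MB KV buffer, and the reported 251MB RS buffer is larger than this minimal estimate (likely extra precision and bookkeeping), so only the scaling behavior is the point:

```python
BYTES = 2  # assume f16 cache entries

def kv_bytes(ctx: int, n_layers: int, n_kv_heads: int = 4, head_dim: int = 128) -> int:
    # K and V per attention layer; grows linearly with context length
    return 2 * ctx * n_kv_heads * head_dim * BYTES * n_layers

def ssm_state_bytes(n_layers: int, d_inner: int = 4096, d_state: int = 128,
                    d_conv: int = 4) -> int:
    # recurrent state + conv state per SSM layer; independent of context
    per_layer = (d_inner * d_state + d_inner * (d_conv - 1)) * BYTES
    return per_layer * n_layers

print(kv_bytes(512, 10) / 2**20)     # MiB at ctx 512 (matches the 10MB log line)
print(kv_bytes(32768, 10) / 2**20)   # 64x larger at ctx 32768
print(ssm_state_bytes(30) / 2**20)   # constant, whatever the context length
```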
HIP Build: Doesn’t Recognize APU’s GPU
Tried GPU inference with the HIP build (ROCm) — to determine whether the garbage tokens in Ollama were Ollama fork-specific or a llama.cpp ROCm kernel issue.
load_backend: loaded CPU backend from C:\llama-hip\ggml-cpu-zen4.dll
llama_params_fit_impl: no devices with dedicated memory found
HIP/ROCm backend doesn’t load. no devices with dedicated memory found — Ryzen AI Max+ 395 is an APU with unified memory, so it’s not recognized as “a device with dedicated memory.” Native llama.cpp’s HIP build doesn’t account for APU unified memory.
Result: CPU fallback. 13.73 t/s, essentially the same as the CPU version. No GPU inference testing was possible.
Ollama can recognize gfx1151 and load it onto GPU via ROCm (startup log shows library=ROCm compute=gfx1151), so APU support is an Ollama-specific implementation.
Vulkan Build: GPU Recognized, But Crashes at Inference
Found a report of successful Vulkan GPU inference with llama.cpp on the same EVO-X2, so I tried the Vulkan build too.
ggml_vulkan: 0 = AMD Radeon(TM) 8060S Graphics (AMD proprietary driver) | uma: 1
load_tensors: Vulkan0 model buffer size = 8959.58 MiB
Vulkan correctly recognized Radeon 8060S as uma: 1 (unified memory architecture). Most of the model loaded onto Vulkan0 (GPU).
Sending “こんちわ” showed prompt processing succeeding, then an immediate crash at token generation:
slot update_slots: prompt processing done, n_tokens = 14
(process exits, no timing log)
Prompt eval (encoding the input) works, but the process dies the moment eval (autoregressive token generation) begins. No error log.
LM Studio 0.4.6: Same Wall
Updated LM Studio to 0.4.6, also tried with Q4_K_M (19.71GB).
Offloading all 41 layers to GPU caused a load failure with exit code 18446744072635810000 (a negative error code displayed as unsigned 64-bit). The default 20-layer setting gets through loading and prompt processing, but token generation still dies. Exactly the same pattern as llama-server Vulkan.
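A quick way to make sense of an exit code that large: Windows renders negative process exit codes as unsigned integers, so reinterpreting the value as signed 64-bit recovers the underlying error code. The logged number looks rounded, so the low bits are not reliable; a minimal decoding sketch:

```python
# Reinterpret an unsigned 64-bit exit code as the signed value it encodes.

def to_signed64(u: int) -> int:
    return u - (1 << 64) if u >= (1 << 63) else u

code = to_signed64(18446744072635810000)  # value as logged (apparently rounded)
# The 0xC... prefix is consistent with an NTSTATUS-style failure code,
# though the rounded input makes the low bits unreliable.
print(code, hex(code & 0xFFFFFFFF))
```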
Changed BIOS Memory Split (32GB/32GB)
In a previous article, I confirmed that Strix Halo’s Vulkan driver preferentially uses shared memory (system memory side) rather than VRAM. With VRAM 48GB/system 16GB, Vulkan puts things on system memory which then overflows.
Changed BIOS to VRAM 32GB/system memory 32GB and retested.
load_tensors: offloaded 41/41 layers to GPU
load_tensors: Vulkan0 model buffer size = 19905.15 MiB
All 41 layers loaded onto GPU. Memory issue resolved. But token generation still crashes. The problem is definitively Vulkan compute kernels, not memory.
The AMD Driver Was Outdated
Found an X post showing LM Studio + Qwen 3.5 with Vulkan at 47 t/s on the same Radeon 8060S (128GB model). With the same settings, it still wouldn’t work here.
Checking the driver: the AMD Software pre-installed on EVO-X2 had no auto-update feature. Downloaded the latest AMD Software (with auto-update) from AMD’s official site and updated to Adrenalin 26.2.2 (released 2026-02-17).
こんにちは！元気ですか？今日はどんな一日でしたか？ (“Hello! How are you? How was your day?”)
It worked. Vulkan GPU inference functioning normally from just a driver update.
| Metric | Value |
|---|---|
| Eval | 34.99 t/s |
| VRAM usage | 22.1GB |
| System memory | 9.2GB (OS only) |
| Shared memory | 0.6GB |
From CPU inference’s 14 t/s to 35 t/s — 2.5x improvement.
Even more important: memory management was also fixed. With the old driver, Vulkan was preferentially using shared memory (system memory side), filling system memory while VRAM sat mostly empty. With the new driver, the model correctly sits in VRAM, with only 9.2GB of system memory for OS. Shared memory is down to 0.6GB.
The previous article’s finding that “reducing VRAM worked better” was probably just a workaround for the old driver’s shared-memory priority bug.
Trying Q6_K
Reset BIOS to VRAM 48GB/system 16GB, retested with Q6_K (lmstudio-community/Qwen3.5-35B-A3B-GGUF):
| Metric | Q4_K_M | Q6_K |
|---|---|---|
| Eval | 34.99 t/s | 41.22 t/s |
| VRAM usage | 22.1GB | 22.2GB |
Q6_K is faster than Q4_K_M. Q6_K’s simpler dequantization may have better compatibility with Vulkan compute shaders. 48GB VRAM with only ~22GB used leaves plenty of room.
Q8_0 was also tried, but the transfer to VRAM overflowed shared memory and the model couldn’t load. Q6_K is the practical upper limit for this environment (64GB / VRAM 48GB).
For llama-server in the 48GB/16GB config, loading Q6_K overflows system memory, so I retested with the BIOS at 32GB/32GB: 35.15 t/s. After the load completes, 20GB+ of system memory is free, so the KV Cache can be placed in system memory for large context lengths.
Abliterated Version Also Works
Tested the abliterated version (mradermacher/Huihui-Qwen3.5-35B-A3B-abliterated-GGUF, Q4_K_M) that completely failed in the previous article, via llama-server Vulkan:
C:\llama-vulkan\llama-server.exe -m "Huihui-Qwen3.5-35B-A3B-abliterated.Q4_K_M.gguf" --port 8080 --ctx-size 4096 --reasoning-budget 0 --n-gpu-layers 99
こんにちは！元気にお過ごしですか？
何かお手伝いできることがありましたら、いつでもお気軽にお声がけください。 (“Hello! How have you been? If there’s anything I can help with, please feel free to ask anytime.”)
47.57 t/s. The abliterated + Radeon 8060S combination that completely failed before now works fully after just a driver update. All 41 layers loaded onto Vulkan0, with a clean `graph splits = 2` configuration.
Also tested Q6_K (mradermacher/Huihui-Qwen3.5-35B-A3B-abliterated-GGUF) with --no-mmap. VRAM 27.9GB, system memory 9GB, shared 0.5GB — fully in VRAM, 53.83 t/s. Well above the official Q6_K’s 41 t/s (LM Studio).
--no-mmap Is Essential for Unified Memory APUs
The driver update fixed the Vulkan compute kernels, but one problem remained with mmap enabled (the default): system memory comes under pressure. In the VRAM 48GB/system 16GB config, even with the model (Q4_K_M / ~20GB) in VRAM, system memory runs nearly 100%.
The cause is mmap double-mapping. The GGUF file is memory-mapped into system memory (~20GB), then Vulkan copies it to VRAM. Physically the same DRAM in unified memory, but it’s double-allocated in OS address space.
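The double-allocation is easy to see with peak-memory arithmetic. A sketch with illustrative sizes (Q4_K_M ≈ 20GB, overheads ignored), assuming the whole file stays mapped while the VRAM copy exists:

```python
# Peak DRAM usage when loading a GGUF on a unified-memory APU.
# With mmap (default), the file is mapped into system memory AND copied
# into the VRAM buffer; with --no-mmap it is read straight into VRAM.

def peak_usage_gb(model_gb: float, use_mmap: bool) -> dict:
    return {
        "system_mapped": model_gb if use_mmap else 0.0,
        "vram": model_gb,
        # unified memory: both allocations live on the same DRAM chips
        "total_dram": (2 * model_gb) if use_mmap else model_gb,
    }

print(peak_usage_gb(20.0, use_mmap=True))   # ~40GB touched for a 20GB model
print(peak_usage_gb(20.0, use_mmap=False))  # ~20GB, VRAM only
```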
With --no-mmap, the system memory mapping is removed and the model loads directly into the VRAM buffer.
C:\llama-vulkan\llama-server.exe -m "model.gguf" --port 8080 --ctx-size 4096 --reasoning-budget 0 --n-gpu-layers 99 --no-mmap
| Metric | mmap enabled (default) | --no-mmap |
|---|---|---|
| System memory | 28GB (pressured) | 8.8GB |
| VRAM | 21GB | 21GB |
| Shared memory | 5.6GB | 0.4GB |
| Eval speed | 47.57 t/s | 49.18 t/s |
--no-mmap drops system memory to 8.8GB (OS only). Speed also nudges up from 47 to 49 t/s. Shared memory down to 0.4GB.
Even with VRAM 48GB/system 16GB BIOS settings, --no-mmap leaves 7GB+ of the 16GB free. Always use --no-mmap with llama-server on unified memory APUs.
Load-Time Transfer Buffer Remains with --no-mmap
--no-mmap prevents double-mapping during inference, but loading a model still routes through system memory as a transfer buffer. With VRAM 48GB/system 16GB config, loading Q8_0 (35GB) or Q6_K (28GB) in llama-server causes ErrorOutOfDeviceMemory because the transfer buffer exceeds the 16GB system memory limit.
Switching to VRAM 32GB/system 32GB gives enough system memory margin for Q6_K to load, but Q8_0 won’t fit in 32GB VRAM so it can’t load in any config with llama-server.
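The constraints observed above reduce to two checks: the model must fit in VRAM, and loading (per the llama-server behavior seen here) must stage the full file through system memory. A feasibility sketch; the "full file as transfer buffer" rule and the ignored OS overhead are assumptions drawn from these specific tests, not a general law:

```python
# Can a given quant load under a given BIOS memory split (llama-server,
# --no-mmap)? Two observed constraints: fits in VRAM, and the load-time
# transfer buffer (~model size) fits in system memory. OS overhead ignored.

def can_load(model_gb: float, vram_gb: float, sysmem_gb: float) -> bool:
    fits_vram = model_gb <= vram_gb
    fits_staging = model_gb <= sysmem_gb
    return fits_vram and fits_staging

print(can_load(35.0, 48.0, 16.0))  # Q8_0 at 48/16: staging overflows
print(can_load(28.0, 32.0, 32.0))  # Q6_K at 32/32: loads
print(can_load(35.0, 32.0, 32.0))  # Q8_0 at 32/32: no VRAM fit
```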
LM Studio can load Q6_K in the 48GB/16GB config using the same Vulkan backend, suggesting its buffer management is more efficient than llama-server’s. For large models, LM Studio has the advantage.
Correcting Previous Article’s Conclusions
The previous article concluded:
- Qwen 3.5 doesn’t work on Radeon 8060S (ROCm / gfx1151)
- Abliteration is innocent (verified on Mac)
- llama.cpp’s ROCm backend can’t handle Qwen 3.5’s architecture on gfx1151
The third point was inaccurate. More precisely:
- Radeon 8060S had an outdated AMD driver; Vulkan compute shaders couldn’t handle qwen35moe token generation
- Fixed in driver 26.2.2 (released 2026-02-17)
- Old driver also had broken memory management, placing data in shared memory instead of VRAM
The garbage tokens from Ollama’s ROCm may also be driver-related, but ROCm wasn’t re-tested after the driver update, so it’s unconfirmed.
For Anyone Who Bought a GMKtec EVO-X2
The AMD Software pre-installed on EVO-X2 has no auto-update feature. Without manually updating the driver, you’re stuck with the outdated version from the factory.
Download the latest AMD Software from AMD’s official site and install it — you’ll switch to a version with auto-update. RDNA 3.5 (gfx1151) is a new architecture with driver maturity still improving, so staying on the latest driver is strongly recommended.
Local LLM with a specific model not working, crashes on Vulkan, memory management behaving oddly — first check the driver.
Test Summary
| Backend | Old driver | Driver 26.2.2 |
|---|---|---|
| Vulkan (LM Studio / Q4_K_M) | Eval crash | Normal (35 t/s) |
| Vulkan (LM Studio / Q6_K) | Eval crash | Normal (41 t/s) |
| Vulkan (llama-server / Q6_K) | Eval crash | Normal (35 t/s) |
| Vulkan (llama-server / abliterated Q4_K_M) | Eval crash | Normal (49 t/s) |
| Vulkan (llama-server / abliterated Q6_K) | Eval crash | Normal (54 t/s) |
| ROCm (Ollama) | Garbage tokens | Untested |
| CPU (llama-server) | Normal (14 t/s) | Normal (14 t/s) |
To belabor the point: update your driver.