Radeon 8060S (gfx1151) Vulkan Broke Again After AMD Driver Update
Last month, updating the AMD driver to 26.2.2 finally got Vulkan GPU inference working properly. The abliterated Qwen 3.5 hit 54 t/s, the VRAM shared memory priority issue was resolved, and I thought the EVO-X2’s GPU was finally usable.
Then I updated to AMD Software 26.3.1, and it broke again. Loading a model via Vulkan reports “success,” but in reality the data never lands in VRAM and everything falls back to CPU processing through system RAM. The same model with the same settings that got 34-54 t/s on driver 26.2.2 degraded to 4-9 t/s. Further investigation revealed that repeated Vulkan load failures leak device memory, and it never recovers until you reboot. Right after a fresh reboot you get 50+ t/s, so there’s almost certainly a bug in the driver’s memory management. Ultimately, changing the BIOS VRAM allocation from 48GB/16GB to 32GB/32GB got Q6_K running stably at 53.7 t/s.
Environment
| Item | Value |
|---|---|
| PC | GMKtec EVO-X2 (NucBox_EVO-X2) |
| CPU | AMD Ryzen AI Max+ 395 (Zen 5 / 16C 32T) |
| GPU | AMD Radeon 8060S (gfx1151, RDNA 3.5) |
| Memory | 64GB UMA (System 16GB / VRAM 48GB) |
| OS | Windows 11 Pro 10.0.26200 |
| AMD Driver | 32.0.23033.1002 (built: 2026-03-09) |
| AMD Software | 26.3.1 (installed: 2026-03-22) |
| LM Studio | 0.4.8 |
| llama-server | b8183 (66d65ec29) |
The Ryzen AI Max+ 395 is an APU with CPU and GPU on the same die, sharing memory in a UMA (Unified Memory Architecture) configuration. The 64GB of physical memory is shared between system and VRAM, with BIOS set to allocate 48GB to VRAM. UMA details and VRAM allocation optimization are covered in a previous article. The “VRAM gets placed in shared memory instead of dedicated” issue reported there was fixed in driver 26.2.2, but it’s back in 26.3.1. Unlike discrete GPUs, the CPU and GPU share the same memory bus, so when the Vulkan driver’s memory management goes wrong, the impact is severe.
Symptoms
1. GPU Not Used (CPU Fallback)
Loading a model with the Vulkan backend reports success, but it actually runs on CPU through system RAM instead of GPU compute.
| Test | Speed (26.3.1) | Measured on 26.2.2 |
|---|---|---|
| b8183 Vulkan Q4_K_XL (20.7GB) | 9.5 t/s | 35 t/s |
| b8183 Vulkan Q6_K + --no-direct-io | 4.4 t/s | 41 t/s |
| b8183 CPU IQ2_XXS (9.8GB) | 13.4 t/s | 14.4 t/s |
In the previous benchmarks with driver 26.2.2, Q4_K_M hit 35 t/s, Q6_K hit 41 t/s, and the abliterated Q6_K went up to 54 t/s. Now through Vulkan it’s slower than pure CPU inference. The model data is sitting in system RAM instead of VRAM, and GPU kernels are reading from system RAM (or not running at all). Main memory usage is abnormally high, confirming that data isn’t in VRAM.
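Whether a load actually landed in VRAM can be checked from llama-server's buffer log lines instead of waiting for a slow benchmark. Below is a small sketch that parses those lines; the log format is assumed from the `Vulkan0 model buffer size = 19905.15 MiB` lines quoted later in this article, so the regex may need adjusting for your build.

```python
import re

# Parse llama-server stderr for backend buffer sizes and flag loads where
# most of the model did not land in device (Vulkan0) memory.
# Log format assumed from lines quoted in this article; adjust as needed.
BUF_RE = re.compile(r"(\w+) model buffer size\s*=\s*([\d.]+)\s*MiB")

def buffer_placement(log_text: str) -> dict:
    """Return {backend_name: MiB} for every model buffer line found."""
    return {m.group(1): float(m.group(2)) for m in BUF_RE.finditer(log_text)}

def looks_like_cpu_fallback(placement: dict, threshold: float = 0.5) -> bool:
    """True if less than `threshold` of the model ended up in Vulkan0 VRAM."""
    total = sum(placement.values())
    if total == 0:
        return True  # nothing parsed: treat as suspicious
    return placement.get("Vulkan0", 0.0) / total < threshold

log = """
llm_load_tensors: Vulkan0 model buffer size = 19905.15 MiB
llm_load_tensors: Vulkan_Host model buffer size = 272.00 MiB
"""
p = buffer_placement(log)
print(p["Vulkan0"], looks_like_cpu_fallback(p))  # 19905.15 False
```

Run against a degraded session, the Vulkan0 share drops toward zero, which matches the "reports success but runs on CPU" symptom above.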
2. ErrorOutOfDeviceMemory
Vulkan reports 54,305 MiB free, yet can’t allocate the 26.8GB Q6_K model.
vk::Queue::submit: ErrorOutOfDeviceMemory
VK_ERROR_OUT_OF_DEVICE_MEMORY is a Vulkan API error code meaning device (GPU) memory allocation failed. Getting this error with plenty of free space means the driver is either mismanaging memory pools, or mapping memory types incorrectly in the UMA environment.
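One way an application can survive this error is to fall back to a host-visible memory type instead of failing the load outright; the Vulkan_Host buffers seen later in this article suggest ggml's Vulkan backend does something similar. The sketch below is purely illustrative, with a fake allocator standing in for real Vulkan calls.

```python
# Device-first allocation with a host-memory fallback. `fake_device_alloc`
# simulates the broken driver; MemoryError stands in for
# VK_ERROR_OUT_OF_DEVICE_MEMORY. Not real Vulkan API calls.
def allocate_with_fallback(size, device_alloc, host_alloc):
    """Try device-local memory first, then host-visible as a fallback."""
    try:
        return ("device", device_alloc(size))
    except MemoryError:                       # the Vulkan error, simulated
        return ("host", host_alloc(size))     # slower path via system RAM

def fake_device_alloc(size):
    raise MemoryError("ErrorOutOfDeviceMemory")

kind, buf = allocate_with_fallback(1024, fake_device_alloc, bytearray)
print(kind, len(buf))  # host 1024
```

The fallback keeps the load alive, but as the benchmarks here show, "alive in system RAM" can be slower than pure CPU inference.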
3. Garbage Output (mmap=true)
Loading Q4_K_XL with mmap=true produces completely garbled model output.
说完ak觉得 ,一ば った跟不会 . sieme 只言却 ...
A meaningless jumble of Chinese, Japanese, and English. This is a classic symptom of model weight data not being correctly transferred to VRAM (or getting corrupted during transfer). The driver is likely miscalculating addresses or DMA transfers when copying mmap’d host memory data to Vulkan buffers.
Setting mmap=false fixes the output but severely degrades speed. In the previous benchmarks, using --no-mmap reduced system memory usage from 28GB to 8.8GB and slightly improved speed, so it’s natural to conclude the mmap-related driver implementation regressed in 26.3.1.
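The two load paths behind that toggle can be sketched in a few lines. This is a simplified illustration of what mmap=true versus mmap=false means for where the bytes live (file-backed pages versus a private copy in system RAM), not llama.cpp's actual loader.

```python
import mmap, os, tempfile

# mmap=true: map the GGUF file into the address space; pages are faulted in
# on demand and stay file-backed. mmap=false: read a private copy into RAM.
# The real loader hands these bytes on to Vulkan staging buffers.
def load_mmap(path):
    f = os.open(path, os.O_RDONLY)
    try:
        return mmap.mmap(f, 0, access=mmap.ACCESS_READ)  # file-backed pages
    finally:
        os.close(f)  # mmap holds its own duplicate of the descriptor

def load_copy(path):
    with open(path, "rb") as f:
        return f.read()  # private copy in regular system RAM

with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"GGUF" + b"\x00" * 1020)  # stand-in for a model file

mapped = load_mmap(tmp.name)
copied = load_copy(tmp.name)
print(mapped[:4] == copied[:4] == b"GGUF")  # True: same bytes either way
mapped.close()
os.remove(tmp.name)
```

Both paths deliver identical bytes to the application, which is why garbage output with mmap=true points at the driver's host-to-device transfer of the mapped pages rather than the file contents.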
4. Latest Build (b8560) Makes Things Worse
Updating to llama.cpp’s latest build b8560 made things even worse.
ggml_vulkan: Device memory allocation of size 907872256 failed.
ggml_vulkan: vk::Device::allocateMemory: ErrorOutOfDeviceMemory
Can’t even allocate 0.9GB (~907MB). On b8183, 20GB-class models could load (slowly), but on b8560 even gigabyte-scale allocations fail. Could be a change in llama.cpp’s Vulkan memory allocation strategy, but it’s compounding with the driver-side issue.
5. Device Memory Leak
While repeatedly testing the above symptoms, another nasty problem surfaced. Repeated Vulkan load failures leak device memory with no recovery.
Running the same test right after a reboot reproduces the article’s benchmark performance (50+ t/s). But after a few load failures or ErrorOutOfDeviceMemory errors, no model can properly allocate VRAM anymore. Even after killing LM Studio or llama-server processes, VRAM usage doesn’t return to normal.
So the speed degradation and allocation errors above aren’t “always happening” — they’re “once it fails, the whole session is poisoned.” The driver fails to release Vulkan resources and never returns the allocated device memory. In a UMA environment, VRAM leaks directly impact system-wide memory availability since VRAM and system RAM share the same physical memory pool.
This discovery changes the interpretation of the test results. The 9.5 t/s and 4.4 t/s numbers from earlier were likely measured in a state with accumulated memory leaks. In a clean state right after reboot, the numbers might be somewhat better. Still unstable either way, though.
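A toy model makes the failure mode concrete: free-memory reporting and actual allocation can draw on different bookkeeping once leaked blocks accumulate. This is purely illustrative and does not reflect the AMD driver's real internal structure.

```python
class LeakyPool:
    """Toy model of the observed mismatch: the driver *reports* free memory
    from live allocations only, but real allocations also compete with
    blocks leaked by earlier failed loads. Illustrative, not the actual
    driver design."""
    def __init__(self, heap_mib):
        self.heap_mib = heap_mib
        self.leaked_mib = 0   # blocks never returned after failed loads
        self.live_mib = 0

    def reported_free(self):
        # What a free-memory query shows: heap minus live allocations only.
        return self.heap_mib - self.live_mib

    def allocate(self, mib):
        # Actual allocatable space is also reduced by leaked blocks.
        if self.live_mib + self.leaked_mib + mib > self.heap_mib:
            return "ErrorOutOfDeviceMemory"
        self.live_mib += mib
        return "ok"

    def failed_load(self, mib):
        # A failed load leaks its partial allocations (the observed bug).
        self.leaked_mib += mib

pool = LeakyPool(54305)
pool.failed_load(30000)          # failed Q6_K attempts leak ~30GB
print(pool.reported_free())      # 54305 -- still reports everything free
print(pool.allocate(27000))      # ErrorOutOfDeviceMemory
```

This reproduces both observations at once: plenty of "free" memory on paper, and allocations that keep failing until a reboot zeroes the leaked pool.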
Timeline
graph TD
A["3/1: Driver 26.2.2<br/>Q6_K Vulkan 34-48 t/s<br/>Working normally"] --> B["3/9: Current driver build date"]
B --> C["3/13: AMD chipset driver update"]
C --> D["3/16: Q6_K load attempt<br/>ErrorOutOfDeviceMemory"]
D --> E["3/22: AMD Software 26.3.1<br/>installed"]
E --> F["3/23: Q4_K_XL mmap=false<br/>19.5 t/s works but slow"]
F --> G["3/27: LM Studio 0.4.8 update"]
G --> H["3/28: Q4_K_XL mmap=true<br/>22.6 t/s but garbage output"]
H --> I["3/28: Q5_K_XL / Q6_K<br/>ErrorOutOfDeviceMemory"]
I --> J["3/28: Memory leak discovered<br/>Reboot restores 50+ t/s"]
style A fill:#4CAF50,color:#fff
style D fill:#FF9800,color:#fff
style H fill:#f44336,color:#fff
style I fill:#f44336,color:#fff
style J fill:#2196F3,color:#fff
| Date | Event | Vulkan Performance |
|---|---|---|
| 3/1 | Article written. Driver 26.2.2 | 34-48 t/s, working normally |
| 3/9 | Current driver build date | - |
| 3/13 | AMD chipset driver update | - |
| 3/16 | Q6_K load attempt | ErrorOutOfDeviceMemory |
| 3/22 | AMD Software 26.3.1 installed | - |
| 3/23 | Q4_K_XL load (mmap=false) | 19.5 t/s (works but halved) |
| 3/27 | LM Studio 0.4.8 update | - |
| 3/28 | Q4_K_XL (mmap=true) | 22.6 t/s but garbage output |
| 3/28 | Q5_K_XL / Q6_K | ErrorOutOfDeviceMemory |
| 3/28 | Memory leak discovered | Reboot restores 50+ t/s |
A few notable points.
Q6_K was already failing with ErrorOutOfDeviceMemory on 3/16. The chipset driver update on 3/13 might have been the trigger. AMD Software 26.3.1 wasn’t installed until 3/22, so the problem predates that.
On 3/23, Q4_K_XL ran at 19.5 t/s, but the same model was doing 35 t/s in early March. It looked like it was “working,” but performance had already halved. VRAM placement was probably only partially functional. Given the memory leak issue, leaks had likely already accumulated by this point.
On 3/28 I confirmed the memory leak. Since 50+ t/s comes back right after a reboot, the driver’s memory management does work correctly in its initial state. Leaks start after load failures or errors and accumulate throughout the session.
Isolating the Problem
Here’s the evidence for whether this is a driver issue or a llama.cpp issue.
Evidence Pointing to Driver
- Same llama.cpp build (b8183) and same models worked fine on driver 26.2.2 in early March. Measured values of 35 t/s for Q4_K_M, 41 t/s for Q6_K, and 54 t/s for abliterated Q6_K are on record
- The shared memory priority issue fixed in driver 26.2.2 has resurfaced with the same symptoms
- `ErrorOutOfDeviceMemory` with plenty of free VRAM (driver memory management issue)
- Data corruption with `mmap=true` (host-device memory transfer issue). In the previous benchmarks, `--no-mmap` could avoid double-mapping system memory, but this time things are broken at an earlier stage
- 50+ t/s right after reboot. Degrades as leaks accumulate during the session (driver resource release issue)
- gfx1151 is a new RDNA 3.5 chip with immature driver support
Possible llama.cpp Contribution
- Things got worse on b8560 (possible Vulkan memory allocation strategy change)
- But b8183 was already degraded, so it’s not the root cause
UMA-Specific Issues
- UMA environments handle Vulkan memory types (`DEVICE_LOCAL`, `HOST_VISIBLE`, etc.) differently from discrete GPUs
- The driver may not be correctly advertising `DEVICE_LOCAL` memory in the UMA environment, or selecting the wrong memory type during allocation
- Driver issues in this UMA environment have been a consistent problem since the initial EVO-X2 setup. The “placed in shared memory instead of VRAM” issue found in the VRAM allocation article was confirmed fixed in driver 26.2.2, but the same symptoms are back in 26.3.1
- Memory leaks are especially severe in UMA because leaked VRAM also impacts system RAM availability. With a discrete GPU, a GPU memory leak leaves system RAM untouched, but with UMA it eats into the shared pool
Related Issues
The gfx1151 Vulkan problems aren’t just me — they’ve been reported across multiple projects.
| Issue | Description | Status |
|---|---|---|
| lmstudio-ai/lmstudio-bug-tracker#1048 | Vulkan v1.52.0 doesn’t use VRAM on gfx1151. 3x slower | fixed-in-next-update |
| ggml-org/llama.cpp#16832 | Vulkan UMA memory detection bug (only recognizes 32GB) | Closed (PR #17110) |
| ggml-org/llama.cpp#20354 | Missing GATED_DELTA_NET shader in Vulkan (gfx1151) | Open |
| ggml-org/llama.cpp#18741 | Strix Halo Vulkan load failure. Workaround: --no-direct-io | Closed |
| GPUOpen-Drivers/AMDVLK#413 | AMDVLK 2GB allocation limit | Open |
| LostRuins/koboldcpp#1980 | koboldcpp also can’t detect 8060S VRAM | - |
The same problems appear in koboldcpp and LM Studio, not just llama.cpp, which confirms this is a driver-side issue. The AMDVLK#413 2GB allocation limit could be directly related to the ErrorOutOfDeviceMemory seen here.
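If AMDVLK#413's ~2GB per-allocation cap is in play, the standard mitigation is to split one logical weight buffer across several device allocations, roughly what llama.cpp's Vulkan backend does with its own chunking. A minimal sketch of the splitting arithmetic, with illustrative sizes:

```python
# Split a logical buffer into chunks that each fit under a per-allocation
# cap (AMDVLK#413 reports ~2GB). Sizes below are illustrative.
MAX_ALLOC = 2 * 1024**3          # 2 GiB per allocation call

def split_allocations(total_bytes, max_alloc=MAX_ALLOC):
    """Split a logical buffer into chunks that each fit one allocation."""
    chunks = []
    remaining = total_bytes
    while remaining > 0:
        size = min(remaining, max_alloc)
        chunks.append(size)
        remaining -= size
    return chunks

model_bytes = int(26.8 * 1024**3)            # Q6_K, ~26.8 GiB
chunks = split_allocations(model_bytes)
print(len(chunks))                           # 14 allocations
print(all(c <= MAX_ALLOC for c in chunks))   # True
```

Chunking works around a per-allocation cap, but it cannot help when the pool itself is exhausted by leaks, which is why even the sub-1GB allocations fail here.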
What I Tried
| Workaround | Result |
|---|---|
| `--no-mmap` | Load succeeds but GPU not used (9.5 t/s) |
| `--no-direct-io` | Q6_K loads but GPU not used (4.4 t/s) |
| llama.cpp b8560 (latest) | Worse. Can’t even allocate 0.9GB |
| LM Studio mmap OFF | mmap=false is applied, but still ErrorOutOfDeviceMemory |
| PC reboot | Clears memory leak, 50+ t/s restored |
Post-Reboot Test Results
After discovering the memory leak, I rebooted and ran the same command from the article in a clean state.
llama-server.exe -m "Huihui-Qwen3.5-35B-A3B-abliterated.Q4_K_M.gguf" --port 8080 --ctx-size 4096 --reasoning-budget 0 --n-gpu-layers 99 --no-mmap
| Item | Result |
|---|---|
| Load | Success (Vulkan0 = 19,905 MiB) |
| Generation speed | 54.9 - 57.7 t/s |
| Article benchmark | 49.18 t/s |
| Output quality | Normal (Japanese output fine) |
Right after reboot, speeds exceeded the article’s benchmark. However, 272 MiB was placed in Vulkan_Host, so data isn’t entirely in VRAM (some goes through shared memory). On driver 26.2.2, no Vulkan_Host spillover occurred, confirming that 26.3.1 changed the memory placement logic.
Reproducing the Memory Leak
Comparison before and after reboot.
| State | Result |
|---|---|
| Before reboot | Device memory allocation of size 1025582592 failed — 1GB allocation failed |
| After reboot | Vulkan0 model buffer size = 19905.15 MiB — 19.9GB allocated successfully, 54.9 t/s |
Repeated Vulkan load failures (ErrorOutOfDeviceMemory) cause the driver to never release device memory. Eventually even 0.9GB can’t be allocated. Only a PC reboot recovers it. This means that “fail → retry” loops during testing progressively worsen the state. Large model load attempts need to be done carefully.
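Since every failed load worsens the leak, a load script should fail closed rather than retry. Below is a hypothetical wrapper, not a real llama.cpp tool; the failure markers are taken from the logs quoted above, and the simulated command stands in for the actual llama-server invocation.

```python
import subprocess

# Because a failed Vulkan load leaks device memory that only a reboot
# reclaims, blind "fail -> retry" loops make things strictly worse. Run the
# load command once and refuse to retry on the failure signature.
FAILURE_MARKERS = ("ErrorOutOfDeviceMemory", "Device memory allocation")

def try_load_once(cmd: list[str]) -> str:
    """Run cmd once; return 'ok', or 'reboot-needed' on a Vulkan OOM."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    output = proc.stdout + proc.stderr
    if any(marker in output for marker in FAILURE_MARKERS):
        return "reboot-needed"   # do NOT retry: each attempt leaks more VRAM
    return "ok" if proc.returncode == 0 else "failed"

# Simulated failing load (a real call would be the llama-server command line):
import sys
status = try_load_once([sys.executable, "-c",
    "print('ggml_vulkan: vk::Device::allocateMemory: ErrorOutOfDeviceMemory')"])
print(status)  # reboot-needed
```

The point of the wrapper is the missing retry loop: once the marker appears, the only productive next step on this driver is a reboot.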
Root Cause
Investigation revealed multiple overlapping issues.
- AMD Driver 26.3.1 Vulkan Memory Management Regression: The driver (built 2026-03-09) regressed Vulkan device memory management for the Radeon 8060S UMA. The driver reports 54GB of free Vulkan memory, but in practice some allocations land in shared memory, and large models (Q5_K_XL 24.6GB+) fail with ErrorOutOfDeviceMemory
- Vulkan Device Memory Leak: The driver doesn’t release memory on load failure. Repeated attempts exhaust device memory until even small allocations fail. Only recoverable by PC reboot
- Shared Memory Fallback: Part of the model data (the Vulkan_Host buffer) gets placed in shared memory instead of VRAM. This didn’t happen on driver 26.2.2. Performance is acceptable for now, but it’s not the clean all-in-VRAM state that 26.2.2 achieved
Fix: Change BIOS VRAM Allocation to 32GB/32GB
Changing the BIOS from 48GB VRAM / 16GB system to 32GB VRAM / 32GB system resolved the ErrorOutOfDeviceMemory.
| Model | 48/16 | 32/32 |
|---|---|---|
| Q4_K_M abliterated (19.7GB) | 54.9 t/s (only right after reboot) | Works fine |
| Q6_K (26.8GB) | ErrorOutOfDeviceMemory | 53.7 t/s |
Vulkan free memory reported values.
| Config | free memory | Q6_K |
|---|---|---|
| 48/16 | 54,305 MiB | Won’t load |
| 32/32 | 46,522 MiB | Loads (53.7 t/s) |
The reported VRAM is lower, yet the loadable model size is larger.
Why Less VRAM Loads Bigger Models
AMD driver 26.3.1 uses system RAM as a transfer buffer during model loading. With 48/16, the system side only has 16GB, and after OS + Parsec + Tailscale etc. consume about 7GB, the remaining ~9GB isn’t enough for transfer buffers, causing ErrorOutOfDeviceMemory.
With 32/32, the system side has about 25GB free, providing ample room for transfer buffers. Vulkan reports less VRAM on paper, but the effectively loadable model size increases.
graph LR
subgraph "48/16 Config"
A1["VRAM 48GB"] --> B1["Q6_K 26.8GB<br/>Load attempt"]
C1["System 16GB<br/>~9GB free"] --> D1["Transfer buffer<br/>insufficient"]
D1 --> E1["ErrorOutOfDeviceMemory"]
end
subgraph "32/32 Config"
A2["VRAM 32GB"] --> B2["Q6_K 26.8GB<br/>Load success"]
C2["System 32GB<br/>~25GB free"] --> D2["Transfer buffer<br/>sufficient"]
D2 --> B2
end
style E1 fill:#f44336,color:#fff
style B2 fill:#4CAF50,color:#fff
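The arithmetic above can be written down explicitly. The ~7GB OS baseline comes from the observation in this article; the assumption that the staging buffer needs roughly half the model size is a guess that happens to fit both data points, not documented driver behavior.

```python
# Back-of-the-envelope check for the 48/16 vs 32/32 result. Assumes driver
# 26.3.1 stages the model through system RAM during load (as observed here);
# the 7GB baseline and the half-model-size staging heuristic are assumptions.
def can_load(model_gb, vram_gb, system_gb, os_overhead_gb=7.0,
             transfer_frac=0.5):
    """True if the model fits VRAM and free system RAM can stage it."""
    fits_vram = model_gb <= vram_gb
    free_system = system_gb - os_overhead_gb
    has_staging = free_system >= model_gb * transfer_frac
    return fits_vram and has_staging

q6k = 26.8
print(can_load(q6k, vram_gb=48, system_gb=16))  # False: only ~9GB to stage
print(can_load(q6k, vram_gb=32, system_gb=32))  # True: ~25GB to stage
```

Under these assumptions the 32/32 split wins not because the GPU has more memory, but because the load path stops starving for system-side headroom.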
In the previous VRAM allocation analysis, the conclusion was “more VRAM is better.” But with driver 26.3.1, you also need headroom on the system RAM side. In a UMA environment, the VRAM/system RAM split directly affects driver behavior, so the optimal configuration can change with each driver update.
A few caveats.
- 397 MiB is placed in Vulkan_Host, so it’s not running entirely from VRAM
- On driver 26.2.2, Q6_K worked fine with 48/16, so this workaround is specific to driver 26.3.1
- Repeated Vulkan load failures cause memory leaks, so reboot immediately after a failure
Operating Procedure
- Set BIOS to 32GB VRAM / 32GB system
- Run llama-server directly with b8183 + `--no-mmap`
- If Vulkan load fails, reboot immediately (memory leaks make things progressively worse)
- LM Studio has separate Vulkan backend issues — direct llama-server execution recommended
Didn’t expect a BIOS settings change to fix this. Reducing VRAM from 48GB to 32GB and having Q6_K load successfully is counterintuitive. The previous VRAM allocation analysis concluded “more VRAM is better,” but after a driver change, the system RAM side needs headroom too. From the initial EVO-X2 setup to VRAM allocation analysis, the Ollama total failure, the driver update fixing everything, and now this regression plus BIOS workaround — this is the fourth round of troubleshooting. UMA + RDNA 3.5 is not mature yet.
Related Articles
- Qwen 3.5’s Radeon 8060S Total Failure Was Caused by AMD Drivers — Benchmark records from driver 26.2.2 getting 34-54 t/s. The baseline for this regression
- Optimizing VRAM and Memory Allocation on Strix Halo — VRAM allocation and memory management in UMA. First report of the shared memory priority issue
- Setting Up Local LLMs on the EVO-X2 — Initial setup and early Vulkan troubles
- Trying to Run abliterated Models on Ollama and Failing Completely — Qwen 3.5 total failure on the old driver
- Exposing Local LLM as an External API via VPN — EVO-X2 GPU inference exposed via Tailscale. When Vulkan breaks, this also falls back to CPU