Optimizing VRAM and Memory Allocation on Strix Halo for Local LLMs
Update (2026-03-01): The “shared memory priority” issue described in this article was likely caused by an older AMD driver version. Updating the driver to 26.2.2 or later places data correctly in VRAM, eliminating the need to reduce VRAM allocation. See Qwen 3.5 Failing on Radeon 8060S Was Caused by AMD Drivers for details.
This is a follow-up to Setting Up a Local LLM Environment on the EVO-X2. Here I cover the memory allocation pitfalls I hit on Strix Halo and how to fix them. More dedicated VRAM is not always better — in fact, reducing it can work better.
Strix Halo’s Unified Memory
The EVO-X2 (Strix Halo) shares 64GB of LPDDR5X between the CPU and GPU in a unified memory architecture. Same idea as Apple Silicon, but you need to explicitly split the memory between “dedicated VRAM” and “main memory” in the BIOS.
This split has a huge impact on LLM inference stability and speed.
BIOS VRAM Allocation and Its Trap
The factory default sets dedicated VRAM to 32GB. You'd think more VRAM would mean better LLM performance, but on this machine the opposite turned out to be true.
| Split | Dedicated VRAM | Main Memory | Result |
|---|---|---|---|
| 48GB/16GB | 48GB | 16GB | Crashes during load — not enough main memory |
| 32GB/32GB | 32GB | 32GB | Balanced, stable |
| 16GB/48GB | 16GB | 48GB | Recommended. Stable loading, no speed penalty for overflow |
| 8GB/56GB | 8GB | 56GB | Verified. 29.6GB model runs fine |
The “Shared GPU Memory Fills First” Problem
What happened with a 48GB VRAM configuration:
- Dedicated VRAM (48GB) was mostly empty
- Shared GPU memory (main memory side) was maxed out
- Total memory had plenty of room, but loading failed
LM Studio (Vulkan) treats Strix Halo as an integrated GPU (iGPU) and prioritizes shared memory over dedicated VRAM. If the main memory allocation is small, it hits the wall quickly.
Peak Memory Consumption During Model Loading
When loading a model, data always passes through main memory before being transferred to VRAM.
```
Disk → Main Memory (temp staging/conversion) → VRAM transfer
            ↑ Bottleneck here
```
Peak memory consumption when loading a 20GB model:
| Breakdown | Usage |
|---|---|
| OS baseline | ~10GB |
| Model data | 20GB |
| Conversion buffer (temporary) | 20GB |
| Peak total | ~50GB |
With only 16GB of main memory, the process crashes during this temporary staging phase. The issue isn’t the model’s final VRAM footprint — it’s the width of the “hallway” during loading.
Solutions
1. Reduce VRAM Allocation (Counterintuitive Fix)
Set VRAM to 8–16GB / main memory to 48–56GB in the BIOS.
- Wider “hallway” during loading — no more crashes
- Data that doesn’t fit in dedicated VRAM spills into shared memory (main memory)
- Strix Halo uses unified memory, so shared memory runs at the same speed
BIOS configuration steps:
- Spam ESC during reboot to enter BIOS
- Advanced → GFX Configuration
- iGPU Configuration → UMA_SPECIFIED
- UMA Frame buffer size → 8G (or 16G)
- F10 → Save & Reset
2. Add Virtual Memory (Page File)
As a safety net for loading spikes, allocate virtual memory on the SSD.
- Windows search → “Advanced system settings”
- Advanced tab → Performance “Settings”
- Advanced tab → Virtual Memory “Change”
- Uncheck “Automatically manage paging file size for all drives”
- Select C: drive → Custom size:
- Initial size: 32768 (32GB)
- Maximum size: 65536 (64GB)
- Click “Set” → OK → Reboot
3. LM Studio Settings
| Setting | Recommended | Effect |
|---|---|---|
| mmap (memory mapping) | OFF | Frees main memory copy immediately after loading |
| Keep model in memory | OFF | Don’t keep a backup in main memory |
| Offload KV cache to GPU memory | Depends | See table below |
Where to put the KV cache depends on the VRAM split:
| VRAM Split | KV Cache | Reason |
|---|---|---|
| 48GB/16GB | ON (in VRAM) | VRAM has room, saves main memory |
| 8–16GB/48–56GB | OFF (in main memory) | Main memory has room |
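The placement rule in the table boils down to "put the KV cache wherever there is headroom." As a sketch (`recommend_kv_cache_offload` is a hypothetical helper name, not an LM Studio API):

```python
def recommend_kv_cache_offload(dedicated_vram_gb: int, main_memory_gb: int) -> bool:
    """Return True if 'Offload KV cache to GPU memory' should be ON.

    Mirrors the table above: with a VRAM-heavy split, dedicated VRAM has
    room and main memory is scarce, so the cache belongs in VRAM.
    """
    return dedicated_vram_gb > main_memory_gb

print(recommend_kv_cache_offload(48, 16))  # True  -> KV cache in VRAM
print(recommend_kv_cache_offload(8, 56))   # False -> KV cache in main memory
```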
Real Test: VRAM 8GB / Main Memory 56GB
Actual measurements with big-tiger-gemma-27b-v3-heretic-v2 (29.6GB Q8_0):
| Item | Value |
|---|---|
| Context length | 35,870 |
| GPU offload | 25 layers |
| K Cache | Q8_0 quantization |
| Memory usage | Shared 12.8GB + Dedicated 7.7GB = ~20.5GB |
Even with just 8GB of dedicated VRAM, LM Studio used shared memory for GPU inference and ran the model without issues.
Gemma 3’s Memory Consumption Problem
Gemma 3 consumes an abnormal amount of memory compared to other models.
The cause is its vocabulary size: Gemma 3 uses a 256k-token vocabulary, double Llama 3's 128k, which inflates the embedding and output layers. On top of that, the KV cache grows linearly with context length, so increasing the context makes total memory usage balloon quickly.
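To put a rough number on the vocabulary overhead: the token-embedding matrix alone is vocab_size × hidden_size parameters, so doubling the vocabulary doubles it (and the matching output head). The hidden size (5376) and 2-byte f16 parameters below are assumptions for illustration, not verified specs:

```python
# Size of the token-embedding matrix alone, in GiB.
# hidden_size and f16 storage are assumed values for illustration.
def embedding_gib(vocab_size: int, hidden_size: int, bytes_per_param: int = 2) -> float:
    return vocab_size * hidden_size * bytes_per_param / 2**30

print(embedding_gib(256_000, 5376))  # Gemma 3-class 256k vocab
print(embedding_gib(128_000, 5376))  # Llama 3-class 128k vocab, same width
```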
Mitigating with KV Cache Quantization
| Config | K Cache | V Cache | Notes |
|---|---|---|---|
| Recommended | q4_0 | f16 | Compressing V tends to break outputs |
| Low memory | q4_0 | q8_0 | Don’t drop V below q8_0 |
V Cache quantization directly affects output quality. Dropping it to q4_0 causes the model to lose context and produce garbled responses. K Cache can go down to q4_0 with minimal quality impact.
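A sketch of how these choices translate into bytes. The model shape (62 layers, 16 KV heads, head dim 128) is assumed for a 27B-class model, and the bits-per-element figures approximate llama.cpp's block formats (q8_0 ≈ 8.5 bits, q4_0 ≈ 4.5 bits including block scales); Gemma 3's sliding-window layers would shrink the real number further:

```python
# Approximate KV cache size: 2 (K and V) x layers x kv_heads x head_dim x ctx.
# Bits per element are approximate, including quantization block scales.
BITS_PER_ELEMENT = {"f16": 16.0, "q8_0": 8.5, "q4_0": 4.5}

def kv_cache_gib(ctx_len: int, k_type: str, v_type: str,
                 n_layers: int = 62, n_kv_heads: int = 16,
                 head_dim: int = 128) -> float:
    """Estimated KV cache size in GiB for one sequence."""
    elems_per_side = n_layers * n_kv_heads * head_dim * ctx_len
    bits = elems_per_side * (BITS_PER_ELEMENT[k_type] + BITS_PER_ELEMENT[v_type])
    return bits / 8 / 2**30

print(round(kv_cache_gib(32768, "f16", "f16"), 1))   # full precision
print(round(kv_cache_gib(32768, "q4_0", "f16"), 1))  # recommended: compress K only
print(round(kv_cache_gib(32768, "q4_0", "q8_0"), 1)) # low-memory floor
```

The recommended q4_0/f16 combination roughly halves the full-precision cache while leaving V untouched, which is why it keeps quality intact.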
Gemma’s Censorship Problem
The Gemma family has some of the strictest safety guardrails of any model family. Forget NSFW content; it refuses to engage with even mildly sensitive topics.
Abliterated/uncensored variants have their limits too. Removing the refusal behavior doesn't help much when NSFW data was excluded from the training set in the first place: the model simply "doesn't know" those topics.
Alternative Models
Models outside Gemma that handle NSFW content better:
- Command R (35B): Obedient, fluent Japanese, 128k context
- Mistral Nemo (12B): Lightweight, less censored, good Japanese
- Llama-3.1-70B Abliterated: Best performance, no censorship (but very memory-hungry)
For now, MS3.2-24B-Magnum-Diamond hits the best balance overall.