Optimizing VRAM and Memory Allocation on Strix Halo for Local LLMs
Update (2026-03-01): The “shared memory priority” issue described in this article was likely caused by an older AMD driver version. Updating the driver to 26.2.2 or later places data correctly in VRAM, eliminating the need to reduce VRAM allocation. See Qwen 3.5 Failing on Radeon 8060S Was Caused by AMD Drivers for details.
This is a follow-up to Setting Up a Local LLM Environment on the EVO-X2. Here I cover the memory allocation pitfalls I hit on Strix Halo and how to fix them. More dedicated VRAM is not always better — in fact, reducing it can work better.
Strix Halo’s Unified Memory
The EVO-X2 (Strix Halo) shares 64GB of LPDDR5X between the CPU and GPU in a unified memory architecture. Same idea as Apple Silicon, but you need to explicitly split the memory between “dedicated VRAM” and “main memory” in the BIOS.
This split has a huge impact on LLM inference stability and speed.
BIOS VRAM Allocation and Its Trap
The factory default sets dedicated VRAM to 32GB. You'd think more VRAM would mean better LLM performance, but on this machine the opposite turned out to be true.
| Split | Dedicated VRAM | Main Memory | Result |
|---|---|---|---|
| 48GB/16GB | 48GB | 16GB | Crashes during load — not enough main memory |
| 32GB/32GB | 32GB | 32GB | Balanced, stable |
| 16GB/48GB | 16GB | 48GB | Recommended. Stable loading, no speed penalty for overflow |
| 8GB/56GB | 8GB | 56GB | Verified. 29.6GB model runs fine |
The “Shared GPU Memory Fills First” Problem
What happened with a 48GB VRAM configuration:
- Dedicated VRAM (48GB) was mostly empty
- Shared GPU memory (main memory side) was maxed out
- Total memory had plenty of room, but loading failed
LM Studio (Vulkan) treats Strix Halo as an integrated GPU (iGPU) and prioritizes shared memory over dedicated VRAM. If the main memory allocation is small, it hits the wall quickly.
Peak Memory Consumption During Model Loading
When loading a model, data always passes through main memory before being transferred to VRAM.
```
Disk → Main Memory (temp staging/conversion) → VRAM transfer
            ↑ Bottleneck here
```
Peak memory consumption when loading a 20GB model:
| Breakdown | Usage |
|---|---|
| OS baseline | ~10GB |
| Model data | 20GB |
| Conversion buffer (temporary) | 20GB |
| Peak total | ~50GB |
With only 16GB of main memory, the process crashes during this temporary staging phase. The issue isn’t the model’s final VRAM footprint — it’s the width of the “hallway” during loading.
Solutions
1. Reduce VRAM Allocation (Counterintuitive Fix)
Set VRAM to 8–16GB / main memory to 48–56GB in the BIOS.
- Wider “hallway” during loading — no more crashes
- Data that doesn’t fit in dedicated VRAM spills into shared memory (main memory)
- Strix Halo uses unified memory, so shared memory runs at the same speed
BIOS configuration steps:
- Spam ESC during reboot to enter BIOS
- Advanced → GFX Configuration
- iGPU Configuration → UMA_SPECIFIED
- UMA Frame buffer size → 8G (or 16G)
- F10 → Save & Reset
2. Add Virtual Memory (Page File)
As a safety net for loading spikes, allocate virtual memory on the SSD.
- Windows search → “Advanced system settings”
- Advanced tab → Performance “Settings”
- Advanced tab → Virtual Memory “Change”
- Uncheck “Automatically manage paging file size for all drives”
- Select C: drive → Custom size:
- Initial size: 32768 (32GB)
- Maximum size: 65536 (64GB)
- Click “Set” → OK → Reboot
3. LM Studio Settings
| Setting | Recommended | Effect |
|---|---|---|
| mmap (memory mapping) | OFF | Frees main memory copy immediately after loading |
| Keep model in memory | OFF | Don’t keep a backup in main memory |
| Offload KV cache to GPU memory | Depends | See table below |
Where to put the KV cache depends on the VRAM split:
| VRAM Split | KV Cache | Reason |
|---|---|---|
| 48GB/16GB | ON (in VRAM) | VRAM has room, saves main memory |
| 8–16GB/48–56GB | OFF (in main memory) | Main memory has room |
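The placement rule in the table boils down to "put the KV cache wherever there is headroom." As a sketch (`recommend_kv_cache_offload` is a hypothetical helper name, not an LM Studio API):

```python
def recommend_kv_cache_offload(dedicated_vram_gb: int, main_memory_gb: int) -> bool:
    """Return True if 'Offload KV cache to GPU memory' should be ON.

    Mirrors the table above: with a VRAM-heavy split, dedicated VRAM has
    room and main memory is scarce, so the cache belongs in VRAM.
    """
    return dedicated_vram_gb > main_memory_gb

print(recommend_kv_cache_offload(48, 16))  # True  -> KV cache in VRAM
print(recommend_kv_cache_offload(8, 56))   # False -> KV cache in main memory
```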
Real Test: VRAM 8GB / Main Memory 56GB
Actual measurements with big-tiger-gemma-27b-v3-heretic-v2 (29.6GB Q8_0):
| Item | Value |
|---|---|
| Context length | 35,870 |
| GPU offload | 25 layers |
| K Cache | Q8_0 quantization |
| Memory usage | Shared 12.8GB + Dedicated 7.7GB = ~20.5GB |
Even with just 8GB of dedicated VRAM, LM Studio used shared memory for GPU inference and ran the model without issues.
Gemma 3’s Memory Consumption Problem
Gemma 3 consumes an abnormal amount of memory compared to other models.
The cause is its vocabulary size: Gemma 3 uses a 256k-token vocabulary, double Llama 3's 128k, which inflates the embedding and output layers. On top of that, the KV cache grows linearly with context length, so increasing the context makes total memory usage balloon quickly.
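To put a rough number on the vocabulary overhead: the token-embedding matrix alone is vocab_size × hidden_size parameters, so doubling the vocabulary doubles it (and the matching output head). The hidden size (5376) and 2-byte f16 parameters below are assumptions for illustration, not verified specs:

```python
# Size of the token-embedding matrix alone, in GiB.
# hidden_size and f16 storage are assumed values for illustration.
def embedding_gib(vocab_size: int, hidden_size: int, bytes_per_param: int = 2) -> float:
    return vocab_size * hidden_size * bytes_per_param / 2**30

print(embedding_gib(256_000, 5376))  # Gemma 3-class 256k vocab
print(embedding_gib(128_000, 5376))  # Llama 3-class 128k vocab, same width
```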
Mitigating with KV Cache Quantization
| Config | K Cache | V Cache | Notes |
|---|---|---|---|
| Recommended | q4_0 | f16 | Compressing V tends to break outputs |
| Low memory | q4_0 | q8_0 | Don’t drop V below q8_0 |
V Cache quantization directly affects output quality. Dropping it to q4_0 causes the model to lose context and produce garbled responses. K Cache can go down to q4_0 with minimal quality impact.
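A sketch of how these choices translate into bytes. The model shape (62 layers, 16 KV heads, head dim 128) is assumed for a 27B-class model, and the bits-per-element figures approximate llama.cpp's block formats (q8_0 ≈ 8.5 bits, q4_0 ≈ 4.5 bits including block scales); Gemma 3's sliding-window layers would shrink the real number further:

```python
# Approximate KV cache size: 2 (K and V) x layers x kv_heads x head_dim x ctx.
# Bits per element are approximate, including quantization block scales.
BITS_PER_ELEMENT = {"f16": 16.0, "q8_0": 8.5, "q4_0": 4.5}

def kv_cache_gib(ctx_len: int, k_type: str, v_type: str,
                 n_layers: int = 62, n_kv_heads: int = 16,
                 head_dim: int = 128) -> float:
    """Estimated KV cache size in GiB for one sequence."""
    elems_per_side = n_layers * n_kv_heads * head_dim * ctx_len
    bits = elems_per_side * (BITS_PER_ELEMENT[k_type] + BITS_PER_ELEMENT[v_type])
    return bits / 8 / 2**30

print(round(kv_cache_gib(32768, "f16", "f16"), 1))   # full precision
print(round(kv_cache_gib(32768, "q4_0", "f16"), 1))  # recommended: compress K only
print(round(kv_cache_gib(32768, "q4_0", "q8_0"), 1)) # low-memory floor
```

The recommended q4_0/f16 combination roughly halves the full-precision cache while leaving V untouched, which is why it keeps quality intact.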
Gemma’s Censorship Problem
The Gemma family has some of the strictest safety guardrails of any model family. Forget NSFW content; it refuses to engage with even mildly sensitive topics.
Abliterated/uncensored variants have their limits too. Removing the refusal behavior doesn't help much when NSFW data was excluded from the training set in the first place: the model simply "doesn't know" those topics.
Alternative Models
Models outside Gemma that handle NSFW content better:
- Command R (35B): Obedient, fluent Japanese, 128k context
- Mistral Nemo (12B): Lightweight, less censored, good Japanese
- Llama-3.1-70B Abliterated: Best performance, no censorship (but very memory-hungry)
For now, MS3.2-24B-Magnum-Diamond hits the best balance overall.