Optimizing VRAM and Memory Allocation on Strix Halo for Local LLMs

Update (2026-03-01): The “shared memory priority” issue described in this article was likely caused by an older AMD driver version. Updating the driver to 26.2.2 or later places data correctly in VRAM, eliminating the need to reduce VRAM allocation. See Qwen 3.5 Failing on Radeon 8060S Was Caused by AMD Drivers for details.

This is a follow-up to Setting Up a Local LLM Environment on the EVO-X2. Here I cover the memory allocation pitfalls I hit on Strix Halo and how to fix them. More dedicated VRAM is not always better — in fact, reducing it can work better.

Strix Halo’s Unified Memory

The EVO-X2 (Strix Halo) shares 64GB of LPDDR5X between the CPU and GPU in a unified memory architecture. Same idea as Apple Silicon, but you need to explicitly split the memory between “dedicated VRAM” and “main memory” in the BIOS.

This split has a huge impact on LLM inference stability and speed.

BIOS VRAM Allocation and Its Trap

The factory default sets dedicated VRAM to 32GB. You’d think more VRAM means better LLM performance, but the opposite is true.

| Split | Dedicated VRAM | Main Memory | Result |
|---|---|---|---|
| 48GB/16GB | 48GB | 16GB | Crashes during load: not enough main memory |
| 32GB/32GB | 32GB | 32GB | Balanced, stable |
| 16GB/48GB | 16GB | 48GB | Recommended. Stable loading, no speed penalty for overflow |
| 8GB/56GB | 8GB | 56GB | Verified. 29.6GB model runs fine |

The “Shared GPU Memory Fills First” Problem

What happened with a 48GB VRAM configuration:

  • Dedicated VRAM (48GB) was mostly empty
  • Shared GPU memory (main memory side) was maxed out
  • Total memory had plenty of room, but loading failed

LM Studio (Vulkan) treats Strix Halo as an integrated GPU (iGPU) and prioritizes shared memory over dedicated VRAM. If the main memory allocation is small, it hits the wall quickly.

Peak Memory Consumption During Model Loading

When loading a model, data always passes through main memory before being transferred to VRAM.

Disk → Main Memory (temp staging/conversion) → VRAM transfer
        ↑ Bottleneck here

Peak memory consumption when loading a 20GB model:

| Breakdown | Usage |
|---|---|
| OS baseline | ~10GB |
| Model data | 20GB |
| Conversion buffer (temporary) | 20GB |
| Peak total | ~50GB |

With only 16GB of main memory, the process crashes during this temporary staging phase. The issue isn’t the model’s final VRAM footprint — it’s the width of the “hallway” during loading.
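The arithmetic above can be sanity-checked with a small back-of-the-envelope script. The ~10GB OS baseline and the temporary conversion buffer equal to the model size are assumptions taken from the table above, not measured constants:

```python
# Rough peak main-memory estimate for loading a model into VRAM.
# Assumptions (from the table above): ~10 GB OS baseline, plus a temporary
# conversion buffer roughly the size of the model during staging.

OS_BASELINE_GB = 10.0

def peak_load_memory_gb(model_gb: float) -> float:
    """Peak main-memory demand while a model is staged for VRAM transfer."""
    return OS_BASELINE_GB + model_gb + model_gb  # model data + temp buffer

def fits_during_load(model_gb: float, main_memory_gb: float) -> bool:
    """True if the main-memory 'hallway' is wide enough for the load spike."""
    return peak_load_memory_gb(model_gb) <= main_memory_gb

if __name__ == "__main__":
    model = 20.0  # GB, the example from the article
    print(f"peak ~ {peak_load_memory_gb(model):.0f} GB")
    for main in (16, 56):
        ok = fits_during_load(model, main)
        print(f"main memory {main} GB: {'ok' if ok else 'crash risk'}")
```

For the 20GB model this reproduces the ~50GB peak, and shows why the 16GB main-memory split fails while 56GB has headroom to spare.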

Solutions

1. Reduce VRAM Allocation (Counterintuitive Fix)

Set VRAM to 8–16GB / main memory to 48–56GB in the BIOS.

  • Wider “hallway” during loading — no more crashes
  • Data that doesn’t fit in dedicated VRAM spills into shared memory (main memory)
  • Strix Halo uses unified memory, so shared memory runs at the same speed

BIOS configuration steps:

  1. Spam ESC during reboot to enter BIOS
  2. Advanced → GFX Configuration
  3. iGPU Configuration → UMA_SPECIFIED
  4. UMA Frame buffer size → 8G (or 16G)
  5. F10 → Save & Reset

2. Add Virtual Memory (Page File)

As a safety net for loading spikes, allocate virtual memory on the SSD.

  1. Windows search → “Advanced system settings”
  2. Advanced tab → Performance “Settings”
  3. Advanced tab → Virtual Memory “Change”
  4. Uncheck “Automatically manage paging file size for all drives”
  5. Select C: drive → Custom size:
    • Initial size: 32768 (32GB)
    • Maximum size: 65536 (64GB)
  6. Click “Set” → OK → Reboot

3. LM Studio Settings

| Setting | Recommended | Effect |
|---|---|---|
| mmap (memory mapping) | OFF | Frees the main-memory copy immediately after loading |
| Keep model in memory | OFF | Don't keep a backup copy in main memory |
| Offload KV cache to GPU memory | Depends | See table below |

Where to put the KV cache depends on the VRAM split:

| VRAM Split | KV Cache | Reason |
|---|---|---|
| 48GB/16GB | ON (in VRAM) | VRAM has room; saves main memory |
| 8–16GB/48–56GB | OFF (in main memory) | Main memory has room |
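The placement rule in that table boils down to a one-line heuristic: put the KV cache on whichever side of the split has more room. This is my reading of the table, sketched as a helper, not an LM Studio API:

```python
def offload_kv_cache_to_gpu(dedicated_vram_gb: float, main_memory_gb: float) -> bool:
    """Decide the 'Offload KV cache to GPU memory' toggle from the VRAM split.

    Encodes the table above: a 48GB/16GB split puts the cache in VRAM (ON);
    an 8-16GB/48-56GB split keeps it in main memory (OFF).
    """
    return dedicated_vram_gb > main_memory_gb

print(offload_kv_cache_to_gpu(48, 16))  # VRAM-heavy split: offload ON
print(offload_kv_cache_to_gpu(8, 56))   # main-memory-heavy split: offload OFF
```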

Real Test: VRAM 8GB / Main Memory 56GB

Actual measurements with big-tiger-gemma-27b-v3-heretic-v2 (29.6GB Q8_0):

| Item | Value |
|---|---|
| Context length | 35,870 |
| GPU offload | 25 layers |
| K Cache | Q8_0 quantization |
| Memory usage | Shared 12.8GB + Dedicated 7.7GB = ~20.5GB |

Even with just 8GB of dedicated VRAM, LM Studio used shared memory for GPU inference and ran the model without issues.

Gemma 3’s Memory Consumption Problem

Gemma 3 consumes an abnormal amount of memory compared to other models.

One cause is its vocabulary size: Gemma 3 uses a 256k-token vocabulary, double Llama 3's 128k, which inflates the embedding and output-projection weights. On top of that, the KV cache grows linearly with context length, so raising the context makes total memory balloon quickly.

Mitigating with KV Cache Quantization

| Config | K Cache | V Cache | Notes |
|---|---|---|---|
| Recommended | q4_0 | f16 | Compressing V tends to break outputs |
| Low memory | q4_0 | q8_0 | Don't drop V below q8_0 |

V Cache quantization directly affects output quality. Dropping it to q4_0 causes the model to lose context and produce garbled responses. K Cache can go down to q4_0 with minimal quality impact.
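To see what these cache types cost in memory, here is a generic transformer KV-cache size estimate. The byte counts per element match llama.cpp's block formats (q8_0: 34 bytes per 32 elements, q4_0: 18 bytes per 32 elements); the layer/head dimensions in the example are illustrative placeholders, not Gemma 3's actual config:

```python
# Generic KV-cache size: K and V each store
# n_layers * context * n_kv_heads * head_dim elements.
BYTES_PER_ELT = {"f16": 2.0, "q8_0": 34 / 32, "q4_0": 18 / 32}

def kv_cache_gb(context: int, n_layers: int, n_kv_heads: int, head_dim: int,
                k_type: str = "f16", v_type: str = "f16") -> float:
    """Approximate KV-cache footprint in GiB for a given cache quantization."""
    per_cache = context * n_layers * n_kv_heads * head_dim  # elements in K (or V)
    total_bytes = per_cache * (BYTES_PER_ELT[k_type] + BYTES_PER_ELT[v_type])
    return total_bytes / 1024**3

# Illustrative 27B-class dims (assumed, not official): 46 layers, 16 KV heads, head_dim 128
print(kv_cache_gb(32768, 46, 16, 128))                  # f16/f16 baseline
print(kv_cache_gb(32768, 46, 16, 128, "q4_0", "q8_0"))  # low-memory config
```

Under these assumed dimensions, the q4_0/q8_0 combination cuts the cache to well under half of the f16 baseline, which is where the long-context savings come from.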

Gemma’s Censorship Problem

The Gemma family has some of the strictest safety guardrails of any model. Even before NSFW content, it refuses to engage with mildly sensitive topics.

Abliterated/uncensored variants have their limits too. Even after removing the censorship layers, NSFW data was excluded from the training data itself, so the model simply “doesn’t know” about those topics.

Alternative Models

Models outside Gemma that handle NSFW content better:

  • Command R (35B): Obedient, fluent Japanese, 128k context
  • Mistral Nemo (12B): Lightweight, less censored, good Japanese
  • Llama-3.1-70B Abliterated: Best performance, no censorship (but very memory-hungry)

For now, MS3.2-24B-Magnum-Diamond hits the best balance overall.