Setting Up a Local LLM on the GMKtec EVO-X2 (Strix Halo)
Update (2026-03-01): At the time of testing, my AMD driver was outdated, which caused Vulkan memory management and GPU inference issues with some models. Updating the driver to 26.2.2 or later fixes these problems. See The Reason Qwen 3.5 Completely Failed on Radeon 8060S Was the AMD Driver for details.
I wanted a local LLM for character RP that could handle NSFW content. Cloud APIs randomly block NSFW words, and I wanted full control over character speech patterns. I happened to have a GMKtec EVO-X2 (Strix Halo mini PC) on hand, so I decided to set up an LLM server on it.
Related articles:
- Figuring Out VRAM and Memory Allocation on Strix Halo
- Exposing a Local LLM as an External API via VPN
Hardware: GMKtec EVO-X2
| Item | Spec |
|---|---|
| CPU | AMD Ryzen AI Max+ 395 (16 cores / 32 threads) |
| RAM | 64GB LPDDR5X (unified memory) |
| iGPU | Radeon 8060S |
The key feature of Strix Halo is its unified memory architecture where CPU and GPU share the same memory pool. Same idea as Apple Silicon, and it works great for LLM inference.
Initial Troubles
Only 32GB of Memory Recognized
I should have had 64GB, but only 32GB was usable. Turns out the factory settings had reserved half the memory for the GPU.
Fix:
- Mash ESC during reboot to enter BIOS
- Advanced → GFX Configuration
- iGPU Configuration → UMA_SPECIFIED
- UMA Frame buffer size → Change to 2G
- F10 → Save & Reset
This brought usable memory up to 61.6GB. For details on VRAM allocation, see the memory allocation article.
Performance Mode Wasn’t Enabled
Running a 72B model only gave me 2 tokens/s. The power mode was set to Quiet/Balanced.
The EVO-X2 has a P-MODE button on the front (separate from the power button). Press it until you see the red meter icon (Performance mode).
Ollama vs LM Studio: LM Studio Is the Only Choice on Strix Halo
I tried Ollama first, but it doesn’t play nice with Strix Halo.
| Runtime | Load | GPU Inference | Notes |
|---|---|---|---|
| LM Studio | OK | OK | GPU inference works via Vulkan through shared memory |
| Ollama | Fail | - | Crashes on load (even without GPU) |
The reason comes down to the backend:
- LM Studio: Vulkan backend → correctly detects the Strix Halo GPU
- Ollama: ROCm backend → doesn’t support Strix Halo (gfx1151)
When Ollama can’t find a supported GPU, it falls back to CPU inference, but sometimes it crashes during loading before even getting there. You can try overriding the version with the HSA_OVERRIDE_GFX_VERSION environment variable, but it’s unreliable.
Bottom line: there’s no reason to use Ollama on the EVO-X2 (Strix Halo).
LM Studio Setup
Installation and Server Startup
- Download and install the Windows version from lmstudio.ai
- Select “Local Server” from the left menu
- Load a model
- Click “Start Server” to launch the API server (default: localhost:1234)
External Access Configuration
Enable “Serve on Local Network” in LM Studio’s settings. This makes it listen on 0.0.0.0:1234, allowing connections via Tailscale.
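Before wiring up a client, it helps to confirm the server is reachable (locally or over Tailscale). A minimal sketch against the OpenAI-compatible model-listing endpoint; the helper names and the Tailscale IP in the comment are placeholders, not from my setup:

```python
import json
import urllib.request

def model_ids(body):
    """Extract model ids from a /v1/models response body."""
    return [m["id"] for m in body.get("data", [])]

def list_models(base_url):
    """GET /v1/models from an LM Studio server and return the ids."""
    with urllib.request.urlopen(base_url + "/v1/models") as resp:
        return model_ids(json.load(resp))

# e.g. list_models("http://100.64.0.1:1234")  # hypothetical Tailscale address
```

If this returns your loaded model's id, the endpoint is up and routable.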
Two Types of LM Studio API Endpoints
LM Studio has two types of API endpoints.
| Endpoint | Format | Purpose |
|---|---|---|
| /v1/chat/completions | OpenAI-compatible | Recommended - conversation history is properly recognized |
| /api/v1/chat | Native API | Conversations are concatenated as text, losing context |
OpenAI-Compatible API (Recommended)
Request:

```json
{
  "model": "ms3.2-24b-magnum-diamond",
  "messages": [
    {"role": "system", "content": "システムプロンプト"},
    {"role": "user", "content": "こんにちは"},
    {"role": "assistant", "content": "やっほー!"},
    {"role": "user", "content": "元気?"}
  ],
  "temperature": 0.4,
  "max_tokens": 100
}
```
Response:

```json
{
  "choices": [
    {
      "message": {
        "role": "assistant",
        "content": "元気だよ〜!"
      }
    }
  ]
}
```
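Turning the request above into a client is straightforward. A minimal sketch using only the Python standard library; the model name and default port match the setup described here, while the helper names are mine:

```python
import json
import urllib.request

def build_chat_payload(system_prompt, history, user_message,
                       model="ms3.2-24b-magnum-diamond",
                       temperature=0.4, max_tokens=100):
    """Assemble an OpenAI-compatible /v1/chat/completions payload.

    history is a list of (role, content) tuples for prior turns,
    so the server sees each turn as a structured message.
    """
    messages = [{"role": "system", "content": system_prompt}]
    messages += [{"role": r, "content": c} for r, c in history]
    messages.append({"role": "user", "content": user_message})
    return {
        "model": model,
        "messages": messages,
        "temperature": temperature,
        "max_tokens": max_tokens,
    }

def chat(payload, base_url="http://localhost:1234"):
    """POST the payload to LM Studio and return the assistant reply."""
    req = urllib.request.Request(
        base_url + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Because each turn is a separate message object, the model receives the full role-tagged history rather than a flattened transcript.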
Native API (Not Recommended)
```json
{
  "model": "...",
  "system_prompt": "...",
  "input": "ユーザー: こんにちは\nキャラ: やっほー\nユーザー: 元気?\nキャラ:"
}
```
Conversations are just concatenated as text, making it hard for the model to understand context. For character RP where conversational flow matters, use the OpenAI-compatible API.
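To make the difference concrete, here is an illustrative sketch (names and function are mine, not LM Studio's) of the flattening the native format forces on you: role boundaries survive only as text prefixes.

```python
def to_native_input(history, user_name="ユーザー", char_name="キャラ"):
    """Flatten structured chat turns into the native API's single text
    field. Roles become mere text prefixes, which is why the model
    tracks conversational context poorly with this format."""
    lines = [f"{user_name if role == 'user' else char_name}: {text}"
             for role, text in history]
    lines.append(f"{char_name}:")  # trailing prompt for the next reply
    return "\n".join(lines)
```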
Model Comparison and Selection
I tested several models against these requirements: NSFW support, Japanese language, and character RP capability.
| Model | NSFW | Japanese | Notes |
|---|---|---|---|
| huihui-mistral-abliterated | OK | Fair | Chinese text leaking in, inconsistent speech patterns |
| MS3.2-24B-Magnum-Diamond | OK | Fair | Currently using. Tends toward formal/refined speech, verbose |
| PaintedFantasy-v4-24B | ? | Bad | Japanese completely broken (romanized output) |
| Gemma 27B uncensored | No | - | Complete refusal: “I don’t want to talk about that” |
| Cydonia-24B-v4.1 | No | - | NSFW refusal |
| Umievo-Gleipnir-7B | No | OK | Natural Japanese but refuses NSFW |
Even abliterated/uncensored Gemma models have limits. The training data itself has NSFW content stripped out, so even with censorship removed, the model simply “doesn’t know” the content. More on this in the Gemma section of the memory allocation article.
MS3.2-24B-Magnum-Diamond came out on top for NSFW support, conversational coherence, and passable Japanese. The Magnum family is trained for English creative writing and roleplay, which gives it strong NSFW tolerance.
Final Configuration
| Item | Setting |
|---|---|
| Model | MS3.2-24B-Magnum-Diamond (GGUF Q4_K_M) |
| Runtime | LM Studio |
| Endpoint | /v1/chat/completions (OpenAI-compatible) |
| temperature | 0.4 |
| max_tokens | 100 |
Confirmed Working
- No blocks on NSFW words
- Understands conversational context
- Follows character settings in conversation
Speed
- ~11 tokens/s (GPU inference)
- Short replies come back in a few seconds
Remaining Issues
- Speech pattern problem: Magnum tends to slip into a refined/formal feminine Japanese style. System prompt instructions to avoid this don’t fully work
- Verbosity problem: Even with max_tokens=100, responses sometimes run to 3-4 sentences
- Japanese + NSFW compatibility: Haven’t found a model that nails both perfectly. Casual Japanese conversation is underrepresented in training data, so speech patterns tend to sound unnatural
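For the verbosity problem, one crude client-side workaround (my own suggestion, not part of the setup above) is trimming replies to a sentence budget after the fact, since max_tokens only hard-truncates mid-sentence:

```python
import re

def clip_sentences(text, max_sentences=2):
    """Keep at most max_sentences, splitting on Japanese/English
    sentence-ending punctuation. A post-processing band-aid, not a
    fix for the model's verbosity."""
    parts = [p for p in re.split(r"(?<=[。!?.!?])\s*", text) if p]
    return "".join(parts[:max_sentences])
```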
Future candidates to test: Lumimaid-Magnum-v4-12B, Vecteus-v1, big-tiger-gemma-27b-v3-heretic-v2.
Ollama-Era Testing Notes (2026-02-04)
Notes from when I was still testing with Ollama before migrating to LM Studio.
Qwen 32B vs 72B character test:
| Item | 32B | 72B |
|---|---|---|
| JSON format | Ignored | ✓ |
| Response length | Too long | ✓ Appropriate |
| Conversation naturalness | Formulaic | ✓ Natural |
| Instruction following | Weak | ✓ Good |
| NSFW support | ✓ | ✓ |
72B had great quality but at ~1.5 t/s on Ollama’s CPU inference, it wasn’t practical. Even on an M1 Max 64GB, it topped out at 5.3 t/s.
After migrating to LM Studio’s GPU inference, the 24B model hit 11 t/s, eliminating the need for the 72B.