Setting Up a Local LLM on the GMKtec EVO-X2 (Strix Halo)
Update (2026-03-01): At the time of testing, my AMD driver was outdated, which caused Vulkan memory management and GPU inference issues with some models. Updating the driver to 26.2.2 or later fixes these problems. See The Reason Qwen 3.5 Completely Failed on Radeon 8060S Was the AMD Driver for details.
I wanted a local LLM for character RP that could handle NSFW content. Cloud APIs randomly block NSFW words, and I wanted full control over character speech patterns. I happened to have a GMKtec EVO-X2 (Strix Halo mini PC) on hand, so I decided to set up an LLM server on it.
Related articles:
- Figuring Out VRAM and Memory Allocation on Strix Halo
- Exposing a Local LLM as an External API via VPN
Hardware: GMKtec EVO-X2
| Item | Spec |
|---|---|
| CPU | AMD Ryzen AI Max+ 395 (16 cores / 32 threads) |
| RAM | 64GB LPDDR5X (unified memory) |
| iGPU | Radeon 8060S |
The key feature of Strix Halo is its unified memory architecture where CPU and GPU share the same memory pool. Same idea as Apple Silicon, and it works great for LLM inference.
Initial Troubles
Only 32GB of Memory Recognized
I should have had 64GB, but only 32GB was usable. Turns out the factory settings had reserved half the memory for the GPU.
Fix:
- Mash ESC during reboot to enter BIOS
- Advanced → GFX Configuration
- iGPU Configuration → UMA_SPECIFIED
- UMA Frame buffer size → Change to 2G
- F10 → Save & Reset
This brought usable memory up to 61.6GB. For details on VRAM allocation, see the memory allocation article.
Performance Mode Wasn’t Enabled
Running a 72B model only gave me 2 tokens/s. The power mode was set to Quiet/Balanced.
The EVO-X2 has a P-MODE button on the front (separate from the power button). Press it until you see the red meter icon (Performance mode).
Ollama vs LM Studio: LM Studio Is the Only Choice on Strix Halo
I tried Ollama first, but it doesn’t play nice with Strix Halo.
| Runtime | Load | GPU Inference | Notes |
|---|---|---|---|
| LM Studio | OK | OK | GPU inference works via Vulkan through shared memory |
| Ollama | Fail | - | Crashes on load (even without GPU) |
The reason comes down to the backend:
- LM Studio: Vulkan backend → correctly detects the Strix Halo GPU
- Ollama: ROCm backend → doesn’t support Strix Halo (gfx1151)
When Ollama can’t find a supported GPU, it falls back to CPU inference, but sometimes it crashes during loading before even getting there. You can try overriding the version with the HSA_OVERRIDE_GFX_VERSION environment variable, but it’s unreliable.
Bottom line: there’s no reason to use Ollama on the EVO-X2 (Strix Halo).
LM Studio Setup
Installation and Server Startup
- Download and install the Windows version from lmstudio.ai
- Select “Local Server” from the left menu
- Load a model
- Click “Start Server” to launch the API server (default: localhost:1234)
External Access Configuration
Enable “Serve on Local Network” in LM Studio’s settings. This makes it listen on 0.0.0.0:1234, allowing connections via Tailscale.
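Before wiring up a client, it helps to confirm the server is reachable (locally or over Tailscale). A minimal sketch against the OpenAI-compatible model-listing endpoint; the helper names and the Tailscale IP in the comment are placeholders, not from my setup:

```python
import json
import urllib.request

def model_ids(body):
    """Extract model ids from a /v1/models response body."""
    return [m["id"] for m in body.get("data", [])]

def list_models(base_url):
    """GET /v1/models from an LM Studio server and return the ids."""
    with urllib.request.urlopen(base_url + "/v1/models") as resp:
        return model_ids(json.load(resp))

# e.g. list_models("http://100.64.0.1:1234")  # hypothetical Tailscale address
```

If this returns your loaded model's id, the endpoint is up and routable.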
Two Types of LM Studio API Endpoints
LM Studio has two types of API endpoints.
| Endpoint | Format | Purpose |
|---|---|---|
| /v1/chat/completions | OpenAI-compatible | Recommended - conversation history is properly recognized |
| /api/v1/chat | Native API | Conversations are concatenated as text, losing context |
OpenAI-Compatible API (Recommended)
Request:

```json
{
  "model": "ms3.2-24b-magnum-diamond",
  "messages": [
    {"role": "system", "content": "システムプロンプト"},
    {"role": "user", "content": "こんにちは"},
    {"role": "assistant", "content": "やっほー!"},
    {"role": "user", "content": "元気?"}
  ],
  "temperature": 0.4,
  "max_tokens": 100
}
```
Response:

```json
{
  "choices": [
    {
      "message": {
        "role": "assistant",
        "content": "元気だよ〜!"
      }
    }
  ]
}
```
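Turning the request above into a client is straightforward. A minimal sketch using only the Python standard library; the model name and default port match the setup described here, while the helper names are mine:

```python
import json
import urllib.request

def build_chat_payload(system_prompt, history, user_message,
                       model="ms3.2-24b-magnum-diamond",
                       temperature=0.4, max_tokens=100):
    """Assemble an OpenAI-compatible /v1/chat/completions payload.

    history is a list of (role, content) tuples for prior turns,
    so the server sees each turn as a structured message.
    """
    messages = [{"role": "system", "content": system_prompt}]
    messages += [{"role": r, "content": c} for r, c in history]
    messages.append({"role": "user", "content": user_message})
    return {
        "model": model,
        "messages": messages,
        "temperature": temperature,
        "max_tokens": max_tokens,
    }

def chat(payload, base_url="http://localhost:1234"):
    """POST the payload to LM Studio and return the assistant reply."""
    req = urllib.request.Request(
        base_url + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Because each turn is a separate message object, the model receives the full role-tagged history rather than a flattened transcript.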
Native API (Not Recommended)
```json
{
  "model": "...",
  "system_prompt": "...",
  "input": "ユーザー: こんにちは\nキャラ: やっほー\nユーザー: 元気?\nキャラ:"
}
```
Conversations are just concatenated as text, making it hard for the model to understand context. For character RP where conversational flow matters, use the OpenAI-compatible API.
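To make the difference concrete, here is an illustrative sketch (names and function are mine, not LM Studio's) of the flattening the native format forces on you: role boundaries survive only as text prefixes.

```python
def to_native_input(history, user_name="ユーザー", char_name="キャラ"):
    """Flatten structured chat turns into the native API's single text
    field. Roles become mere text prefixes, which is why the model
    tracks conversational context poorly with this format."""
    lines = [f"{user_name if role == 'user' else char_name}: {text}"
             for role, text in history]
    lines.append(f"{char_name}:")  # trailing prompt for the next reply
    return "\n".join(lines)
```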
Model Comparison and Selection
I tested several models against these requirements: NSFW support, Japanese language, and character RP capability.
| Model | NSFW | Japanese | Notes |
|---|---|---|---|
| huihui-mistral-abliterated | OK | Fair | Chinese text leaking in, inconsistent speech patterns |
| MS3.2-24B-Magnum-Diamond | OK | Fair | Currently using. Tends toward formal/refined speech, verbose |
| PaintedFantasy-v4-24B | ? | Bad | Japanese completely broken (romanized output) |
| Gemma 27B uncensored | No | - | Complete refusal: “I don’t want to talk about that” |
| Cydonia-24B-v4.1 | No | - | NSFW refusal |
| Umievo-Gleipnir-7B | No | OK | Natural Japanese but refuses NSFW |
Even abliterated/uncensored Gemma models have limits. The training data itself has NSFW content stripped out, so even with censorship removed, the model simply “doesn’t know” the content. More on this in the Gemma section of the memory allocation article.
MS3.2-24B-Magnum-Diamond came out on top for NSFW support, conversational coherence, and passable Japanese. The Magnum family is trained for English creative writing and roleplay, which gives it strong NSFW tolerance.
Final Configuration
| Item | Setting |
|---|---|
| Model | MS3.2-24B-Magnum-Diamond (GGUF Q4_K_M) |
| Runtime | LM Studio |
| Endpoint | /v1/chat/completions (OpenAI-compatible) |
| temperature | 0.4 |
| max_tokens | 100 |
Confirmed Working
- No blocks on NSFW words
- Understands conversational context
- Follows character settings in conversation
Speed
- ~11 tokens/s (GPU inference)
- Short replies come back in a few seconds
Remaining Issues
- Speech pattern problem: Magnum tends to slip into a refined/formal feminine Japanese style. System prompt instructions to avoid this don’t fully work
- Verbosity problem: Even with max_tokens=100, responses sometimes run to 3-4 sentences
- Japanese + NSFW compatibility: Haven’t found a model that nails both perfectly. Casual Japanese conversation is underrepresented in training data, so speech patterns tend to sound unnatural
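For the verbosity problem, one crude client-side workaround (my own suggestion, not part of the setup above) is trimming replies to a sentence budget after the fact, since max_tokens only hard-truncates mid-sentence:

```python
import re

def clip_sentences(text, max_sentences=2):
    """Keep at most max_sentences, splitting on Japanese/English
    sentence-ending punctuation. A post-processing band-aid, not a
    fix for the model's verbosity."""
    parts = [p for p in re.split(r"(?<=[。!?.!?])\s*", text) if p]
    return "".join(parts[:max_sentences])
```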
Future candidates to test: Lumimaid-Magnum-v4-12B, Vecteus-v1, big-tiger-gemma-27b-v3-heretic-v2.
Ollama-Era Testing Notes (2026-02-04)
Notes from when I was still testing with Ollama before migrating to LM Studio.
Qwen 32B vs 72B character test:
| Item | 32B | 72B |
|---|---|---|
| JSON format | Ignored | ✓ |
| Response length | Too long | ✓ Appropriate |
| Conversation naturalness | Formulaic | ✓ Natural |
| Instruction following | Weak | ✓ Good |
| NSFW support | ✓ | ✓ |
72B had great quality but at ~1.5 t/s on Ollama’s CPU inference, it wasn’t practical. Even on an M1 Max 64GB, it topped out at 5.3 t/s.
After migrating to LM Studio’s GPU inference, the 24B model hit 11 t/s, eliminating the need for the 72B.