
Setting Up a Local LLM on the GMKtec EVO-X2 (Strix Halo)

Update (2026-03-01): At the time of testing, my AMD driver was outdated, which caused Vulkan memory management and GPU inference issues with some models. Updating the driver to 26.2.2 or later fixes these problems. See The Reason Qwen 3.5 Completely Failed on Radeon 8060S Was the AMD Driver for details.

I wanted a local LLM for character RP that could handle NSFW content. Cloud APIs randomly block NSFW words, and I wanted full control over character speech patterns. I happened to have a GMKtec EVO-X2 (Strix Halo mini PC) on hand, so I decided to set up an LLM server on it.


Hardware: GMKtec EVO-X2

| Item | Spec |
| --- | --- |
| CPU | AMD Ryzen AI Max+ 395 (16 cores / 32 threads) |
| RAM | 64GB LPDDR5X (unified memory) |
| iGPU | Radeon 8060S |

The key feature of Strix Halo is its unified memory architecture where CPU and GPU share the same memory pool. Same idea as Apple Silicon, and it works great for LLM inference.

Initial Troubles

Only 32GB of Memory Recognized

I should have had 64GB, but only 32GB was usable. Turns out the factory settings had reserved half the memory for the GPU.

Fix:

  1. Mash ESC during reboot to enter BIOS
  2. Advanced → GFX Configuration
  3. iGPU Configuration → UMA_SPECIFIED
  4. UMA Frame buffer size → Change to 2G
  5. F10 → Save & Reset

This brought usable memory up to 61.6GB. For details on VRAM allocation, see the memory allocation article.

Performance Mode Wasn’t Enabled

Running a 72B model only gave me 2 tokens/s. The power mode was set to Quiet/Balanced.

The EVO-X2 has a P-MODE button on the front (separate from the power button). Press it until you see the red meter icon (Performance mode).

Ollama vs LM Studio: LM Studio Is the Only Choice on Strix Halo

I tried Ollama first, but it doesn’t play nice with Strix Halo.

| Runtime | Load | GPU Inference | Notes |
| --- | --- | --- | --- |
| LM Studio | OK | OK | GPU inference works via Vulkan through shared memory |
| Ollama | Fail | - | Crashes on load (even without GPU) |

The reason comes down to the backend:

  • LM Studio: Vulkan backend → correctly detects the Strix Halo GPU
  • Ollama: ROCm backend → doesn’t support Strix Halo (gfx1151)

When Ollama can’t find a supported GPU, it falls back to CPU inference, but sometimes it crashes during loading before even getting there. You can try overriding the version with the HSA_OVERRIDE_GFX_VERSION environment variable, but it’s unreliable.
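The override attempt mentioned above looks like this. This is a sketch, not a recommendation: `11.0.0` asks ROCm to treat the gfx1151 iGPU as a gfx1100 (RDNA 3) part, and in practice it remained unreliable on Strix Halo.

```shell
# Ask ROCm to treat the unsupported gfx1151 as gfx1100 (RDNA 3).
# Unreliable on Strix Halo -- shown only for completeness.
HSA_OVERRIDE_GFX_VERSION=11.0.0 ollama serve
```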

Bottom line: there’s no reason to use Ollama on the EVO-X2 (Strix Halo).

LM Studio Setup

Installation and Server Startup

  1. Download and install the Windows version from lmstudio.ai
  2. Select “Local Server” from the left menu
  3. Load a model
  4. Click “Start Server” to launch the API server (default: localhost:1234)
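Once the server is running, you can confirm it responds with a short script. This is a minimal sketch using only the Python standard library; `/v1/models` and port 1234 are LM Studio's OpenAI-compatible defaults.

```python
import json
import urllib.request

# LM Studio's default OpenAI-compatible API address.
BASE_URL = "http://localhost:1234/v1"

def list_models(base_url: str = BASE_URL) -> list[str]:
    """Return the IDs of the models the server currently exposes."""
    with urllib.request.urlopen(f"{base_url}/models") as resp:
        data = json.load(resp)
    return [m["id"] for m in data.get("data", [])]

if __name__ == "__main__":
    # Prints the loaded model IDs if the server is up.
    print(list_models())
```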

External Access Configuration

Enable “Serve on Local Network” in LM Studio’s settings. This makes it listen on 0.0.0.0:1234, allowing connections via Tailscale.

Two Types of LM Studio API Endpoints

LM Studio has two types of API endpoints.

| Endpoint | Format | Purpose |
| --- | --- | --- |
| /v1/chat/completions | OpenAI-compatible | Recommended - conversation history is properly recognized |
| /api/v1/chat | Native API | Conversations are concatenated as text, losing context |
Example request to /v1/chat/completions:

```json
{
  "model": "ms3.2-24b-magnum-diamond",
  "messages": [
    {"role": "system", "content": "(system prompt)"},
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hey there!"},
    {"role": "user", "content": "How are you?"}
  ],
  "temperature": 0.4,
  "max_tokens": 100
}
```

Response:

```json
{
  "choices": [
    {
      "message": {
        "role": "assistant",
        "content": "Doing great!"
      }
    }
  ]
}
```
The native /api/v1/chat endpoint, by contrast, takes the whole conversation as one string:

```json
{
  "model": "...",
  "system_prompt": "...",
  "input": "User: Hello\nChar: Hey there\nUser: How are you?\nChar:"
}
```

Conversations are just concatenated as text, making it hard for the model to understand context. For character RP where conversational flow matters, use the OpenAI-compatible API.
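As a sketch of the recommended path, the OpenAI-compatible request shown above can be built and sent from Python with only the standard library. The model name and sampling settings mirror the example request; `chat()` assumes the server is reachable at localhost:1234.

```python
import json
import urllib.request

API_URL = "http://localhost:1234/v1/chat/completions"  # LM Studio default

def build_payload(system_prompt: str, history: list[dict], user_msg: str) -> dict:
    """Assemble an OpenAI-style chat request. Each turn stays a separate
    message, so the model sees real conversational structure instead of
    one concatenated string."""
    return {
        "model": "ms3.2-24b-magnum-diamond",
        "messages": [
            {"role": "system", "content": system_prompt},
            *history,
            {"role": "user", "content": user_msg},
        ],
        "temperature": 0.4,
        "max_tokens": 100,
    }

def chat(payload: dict) -> str:
    """POST the payload and return the assistant's reply text."""
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```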

Model Comparison and Selection

I tested several models against these requirements: NSFW support, Japanese language, and character RP capability.

| Model | NSFW | Japanese | Notes |
| --- | --- | --- | --- |
| huihui-mistral-abliterated | OK | Fair | Chinese text leaking in, inconsistent speech patterns |
| MS3.2-24B-Magnum-Diamond | OK | Fair | Currently using. Tends toward formal/refined speech, verbose |
| PaintedFantasy-v4-24B | ? | Bad | Japanese completely broken (romanized output) |
| Gemma 27B uncensored | No | - | Complete refusal: “I don’t want to talk about that” |
| Cydonia-24B-v4.1 | No | - | NSFW refusal |
| Umievo-Gleipnir-7B | No | OK | Natural Japanese but refuses NSFW |

Even abliterated/uncensored Gemma models have limits. The training data itself has NSFW content stripped out, so even with censorship removed, the model simply “doesn’t know” the content. More on this in the Gemma section of the memory allocation article.

MS3.2-24B-Magnum-Diamond came out on top for NSFW support, conversational coherence, and passable Japanese. The Magnum family is trained for English creative writing and roleplay, which gives it strong NSFW tolerance.

Final Configuration

| Item | Setting |
| --- | --- |
| Model | MS3.2-24B-Magnum-Diamond (GGUF Q4_K_M) |
| Runtime | LM Studio |
| Endpoint | /v1/chat/completions (OpenAI-compatible) |
| temperature | 0.4 |
| max_tokens | 100 |

Confirmed Working

  • No blocks on NSFW words
  • Understands conversational context
  • Follows character settings in conversation

Speed

  • ~11 tokens/s (GPU inference)
  • Short replies come back in a few seconds

Remaining Issues

  1. Speech pattern problem: Magnum tends to slip into a refined/formal feminine Japanese style. System prompt instructions to avoid this don’t fully work
  2. Verbosity problem: Even with max_tokens=100, responses sometimes run to 3-4 sentences
  3. Japanese + NSFW compatibility: Haven’t found a model that nails both perfectly. Casual Japanese conversation is underrepresented in training data, so speech patterns tend to sound unnatural

Future candidates to test: Lumimaid-Magnum-v4-12B, Vecteus-v1, big-tiger-gemma-27b-v3-heretic-v2.

Ollama-era testing notes (2026/02/04)

Notes from when I was still testing with Ollama before migrating to LM Studio.

Qwen 32B vs 72B character test:

| Item | 32B | 72B |
| --- | --- | --- |
| JSON format | Ignored | |
| Response length | Too long | ✓ Appropriate |
| Conversation naturalness | Formulaic | ✓ Natural |
| Instruction following | Weak | ✓ Good |
| NSFW support | | |

72B had great quality but at ~1.5 t/s on Ollama’s CPU inference, it wasn’t practical. Even on an M1 Max 64GB, it topped out at 5.3 t/s.

After migrating to LM Studio’s GPU inference, the 24B model hit 11 t/s, eliminating the need for the 72B.