
Running Lemonade on Strix Halo (EVO-X2): Vulkan Shared Memory Leaks and ROCm Stability

In my previous overview article, I covered AMD’s official local AI server Lemonade based on HN comments and official docs. I’ve now actually run it on my EVO-X2 (Ryzen AI Max+ 395 / Radeon 8060S), and here are the results.

Test Environment

| Item | Value |
|---|---|
| PC | GMKtec EVO-X2 |
| CPU/GPU | Ryzen AI Max+ 395 / Radeon 8060S (gfx1151) |
| Memory | 64GB UMA |
| Lemonade | v10.0.1 (bundled llama.cpp b8460) |
| Driver | AMD Software 26.3.1 |

BIOS settings were toggled between 32GB/32GB and 48GB/16GB depending on the test. What these allocations mean is detailed in my VRAM allocation article, and as noted in the Vulkan regression article, the BIOS allocation directly impacts performance under AMD driver 26.3.1.

As a baseline, llama-server b8183 direct (no Lemonade) running Qwen3.5-35B-A3B Q6_K (26.55GB) was previously benchmarked at 53.6 t/s with ctx=65536 and q8_0 KV cache.

Overhead Through Lemonade

The first question: does routing through Lemonade slow things down?

| Setup | Model | ctx | Speed |
|---|---|---|---|
| Lemonade v10.0.1 (b8460) | Q4_K_XL (21.2GB) | 4096 | 57.9 t/s |
| llama-server b8183 direct | Q6_K (26.55GB) | 4096 | 53.7 t/s |
| llama-server b8183 direct | Q4_K_M (19.7GB) | 4096 | 54.9 t/s |

Lemonade appears faster, but this isn’t a fair comparison since the quantization differs. lemonade pull delivers Q4_K_XL (21.2GB) by default, while my direct benchmarks used Q6_K (26.55GB). Different quantization methods produce different model sizes and inference characteristics.

Lemonade’s internal architecture is straightforward:

  • lemonade-router.exe (port 8000: Web UI, port 9000: API proxy)
  • llama-server (port 8001: inference engine, b8460)

The actual inference runs on llama-server; the router is just a proxy. The developer’s claim of “equivalent to raw llama.cpp” holds up. Overhead is effectively zero.
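Because the router is a pass-through, you can also query the bundled llama-server directly via llama.cpp's standard OpenAI-compatible endpoint. A hedged sketch (local only, since port 8001 binds 127.0.0.1; the `model` value is a placeholder):

```shell
# The router just proxies, so the bundled llama-server on port 8001 answers
# llama.cpp's OpenAI-compatible chat endpoint directly.
# (The "model" field below is a placeholder, not a required exact name.)
payload='{
  "model": "Qwen3.5-35B-A3B",
  "messages": [{"role": "user", "content": "Hello"}],
  "max_tokens": 64
}'
curl -s http://127.0.0.1:8001/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "$payload" \
  || echo "llama-server is not running on 8001"
```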

Custom arguments can be passed through --llamacpp-args:

```shell
lemonade run Qwen3.5-35B-A3B-GGUF --ctx-size 65536 \
  --llamacpp-args "--cache-type-k q8_0 --cache-type-v q8_0 --no-mmap --reasoning-budget 0"
```

This yielded 63.5 t/s with ctx=65536 and q8_0 KV. You can carry over your existing llama-server configuration as-is.

NPU Hybrid Execution

The overview article mentioned “NPU can offload prefill.” I tested this with an actual Hybrid model.

Running lemonade run Qwen3-8B-Hybrid triggered an automatic download of RyzenAI-Server v1.7.0 (542MB).

| Item | Value |
|---|---|
| Backend | ryzenai-llm (FastFlowLM) |
| Model | Qwen3-8B (AWQ quantized ONNX) |
| TTFT | 0.44s |
| Generation speed | 24.0 t/s (300 tokens) |

For comparison, the 35B model on Vulkan runs at 57.9 t/s, so NPU Hybrid’s 8B model is less than half that speed. However, the 0.44s TTFT is notably fast. Considering that prefill on Vulkan for large models takes several seconds, NPU Hybrid has a clear advantage for low-latency use cases with smaller models. Since the NPU doesn’t consume GPU VRAM, a practical split would be running a large LLM on the GPU while keeping smaller speech models resident on the NPU.
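A back-of-envelope latency model makes the crossover concrete: total time is roughly TTFT + tokens / generation speed. The NPU figures below are measured above, but the 3-second GPU prefill is an assumption standing in for the "several seconds" of Vulkan prefill, so treat the result as a rough estimate:

```shell
# Back-of-envelope: total latency = TTFT + tokens / throughput.
# npu_* values are measured; gpu_ttft = 3.0 s is an ASSUMPTION for the
# 35B model's Vulkan prefill ("several seconds" in the text).
crossover=$(awk 'BEGIN {
  npu_ttft = 0.44; npu_tps = 24.0   # NPU Hybrid, Qwen3-8B (measured)
  gpu_ttft = 3.00; gpu_tps = 57.9   # Vulkan 35B (TTFT assumed)
  print int((gpu_ttft - npu_ttft) / (1 / npu_tps - 1 / gpu_tps))
}')
echo "NPU Hybrid finishes first for replies under ~${crossover} tokens"
```

Under these assumptions the NPU wins for short replies (roughly a hundred tokens), which matches the "low-latency use cases" framing above.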

Note: A Hybrid version of Qwen3.5-35B-A3B wasn’t listed. NPU support is limited to smaller models (8B and under).

Vulkan vs ROCm: Discovering Shared Memory Leaks

This was the biggest finding of this testing session.

With BIOS set to 48GB VRAM / 16GB system, I compared Vulkan (mmap) against ROCm:

| Backend | Speed | Dedicated GPU | Shared Memory | System RAM remaining |
|---|---|---|---|---|
| Vulkan (mmap) | 46 t/s | 16.1 GB | 7.1 GB | 0.8 GB |
| ROCm | 37.8 t/s | 21.6 GB | 3.3 GB | 6.7 GB |

Vulkan is about 22% faster, but it's leaking 7.1GB into shared memory. About 5GB of the Q4_K_XL (21.2GB) model is being placed in system RAM instead of dedicated GPU memory. With the 48/16 BIOS split, that 7.1GB comes out of the 16GB of system RAM, leaving only 0.8GB for the OS + Parsec + Tailscale. That's dangerously low.
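The arithmetic behind those numbers, as a small sketch (all figures from the table above; the remaining headroom before OS/Parsec/Tailscale usage is derived, not measured):

```shell
# Where the Q4_K_XL model ends up under 48/16 with Vulkan + mmap.
model_gb=21.2       # GGUF file size
dedicated_gb=16.1   # dedicated GPU memory in use
shared_gb=7.1       # GPU shared (system) memory in use
system_gb=16.0      # system RAM side of the 48/16 BIOS split

spill=$(awk -v m="$model_gb" -v d="$dedicated_gb" 'BEGIN { printf "%.1f", m - d }')
left=$(awk -v s="$system_gb" -v sh="$shared_gb" 'BEGIN { printf "%.1f", s - sh }')
echo "model weight spilled into system RAM: ${spill} GB"
echo "system RAM left after the GPU's shared slice: ${left} GB"
```

That ~8.9GB of nominal headroom is what the OS, Parsec, and Tailscale then eat down to the observed 0.8GB.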

ROCm cuts the shared memory leak in half to 3.3GB, leaving 6.7GB of headroom in system RAM. Speed drops, but memory health is dramatically better.

This shared memory leak is specific to AMD driver 26.3.1 + Vulkan. It doesn’t occur with the ROCm backend. Image generation (sd-cpp / ROCm) later confirmed shared memory dropping to just 0.9GB, providing further evidence that the issue lies in Vulkan’s memory placement logic.

4-Model Simultaneous Startup

I tested Lemonade’s headline feature: multi-modality simultaneous operation. Four models running concurrently on 48/16 BIOS with the ROCm backend:

| Model | Backend | Port |
|---|---|---|
| Qwen3.5-35B-A3B Q4_K_XL | llama.cpp ROCm | 8001 |
| Whisper-Small | whisper.cpp + NPU | 8002 |
| Kokoro-v1 TTS | kokoro (ONNX) | 8003 |
| Flux-2-Klein-4B | sd-cpp ROCm | 8004 |

| Metric | Value |
|---|---|
| Dedicated GPU | 37.6 GB / 48 GB (10.4 GB remaining) |
| Shared GPU | 3.4 GB |
| System RAM | 67% (approx. 5.3 GB remaining) |
| LLM speed | 37.7 t/s |

10GB of VRAM to spare, 5GB of system RAM to spare. LLM speed barely dropped from single-model operation (37.8 t/s). Whisper + Kokoro combined only consume about +0.3GB, so their impact is minimal.

Running LLM + image generation + speech recognition + TTS on a single server simultaneously delivered on the “biggest differentiator from Ollama” I described in the overview article.

mmap and BIOS Allocation Tradeoffs

mmap (the default) was essential for 4-model simultaneous startup. With --no-mmap, the entire file is loaded into system RAM at startup, which overflows the 16GB in a 48/16 configuration and causes ctx=65536 loading to fail. mmap loads pages on demand, keeping peak memory usage lower.

However, mmap carries a speed penalty:

| Setup | ctx | Speed |
|---|---|---|
| 32/32, --no-mmap, Q6_K | 65536 | 53.6 t/s |
| 48/16, mmap, Q4_K_XL | 65536 | 46.3 t/s |
| 48/16, --no-mmap, Q4_K_XL | 4096 | 57.8 t/s |

mmap runs at 46 t/s vs 57 t/s with --no-mmap. The difference likely comes from shared memory access patterns on UMA.

| BIOS | Best for | Constraints |
|---|---|---|
| 48/16 | Multi-modality simultaneous operation | ctx=65536 requires mmap (speed penalty); --no-mmap limits to ctx=4096 |
| 32/32 | Single LLM with long context | Limited VRAM headroom makes multi-modality difficult |

Image Generation: Flux-2-Klein-4B

After stopping all LLM/Whisper/Kokoro models, I ran Flux Klein 4B standalone.

| Item | Value |
|---|---|
| Backend | sd-cpp (ROCm), automatically selected over Vulkan |
| Model size | 15.4 GB (text encoder 7.7GB + diffusion 7.4GB + VAE 0.3GB) |
| Dedicated GPU | 16.8 GB |
| Shared memory | 0.9 GB |
| Generation time | Approx. 2 min (512x512, 1 image) |

With the LLM (Vulkan) unloaded, shared memory dropped from 7.1GB to 0.9GB. sd-cpp’s ROCm backend correctly places the model in dedicated GPU memory. This confirms the shared memory leak is Vulkan + driver 26.3.1 specific.

Results:

Cat image generated with Flux-2-Klein-4B. A tabby cat sitting by a window, photorealistic quality

Anime-style illustration generated with Flux-2-Klein-4B. A black-haired girl standing under cherry blossoms

Photorealistic and anime styles both rendered without issues at 512x512. However, at approximately 2 minutes per image, it’s not particularly fast. Performance is comparable to running ComfyUI on an RTX 4060 Laptop. For batch generation or rapid iteration, it’s not practical yet.

Roadblocks

Router proxy not functioning

Lemonade’s router (lemonade-router.exe) is designed to provide an API proxy on port 9000, but it didn’t respond to chat completion requests at all.

| Endpoint | Bind | Local | Tailscale |
|---|---|---|---|
| router (9000) | 0.0.0.0 | No response | No response |
| llama-server (8001) | 127.0.0.1 | Works | Inaccessible |

llama-server (8001) binds to 127.0.0.1 so it’s inaccessible from outside, and router (9000) doesn’t respond. Same behavior after a clean 48/16 reboot, unrelated to any previous llama-server sessions. It’s a router bug.

I already run a Tailscale-based API exposure setup, but Lemonade alone can’t support this. Direct llama-server with --host 0.0.0.0 is more reliable.
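For reference, a direct launch along these lines (the model path is a placeholder; the flags mirror the configuration benchmarked elsewhere in this article) makes the API reachable from Tailscale peers:

```shell
# Direct llama-server, bound to all interfaces so Tailscale peers can reach it.
# The model path is a placeholder; the context/KV-cache flags match the
# settings used in the benchmarks above.
llama-server \
  -m /path/to/Qwen3.5-35B-A3B-Q4_K_XL.gguf \
  --host 0.0.0.0 --port 8001 \
  --ctx-size 65536 \
  --cache-type-k q8_0 --cache-type-v q8_0
```

Binding 0.0.0.0 exposes the server on every interface, so restrict access at the firewall or Tailscale ACL level if the machine has other network exposure.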

recipes command returns 500 errors

lemonade recipes consistently returned 500 errors. The logs revealed the cause:

```
Backend availability:
  - NPU hardware: Yes
  - System RAM: 64.0 GB (max model size: 51.2 GB)
```

No GPU detected. Only NPU and RAM are listed; the Radeon 8060S (Vulkan GPU) doesn’t appear in Backend availability.

Strix Halo is a UMA APU with an integrated GPU, so it likely fails Lemonade’s discrete GPU detection logic. Inference works because llama.cpp directly accesses the Vulkan device, but Lemonade’s management layer doesn’t recognize the GPU. Discrete GPUs like the RX 7900 XTX probably work fine.

Switching to 48/16 BIOS caused even NPU detection to fail, making things worse. NPU showed as “Yes” on 32/32, so BIOS allocation affects backend detection too.

No model quantization choice with pull

lemonade pull Qwen3.5-35B-A3B-GGUF downloads Q4_K_XL only. There’s no way to specify Q6_K or Q8_0. Using custom GGUFs requires specifying a custom path, but the procedure isn’t clearly documented.

Vulkan vs ROCm: Which One to Choose?

| | Vulkan | ROCm |
|---|---|---|
| LLM speed | 46 t/s | 37.8 t/s |
| Shared memory leak | 7.1 GB | 3.3 GB |
| System RAM remaining (4 models) | Critical (0.8 GB) | 5.3 GB |
| Multi-modality | Memory-limited | Stable |
| Image generation | Not tested | ROCm auto-selected, works correctly |

For API operation + multi-modality, ROCm is the right choice. Speed drops 18%, but memory health is dramatically better and 4-model simultaneous operation is stable. If single-LLM speed is the top priority, Vulkan + 32/32 BIOS + direct llama-server is faster.

Lemonade’s value lies in its “all-in-one” convenience. If you’re leveraging multi-modality, ROCm is the only viable option.


At v10.0.1, Strix Halo UMA-specific issues with GPU detection and router are noticeable, but getting stable 4-model simultaneous operation on ROCm was a solid result.