
Running Lemonade on Strix Halo (EVO-X2): Vulkan Shared Memory Leaks and ROCm Stability

In my previous overview article, I covered AMD’s official local AI server Lemonade based on HN comments and official docs. I’ve now actually run it on my EVO-X2 (Ryzen AI Max+ 395 / Radeon 8060S), and here are the results.

Test Environment

| Item | Value |
|---|---|
| PC | GMKtec EVO-X2 |
| CPU/GPU | Ryzen AI Max+ 395 / Radeon 8060S (gfx1151) |
| Memory | 64GB UMA |
| Lemonade | v10.0.1 (bundled llama.cpp b8460) |
| Driver | AMD Software 26.3.1 |

BIOS settings were toggled between 32GB/32GB and 48GB/16GB depending on the test. What these allocations mean is detailed in my VRAM allocation article, and as noted in the Vulkan regression article, the BIOS allocation directly impacts performance under AMD driver 26.3.1.

As a baseline, llama-server b8183 direct (no Lemonade) running Qwen3.5-35B-A3B Q6_K (26.55GB) was previously benchmarked at 53.6 t/s with ctx=65536 and q8_0 KV cache.

Overhead Through Lemonade

The first question: does routing through Lemonade slow things down?

| Setup | Model | ctx | Speed |
|---|---|---|---|
| Lemonade v10.0.1 (b8460) | Q4_K_XL (21.2GB) | 4096 | 57.9 t/s |
| llama-server b8183 direct | Q6_K (26.55GB) | 4096 | 53.7 t/s |
| llama-server b8183 direct | Q4_K_M (19.7GB) | 4096 | 54.9 t/s |

Lemonade appears faster, but this isn’t a fair comparison since the quantization differs. lemonade pull delivers Q4_K_XL (21.2GB) by default, while my direct benchmarks used Q6_K (26.55GB). Different quantization methods produce different model sizes and inference characteristics.

Lemonade’s internal architecture is straightforward:

  • lemonade-router.exe (port 8000: Web UI, port 9000: API proxy)
  • llama-server (port 8001: inference engine, b8460)

The actual inference runs on llama-server; the router is just a proxy. The developer’s claim of “equivalent to raw llama.cpp” holds up. Overhead is effectively zero.
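Because the router is a pass-through, you can also query the bundled llama-server directly via llama.cpp's standard OpenAI-compatible endpoint. A hedged sketch (local only, since port 8001 binds 127.0.0.1; the `model` value is a placeholder):

```shell
# The router just proxies, so the bundled llama-server on port 8001 answers
# llama.cpp's OpenAI-compatible chat endpoint directly.
# (The "model" field below is a placeholder, not a required exact name.)
payload='{
  "model": "Qwen3.5-35B-A3B",
  "messages": [{"role": "user", "content": "Hello"}],
  "max_tokens": 64
}'
curl -s http://127.0.0.1:8001/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "$payload" \
  || echo "llama-server is not running on 8001"
```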

Custom arguments can be passed through --llamacpp-args:

```shell
lemonade run Qwen3.5-35B-A3B-GGUF --ctx-size 65536 \
  --llamacpp-args "--cache-type-k q8_0 --cache-type-v q8_0 --no-mmap --reasoning-budget 0"
```

This yielded 63.5 t/s with ctx=65536 and q8_0 KV. You can carry over your existing llama-server configuration as-is.

NPU Hybrid Execution

The overview article mentioned “NPU can offload prefill.” I tested this with an actual Hybrid model.

Running lemonade run Qwen3-8B-Hybrid triggered an automatic download of RyzenAI-Server v1.7.0 (542MB).

| Item | Value |
|---|---|
| Backend | ryzenai-llm (FastFlowLM) |
| Model | Qwen3-8B (AWQ quantized ONNX) |
| TTFT | 0.44s |
| Generation speed | 24.0 t/s (300 tokens) |

For comparison, the 35B model on Vulkan runs at 57.9 t/s, so NPU Hybrid’s 8B model is less than half that speed. However, the 0.44s TTFT is notably fast. Considering that prefill on Vulkan for large models takes several seconds, NPU Hybrid has a clear advantage for low-latency use cases with smaller models. Since the NPU doesn’t consume GPU VRAM, a practical split would be running a large LLM on the GPU while keeping smaller speech models resident on the NPU.
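A back-of-envelope latency model makes the crossover concrete: total time is roughly TTFT + tokens / generation speed. The NPU figures below are measured above, but the 3-second GPU prefill is an assumption standing in for the "several seconds" of Vulkan prefill, so treat the result as a rough estimate:

```shell
# Back-of-envelope: total latency = TTFT + tokens / throughput.
# npu_* values are measured; gpu_ttft = 3.0 s is an ASSUMPTION for the
# 35B model's Vulkan prefill ("several seconds" in the text).
crossover=$(awk 'BEGIN {
  npu_ttft = 0.44; npu_tps = 24.0   # NPU Hybrid, Qwen3-8B (measured)
  gpu_ttft = 3.00; gpu_tps = 57.9   # Vulkan 35B (TTFT assumed)
  print int((gpu_ttft - npu_ttft) / (1 / npu_tps - 1 / gpu_tps))
}')
echo "NPU Hybrid finishes first for replies under ~${crossover} tokens"
```

Under these assumptions the NPU wins for short replies (roughly a hundred tokens), which matches the "low-latency use cases" framing above.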

Note: A Hybrid version of Qwen3.5-35B-A3B wasn’t listed. NPU support is limited to smaller models (8B and under).

Vulkan vs ROCm: Discovering Shared Memory Leaks

This was the biggest finding of this testing session.

With BIOS set to 48GB VRAM / 16GB system, I compared Vulkan (mmap) against ROCm:

| Backend | Speed | Dedicated GPU | Shared Memory | System RAM remaining |
|---|---|---|---|---|
| Vulkan (mmap) | 46 t/s | 16.1 GB | 7.1 GB | 0.8 GB |
| ROCm | 37.8 t/s | 21.6 GB | 3.3 GB | 6.7 GB |

Vulkan is about 22% faster, but it's leaking 7.1GB into shared memory. About 5GB of the Q4_K_XL (21.2GB) model is being placed in system RAM instead of dedicated GPU memory. With the 48/16 BIOS split, that 7.1GB comes out of the 16GB of system RAM, leaving only 0.8GB for the OS + Parsec + Tailscale. That's dangerously low.
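The arithmetic behind those numbers, as a small sketch (all figures from the table above; the remaining headroom before OS/Parsec/Tailscale usage is derived, not measured):

```shell
# Where the Q4_K_XL model ends up under 48/16 with Vulkan + mmap.
model_gb=21.2       # GGUF file size
dedicated_gb=16.1   # dedicated GPU memory in use
shared_gb=7.1       # GPU shared (system) memory in use
system_gb=16.0      # system RAM side of the 48/16 BIOS split

spill=$(awk -v m="$model_gb" -v d="$dedicated_gb" 'BEGIN { printf "%.1f", m - d }')
left=$(awk -v s="$system_gb" -v sh="$shared_gb" 'BEGIN { printf "%.1f", s - sh }')
echo "model weight spilled into system RAM: ${spill} GB"
echo "system RAM left after the GPU's shared slice: ${left} GB"
```

That ~8.9GB of nominal headroom is what the OS, Parsec, and Tailscale then eat down to the observed 0.8GB.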

ROCm cuts the shared memory leak in half to 3.3GB, leaving 6.7GB of headroom in system RAM. Speed drops, but memory health is dramatically better.

This shared memory leak is specific to AMD driver 26.3.1 + Vulkan. It doesn’t occur with the ROCm backend. Image generation (sd-cpp / ROCm) later confirmed shared memory dropping to just 0.9GB, providing further evidence that the issue lies in Vulkan’s memory placement logic.

4-Model Simultaneous Startup

I tested Lemonade’s headline feature: multi-modality simultaneous operation. Four models running concurrently on 48/16 BIOS with the ROCm backend:

| Model | Backend | Port |
|---|---|---|
| Qwen3.5-35B-A3B Q4_K_XL | llama.cpp ROCm | 8001 |
| Whisper-Small | whisper.cpp + NPU | 8002 |
| Kokoro-v1 TTS | kokoro (ONNX) | 8003 |
| Flux-2-Klein-4B | sd-cpp ROCm | 8004 |

| Metric | Value |
|---|---|
| Dedicated GPU | 37.6 GB / 48 GB (10.4 GB remaining) |
| Shared GPU | 3.4 GB |
| System RAM | 67% (approx. 5.3 GB remaining) |
| LLM speed | 37.7 t/s |

10GB of VRAM to spare, 5GB of system RAM to spare. LLM speed barely dropped from single-model operation (37.8 t/s). Whisper + Kokoro combined only consume about +0.3GB, so their impact is minimal.

Running LLM + image generation + speech recognition + TTS on a single server simultaneously delivered on the “biggest differentiator from Ollama” I described in the overview article.

mmap and BIOS Allocation Tradeoffs

mmap (the default) was essential for 4-model simultaneous startup. With --no-mmap, the entire file is loaded into system RAM at startup, which overflows the 16GB in a 48/16 configuration and causes ctx=65536 loading to fail. mmap loads pages on demand, keeping peak memory usage lower.

However, mmap carries a speed penalty:

| Setup | ctx | Speed |
|---|---|---|
| 32/32, --no-mmap, Q6_K | 65536 | 53.6 t/s |
| 48/16, mmap, Q4_K_XL | 65536 | 46.3 t/s |
| 48/16, --no-mmap, Q4_K_XL | 4096 | 57.8 t/s |

mmap runs at 46 t/s vs 57 t/s with --no-mmap. The difference likely comes from shared memory access patterns on UMA.

| BIOS | Best for | Constraints |
|---|---|---|
| 48/16 | Multi-modality simultaneous operation | ctx=65536 requires mmap (speed penalty); --no-mmap limits to ctx=4096 |
| 32/32 | Single LLM with long context | Limited VRAM headroom makes multi-modality difficult |

Image Generation: Flux-2-Klein-4B

After stopping all LLM/Whisper/Kokoro models, I ran Flux Klein 4B standalone.

| Item | Value |
|---|---|
| Backend | sd-cpp (ROCm), automatically selected over Vulkan |
| Model size | 15.4 GB (text encoder 7.7GB + diffusion 7.4GB + VAE 0.3GB) |
| Dedicated GPU | 16.8 GB |
| Shared memory | 0.9 GB |
| Generation time | Approx. 2 min (512x512, 1 image) |

With the LLM (Vulkan) unloaded, shared memory dropped from 7.1GB to 0.9GB. sd-cpp’s ROCm backend correctly places the model in dedicated GPU memory. This confirms the shared memory leak is Vulkan + driver 26.3.1 specific.

Results:

Cat image generated with Flux-2-Klein-4B. A tabby cat sitting by a window, photorealistic quality

Anime-style illustration generated with Flux-2-Klein-4B. A black-haired girl standing under cherry blossoms

Photorealistic and anime styles both rendered without issues at 512x512. However, at approximately 2 minutes per image, it’s not particularly fast. Performance is comparable to running ComfyUI on an RTX 4060 Laptop. For batch generation or rapid iteration, it’s not practical yet.

Roadblocks

Router proxy not functioning

Lemonade’s router (lemonade-router.exe) is designed to provide an API proxy on port 9000, but it didn’t respond to chat completion requests at all.

| Endpoint | Bind | Local | Tailscale |
|---|---|---|---|
| router (9000) | 0.0.0.0 | No response | No response |
| llama-server (8001) | 127.0.0.1 | Works | Inaccessible |

llama-server (8001) binds to 127.0.0.1 so it’s inaccessible from outside, and router (9000) doesn’t respond. Same behavior after a clean 48/16 reboot, unrelated to any previous llama-server sessions. It’s a router bug.

I already run a Tailscale-based API exposure setup, but Lemonade alone can’t support this. Direct llama-server with --host 0.0.0.0 is more reliable.
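For reference, a direct launch along these lines (the model path is a placeholder; the flags mirror the configuration benchmarked elsewhere in this article) makes the API reachable from Tailscale peers:

```shell
# Direct llama-server, bound to all interfaces so Tailscale peers can reach it.
# The model path is a placeholder; the context/KV-cache flags match the
# settings used in the benchmarks above.
llama-server \
  -m /path/to/Qwen3.5-35B-A3B-Q4_K_XL.gguf \
  --host 0.0.0.0 --port 8001 \
  --ctx-size 65536 \
  --cache-type-k q8_0 --cache-type-v q8_0
```

Binding 0.0.0.0 exposes the server on every interface, so restrict access at the firewall or Tailscale ACL level if the machine has other network exposure.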

recipes command returns 500 errors

lemonade recipes consistently returned 500 errors. The logs revealed the cause:

```
Backend availability:
  - NPU hardware: Yes
  - System RAM: 64.0 GB (max model size: 51.2 GB)
```

No GPU detected. Only NPU and RAM are listed; the Radeon 8060S (Vulkan GPU) doesn’t appear in Backend availability.

Strix Halo is a UMA APU with an integrated GPU, so it likely fails Lemonade’s discrete GPU detection logic. Inference works because llama.cpp directly accesses the Vulkan device, but Lemonade’s management layer doesn’t recognize the GPU. Discrete GPUs like the RX 7900 XTX probably work fine.

Switching to 48/16 BIOS caused even NPU detection to fail, making things worse. NPU showed as “Yes” on 32/32, so BIOS allocation affects backend detection too.

No model quantization choice with pull

lemonade pull Qwen3.5-35B-A3B-GGUF downloads Q4_K_XL only. There’s no way to specify Q6_K or Q8_0. Using custom GGUFs requires specifying a custom path, but the procedure isn’t clearly documented.

Vulkan vs ROCm: Which One to Choose?

| | Vulkan | ROCm |
|---|---|---|
| LLM speed | 46 t/s | 37.8 t/s |
| Shared memory leak | 7.1 GB | 3.3 GB |
| System RAM remaining (4 models) | Critical (0.8 GB) | 5.3 GB |
| Multi-modality | Memory-limited | Stable |
| Image generation | Not tested | ROCm auto-selected, works correctly |

For API operation + multi-modality, ROCm is the right choice. Speed drops 18%, but memory health is dramatically better and 4-model simultaneous operation is stable. If single-LLM speed is the top priority, Vulkan + 32/32 BIOS + direct llama-server is faster.

Lemonade’s value lies in its “all-in-one” convenience. If you’re leveraging multi-modality, ROCm is the only viable option.


At v10.0.1, Strix Halo UMA-specific issues with GPU detection and router are noticeable, but getting stable 4-model simultaneous operation on ROCm was a solid result.