AMD's Lemonade Local AI Server Bundles GPU, NPU, and Multi-Modal Inference Under One Roof
AMD has released Lemonade (official site, GitHub), a local AI server. It bundles not just LLM inference but also image generation (Stable Diffusion), speech recognition (Whisper), and text-to-speech (Kokoro TTS) into a single server, all accessible through an OpenAI-compatible API. The project gathered 518 points and 107 comments on Hacker News, with particularly positive reactions from Strix Halo users saying “local inference on AMD hardware finally works properly.”
I’ve been building a local LLM setup on my EVO-X2 (Strix Halo) and fighting with driver issues, VRAM allocation pitfalls, and Vulkan regressions ever since. Lemonade is positioned to absorb exactly this kind of hassle.
Where Lemonade Fits In
Lemonade is not an inference engine itself. Under the hood, it orchestrates existing backends like llama.cpp, FastFlowLM, whisper.cpp, stable-diffusion.cpp, and Kokoro, automatically selecting the optimal configuration for your hardware. It’s similar to LM Studio or Ollama, but differs in that it unifies modalities beyond text generation (images and audio).
In the HN comments, developer sawansri explained that "Lemonade is designed as a turnkey (optimized for AMD Hardware) for local AI models" — in other words, raw performance matches vanilla llama.cpp, and the value is drastically reduced setup friction.
The integration ecosystem is broad too. It works with VS Code Copilot, Open WebUI, n8n, Dify, OpenHands, Continue, GitHub Copilot, and more, connecting existing tools through OpenAI/Ollama/Anthropic-compatible endpoints.
Supported Backends and Hardware
For text generation alone, the following backends are available:
| Backend | API | Supported Devices | OS |
|---|---|---|---|
| llama.cpp (Vulkan) | Generic GPU | AMD iGPU/dGPU, x86_64 CPU | Windows, Linux |
| llama.cpp (ROCm) | ROCm | RDNA3/RDNA4/Strix Halo | Windows, Linux |
| llama.cpp (Metal) | Metal | Apple Silicon GPU | macOS (beta) |
| llama.cpp (CPU) | CPU instructions | x86_64 | Windows, Linux |
| FastFlowLM (FLM) | NPU | XDNA2 NPU | Windows, Linux |
| ryzenai-llm | NPU | XDNA2 NPU | Windows |
On top of these, whisper.cpp (speech recognition), stable-diffusion.cpp (image generation), and Kokoro (text-to-speech) each run as independent backends. Running the lemonade recipes command lists all available combinations for your machine.
ROCm-compatible GPUs:
| Architecture | Example GPU Models |
|---|---|
| gfx1151 (Strix Halo) | Ryzen AI MAX+ Pro 395 |
| gfx120X (RDNA4) | Radeon AI PRO R9700, RX 9070 XT/9070, RX 9060 XT |
| gfx110X (RDNA3) | Radeon PRO W7900/W7800, RX 7900 XTX/XT/GRE, RX 7800 XT |
The Reality of NPU
The most heated debate in the HN comments was around the usefulness of the NPU (Neural Processing Unit). As things stand, the NPU is too underpowered to serve as the primary inference device; it is better positioned for low-power, small-model workloads.
The XDNA2-architecture NPU in AMD’s Ryzen AI series claims up to 60 TOPS (per AMD’s official figures). For perspective, even an entry-level RTX 3050 exceeds the 40 TOPS that Microsoft requires for Copilot+ PC certification.
cpburns2009:

> The NPU is entirely useless for the Framework Desktop, and really all Strix Halo devices. Where it could be useful is cell phones.
Where NPU Actually Helps
That said, it’s not completely useless.
One use case is continuously running small models such as Qwen3’s TTS model (under 2B parameters) or Whisper’s speech recognition models (under 1B). Offloading this audio work to the NPU frees up GPU VRAM for LLM inference.
Another is prefill offloading (initial processing of prompts). Prefill for long contexts is computationally expensive and can hit GPU power limits. Lemonade offers a “hybrid execution” mode where the NPU handles part of the prefill while the GPU focuses on token generation. According to AMD’s official technical article, hybrid NPU+iGPU execution on the Ryzen AI 300 series optimizes power efficiency.
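A rough back-of-envelope calculation illustrates why this split makes sense: prefill is compute-bound (roughly 2 FLOPs per parameter per prompt token), while decode is memory-bandwidth-bound. The parameter count, bandwidth, and quantization figures below are illustrative assumptions, not Lemonade measurements:

```python
# Back-of-envelope: why prefill suits an NPU offload while decode stays
# on the GPU. Assumed numbers (illustrative only): a 3B-active-parameter
# model, a 60 TOPS NPU, 256 GB/s memory bandwidth.

def prefill_flops(params: float, prompt_tokens: int) -> float:
    """Prefill is compute-bound: ~2 * params FLOPs per prompt token."""
    return 2 * params * prompt_tokens

def decode_tokens_per_s(bandwidth_bytes: float, weight_bytes: float) -> float:
    """Decode is memory-bound: each generated token re-reads the active weights."""
    return bandwidth_bytes / weight_bytes

PARAMS = 3e9        # assumed active parameters
PROMPT = 32_000     # a long-context prompt
NPU_TOPS = 60e12    # claimed XDNA2 peak

# A 32k-token prefill costs ~192 TFLOPs: seconds of pure compute, worth
# moving off the GPU so it can keep generating tokens in parallel.
seconds_on_npu = prefill_flops(PARAMS, PROMPT) / NPU_TOPS
print(f"prefill: {seconds_on_npu:.1f} s at NPU peak")

# Decode ceiling at ~Q4 (assumed ~0.6 byte/param): bandwidth, not TOPS,
# sets the generation-speed limit.
tps = decode_tokens_per_s(256e9, PARAMS * 0.6)
print(f"decode ceiling: {tps:.0f} t/s")
```

The asymmetry is the whole argument for hybrid execution: TOPS help prefill, bandwidth caps decode, and the NPU and GPU bring different strengths to each phase.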
FastFlowLM’s NPU Kernels Are Proprietary Binaries
The actual NPU inference is handled by FastFlowLM. FastFlowLM aims for an Ollama-like user experience while being optimized specifically for AMD NPUs, shipping Linux support (v0.9.35) in March 2026 and supporting context lengths up to 256k tokens.
However, the NPU acceleration kernels are proprietary binaries. Non-commercial use is free, but the source code is not published. As HN commenter zozbot234 pointed out, AMD/Xilinx’s NPU software stack itself (iron, mlir-aie, RyzenAI-SW) is open source, but FastFlowLM’s model execution layer is closed. There was also discussion about whether open NPU kernels could be developed using Vulkan Compute.
```mermaid
graph TD
    A[Lemonade Server] --> B[llama.cpp]
    A --> C[FastFlowLM]
    A --> D[whisper.cpp]
    A --> E[stable-diffusion.cpp]
    A --> F[Kokoro TTS]
    B --> G[Vulkan GPU]
    B --> H[ROCm GPU]
    B --> I[Metal GPU]
    B --> J[CPU]
    C --> K[XDNA2 NPU]
    C --> L["NPU Kernels<br/>(Proprietary)"]
    D --> M[NPU / Vulkan / CPU]
    E --> N[ROCm / CPU]
```
The ROCm vs Vulkan Problem
ROCm vs Vulkan performance is an unavoidable topic for local LLM on AMD hardware. I’ve dealt with Vulkan driver regressions myself, and the same discussion keeps coming up on HN.
Here’s the current landscape:
| Aspect | ROCm | Vulkan |
|---|---|---|
| Token generation speed (tg) | On par with or slightly slower than Vulkan, depending on GPU | Fast on RDNA2+; AMD officially cites Vulkan numbers in its marketing |
| Prompt processing speed (pp) | Faster than Vulkan | Significantly slower than ROCm |
| Stability | Frequent regressions in the 7.x series | Relatively stable |
| Driver management | Sparse desktop GPU support | Works with stock OS drivers |
A user named lrvick reported “20%+ speedup confirmed with Vulkan + kernel 7.0.0,” and Vulkan continues to be favorable on Strix Halo in particular. However, ROCm is still faster for use cases that prioritize prefill speed (time-to-first-token).
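The prefill trade-off is easy to quantify: time-to-first-token is just prompt length divided by prompt-processing rate, so pp speed dominates the perceived latency on long prompts. The pp rates below are hypothetical placeholders to show the shape of the trade-off, not measured ROCm or Vulkan numbers:

```python
# Illustrative only: why prompt-processing (pp) speed dominates
# time-to-first-token on long prompts. Both pp rates below are
# assumptions, not benchmarks of any real backend.

def time_to_first_token(prompt_tokens: int, pp_tokens_per_s: float) -> float:
    """Seconds spent processing the prompt before the first output token."""
    return prompt_tokens / pp_tokens_per_s

PROMPT = 16_000  # e.g. a large codebase stuffed into context

ttft_fast_pp = time_to_first_token(PROMPT, 900.0)  # assumed faster-pp backend
ttft_slow_pp = time_to_first_token(PROMPT, 300.0)  # assumed slower-pp backend

print(f"fast pp: {ttft_fast_pp:.1f} s to first token")
print(f"slow pp: {ttft_slow_pp:.1f} s to first token")
```

A 3x pp gap that is invisible on a one-line chat prompt becomes a wait of half a minute or more on a 16k-token coding context, which is why the tg-only numbers in marketing can be misleading.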
Just as Ollama has moved to an MLX backend, the inference stack on Apple Silicon is getting sorted out, but the AMD side is a three-way battle between ROCm/Vulkan/NPU. It’s easy to see why an integration layer like Lemonade is needed.
Real-World Performance
Here are some actual benchmarks pulled from HN comments:
| User | Hardware | Model | Quantization | Speed | Backend |
|---|---|---|---|---|---|
| lrvick | Strix Halo 128GB | Qwen3.5-122B | Unknown | 35 t/s | Vulkan |
| cpburns2009 | Framework Desktop 128GB | Qwen3-Coder-Next | Q4 | 43 t/s | Unknown |
| cpburns2009 | Framework Desktop 128GB | Qwen3.5-35B-A3B | Q4 | 55 t/s | Unknown |
| rpdillon | Strix Halo | GPT OSS 120B | Unknown | 50 t/s | ROCm (llamacpp-rocm) |
| lrvick | Radeon 6900 XT | Qwen3.5-32B | Unknown | 60+ t/s | Unknown |
Qwen3.5-122B running at 35 t/s on a Strix Halo 128GB setup seems reasonable when you compare it to my EVO-X2 (64GB) pushing 53 t/s on Qwen3.5-35B-A3B Q6_K, factoring in the model size difference. It’s also striking how many Framework Desktop (128GB) users showed up — Strix Halo is clearly becoming the main target for local inference.
CLI Usage
The workflow after installation is straightforward:
```bash
# List available models
lemonade list

# Download a model
lemonade pull Gemma-3-4b-it-GGUF

# Run a model (chat)
lemonade run Gemma-3-4b-it-GGUF

# Image generation
lemonade run SDXL-Turbo

# Text-to-speech
lemonade run kokoro-v1

# Speech recognition
lemonade run Whisper-Large-v3-Turbo

# Check available backends for your hardware
lemonade recipes
```
The API endpoint is http://localhost:13305/api/v1 and works directly with OpenAI-compatible client libraries; client examples are listed for Python, Node.js, Go, Rust, Java, C#, Ruby, and PHP.
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:13305/api/v1",
    api_key="lemonade"  # Required but unused
)

completion = client.chat.completions.create(
    model="Llama-3.2-1B-Instruct-Hybrid",
    messages=[{"role": "user", "content": "Hello"}]
)
```
How It Differs from Ollama
Naturally the comparison with Ollama and LM Studio comes up, and the biggest differentiator is multi-modal integration. There’s virtually no other local tool that handles text, image, and audio in a single server. Normally, running image generation, speech recognition, and an LLM locally means standing up three separate services and managing three APIs. Lemonade consolidates all of that into one daemon.
The other differentiator is AMD hardware optimization — Lemonade handles ROCm builds and NPU support for you. On NVIDIA hardware there’s honestly not much benefit (there’s no official CUDA support, though apparently you can make it work by manually swapping in a different llama.cpp version).
Ollama and LM Studio still have advantages though: the breadth of the model ecosystem, the seamless GGUF pull experience, and the sheer community size. Lemonade sits at 2.1k GitHub stars, while Ollama operates at a drastically larger scale. The split is clear: Lemonade for AMD users who want the full package, Ollama for those who want platform-agnostic simplicity.
Roadmap
| In Development | Under Consideration | Recently Completed |
|---|---|---|
| MLX support | vLLM support | macOS (beta) |
| Additional whisper.cpp backends | Custom model expansion | Image generation |
| Additional SD.cpp backends | Speech recognition & TTS | |
| App marketplace | | |
MLX support being in development is interesting. Given that Ollama has moved to MLX, if Lemonade migrates from its current macOS beta (llama.cpp Metal) to an MLX backend, it could become a serious option on Apple Silicon too. vLLM support would enable batch inference and production deployments with full OpenAI API compatibility.
Mobile apps (iOS/Android) are already available for connecting to local servers. The source code is also on GitHub.