AMD's Lemonade Local AI Server Bundles GPU, NPU, and Multi-Modal Inference Under One Roof
AMD has released Lemonade (official site, GitHub), a local AI server. It bundles not just LLM inference but also image generation (Stable Diffusion), speech recognition (Whisper), and text-to-speech (Kokoro TTS) into a single server, all accessible through an OpenAI-compatible API. The project gathered 518 points and 107 comments on Hacker News, with particularly positive reactions from Strix Halo users saying “local inference on AMD hardware finally works properly.”
I’ve been building a local LLM setup on my EVO-X2 (Strix Halo) and fighting with driver issues, VRAM allocation pitfalls, and Vulkan regressions ever since. Lemonade is positioned to absorb exactly this kind of hassle.
Where Lemonade Fits In
Lemonade is not an inference engine itself. Under the hood, it orchestrates existing backends like llama.cpp, FastFlowLM, whisper.cpp, stable-diffusion.cpp, and Kokoro, automatically selecting the optimal configuration for your hardware. It’s similar to LM Studio or Ollama, but differs in that it unifies modalities beyond text generation (images and audio).
In the HN comments, developer sawansri explained that "Lemonade is designed as a turnkey (optimized for AMD Hardware) for local AI models" — in other words, raw performance matches vanilla llama.cpp, and the value is drastically reduced setup friction.
The integration ecosystem is broad too. It works with VS Code Copilot, Open WebUI, n8n, Dify, OpenHands, Continue, GitHub Copilot, and more, connecting existing tools through OpenAI/Ollama/Anthropic-compatible endpoints.
Supported Backends and Hardware
For text generation alone, the following backends are available:
| Backend | API | Supported Devices | OS |
|---|---|---|---|
| llama.cpp (Vulkan) | Generic GPU | AMD iGPU/dGPU, x86_64 CPU | Windows, Linux |
| llama.cpp (ROCm) | ROCm | RDNA3/RDNA4/Strix Halo | Windows, Linux |
| llama.cpp (Metal) | Metal | Apple Silicon GPU | macOS (beta) |
| llama.cpp (CPU) | CPU instructions | x86_64 | Windows, Linux |
| FastFlowLM (FLM) | NPU | XDNA2 NPU | Windows, Linux |
| ryzenai-llm | NPU | XDNA2 NPU | Windows |
On top of these, whisper.cpp (speech recognition), stable-diffusion.cpp (image generation), and Kokoro (text-to-speech) each run as independent backends. Running the lemonade recipes command lists all available combinations for your machine.
ROCm-compatible GPUs:
| Architecture | Example GPU Models |
|---|---|
| gfx1151 (Strix Halo) | Ryzen AI MAX+ Pro 395 |
| gfx120X (RDNA4) | Radeon AI PRO R9700, RX 9070 XT/9070, RX 9060 XT |
| gfx110X (RDNA3) | Radeon PRO W7900/W7800, RX 7900 XTX/XT/GRE, RX 7800 XT |
The Reality of NPU
The most heated debate in the HN comments was around the usefulness of the NPU (Neural Processing Unit). As things stand, the NPU is too underpowered to serve as the primary inference device; it is better positioned for low-power, small-model workloads.
The XDNA2-architecture NPU in AMD’s Ryzen AI series claims up to 60 TOPS (per AMD’s official figures). For perspective, even an entry-level RTX 3050 exceeds the 40 TOPS that Microsoft requires for Copilot+ PC certification.
cpburns2009:

> The NPU is entirely useless for the Framework Desktop, and really all Strix Halo devices. Where it could be useful is cell phones.
Where NPU Actually Helps
That said, it’s not completely useless.
One use case is continuously running small models such as Qwen3’s TTS model (under 2B parameters) or Whisper’s speech recognition models (under 1B). Offloading this audio work to the NPU frees up GPU VRAM for LLM inference.
Another is prefill offloading (initial processing of prompts). Prefill for long contexts is computationally expensive and can hit GPU power limits. Lemonade offers a “hybrid execution” mode where the NPU handles part of the prefill while the GPU focuses on token generation. According to AMD’s official technical article, hybrid NPU+iGPU execution on the Ryzen AI 300 series optimizes power efficiency.
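A rough back-of-envelope calculation illustrates why this split makes sense: prefill is compute-bound (roughly 2 FLOPs per parameter per prompt token), while decode is memory-bandwidth-bound. The parameter count, bandwidth, and quantization figures below are illustrative assumptions, not Lemonade measurements:

```python
# Back-of-envelope: why prefill suits an NPU offload while decode stays
# on the GPU. Assumed numbers (illustrative only): a 3B-active-parameter
# model, a 60 TOPS NPU, 256 GB/s memory bandwidth.

def prefill_flops(params: float, prompt_tokens: int) -> float:
    """Prefill is compute-bound: ~2 * params FLOPs per prompt token."""
    return 2 * params * prompt_tokens

def decode_tokens_per_s(bandwidth_bytes: float, weight_bytes: float) -> float:
    """Decode is memory-bound: each generated token re-reads the active weights."""
    return bandwidth_bytes / weight_bytes

PARAMS = 3e9        # assumed active parameters
PROMPT = 32_000     # a long-context prompt
NPU_TOPS = 60e12    # claimed XDNA2 peak

# A 32k-token prefill costs ~192 TFLOPs: seconds of pure compute, worth
# moving off the GPU so it can keep generating tokens in parallel.
seconds_on_npu = prefill_flops(PARAMS, PROMPT) / NPU_TOPS
print(f"prefill: {seconds_on_npu:.1f} s at NPU peak")

# Decode ceiling at ~Q4 (assumed ~0.6 byte/param): bandwidth, not TOPS,
# sets the generation-speed limit.
tps = decode_tokens_per_s(256e9, PARAMS * 0.6)
print(f"decode ceiling: {tps:.0f} t/s")
```

The asymmetry is the whole argument for hybrid execution: TOPS help prefill, bandwidth caps decode, and the NPU and GPU bring different strengths to each phase.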
FastFlowLM’s NPU Kernels Are Proprietary Binaries
The actual NPU inference is handled by FastFlowLM. FastFlowLM aims for an Ollama-like user experience while being optimized specifically for AMD NPUs, shipping Linux support (v0.9.35) in March 2026 and supporting context lengths up to 256k tokens.
However, the NPU acceleration kernels are proprietary binaries. Non-commercial use is free, but the source code is not published. As HN commenter zozbot234 pointed out, AMD/Xilinx’s NPU software stack itself (iron, mlir-aie, RyzenAI-SW) is open source, but FastFlowLM’s model execution layer is closed. There was also discussion about whether open NPU kernels could be developed using Vulkan Compute.
```mermaid
graph TD
    A[Lemonade Server] --> B[llama.cpp]
    A --> C[FastFlowLM]
    A --> D[whisper.cpp]
    A --> E[stable-diffusion.cpp]
    A --> F[Kokoro TTS]
    B --> G[Vulkan GPU]
    B --> H[ROCm GPU]
    B --> I[Metal GPU]
    B --> J[CPU]
    C --> K[XDNA2 NPU]
    C --> L["NPU Kernels<br/>(Proprietary)"]
    D --> M[NPU / Vulkan / CPU]
    E --> N[ROCm / CPU]
```
The ROCm vs Vulkan Problem
ROCm vs Vulkan performance is an unavoidable topic for local LLM on AMD hardware. I’ve dealt with Vulkan driver regressions myself, and the same discussion keeps coming up on HN.
Here’s the current landscape:
| Aspect | ROCm | Vulkan |
|---|---|---|
| Token generation speed (tg) | On par with or slightly slower than Vulkan, depending on GPU | Fast on RDNA2+; AMD officially cites Vulkan numbers in its marketing |
| Prompt processing speed (pp) | Faster than Vulkan | Significantly slower than ROCm |
| Stability | Frequent regressions in the 7.x series | Relatively stable |
| Driver management | Sparse desktop GPU support | Works with stock OS drivers |
A user named lrvick reported “20%+ speedup confirmed with Vulkan + kernel 7.0.0,” and Vulkan continues to be favorable on Strix Halo in particular. However, ROCm is still faster for use cases that prioritize prefill speed (time-to-first-token).
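The prefill trade-off is easy to quantify: time-to-first-token is just prompt length divided by prompt-processing rate, so pp speed dominates the perceived latency on long prompts. The pp rates below are hypothetical placeholders to show the shape of the trade-off, not measured ROCm or Vulkan numbers:

```python
# Illustrative only: why prompt-processing (pp) speed dominates
# time-to-first-token on long prompts. Both pp rates below are
# assumptions, not benchmarks of any real backend.

def time_to_first_token(prompt_tokens: int, pp_tokens_per_s: float) -> float:
    """Seconds spent processing the prompt before the first output token."""
    return prompt_tokens / pp_tokens_per_s

PROMPT = 16_000  # e.g. a large codebase stuffed into context

ttft_fast_pp = time_to_first_token(PROMPT, 900.0)  # assumed faster-pp backend
ttft_slow_pp = time_to_first_token(PROMPT, 300.0)  # assumed slower-pp backend

print(f"fast pp: {ttft_fast_pp:.1f} s to first token")
print(f"slow pp: {ttft_slow_pp:.1f} s to first token")
```

A 3x pp gap that is invisible on a one-line chat prompt becomes a wait of half a minute or more on a 16k-token coding context, which is why the tg-only numbers in marketing can be misleading.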
Just as Ollama has moved to an MLX backend, the inference stack on Apple Silicon is getting sorted out, but the AMD side is a three-way battle between ROCm/Vulkan/NPU. It’s easy to see why an integration layer like Lemonade is needed.
Real-World Performance
Here are some actual benchmarks pulled from HN comments:
| User | Hardware | Model | Quantization | Speed | Backend |
|---|---|---|---|---|---|
| lrvick | Strix Halo 128GB | Qwen3.5-122B | Unknown | 35 t/s | Vulkan |
| cpburns2009 | Framework Desktop 128GB | Qwen3-Coder-Next | Q4 | 43 t/s | Unknown |
| cpburns2009 | Framework Desktop 128GB | Qwen3.5-35B-A3B | Q4 | 55 t/s | Unknown |
| rpdillon | Strix Halo | GPT OSS 120B | Unknown | 50 t/s | ROCm (llamacpp-rocm) |
| lrvick | Radeon 6900 XT | Qwen3.5-32B | Unknown | 60+ t/s | Unknown |
Qwen3.5-122B running at 35 t/s on a Strix Halo 128GB setup seems reasonable when you compare it to my EVO-X2 (64GB) pushing 53 t/s on Qwen3.5-35B-A3B Q6_K, factoring in the model size difference. It’s also striking how many Framework Desktop (128GB) users showed up — Strix Halo is clearly becoming the main target for local inference.
CLI Usage
The workflow after installation is straightforward:
```bash
# List available models
lemonade list

# Download a model
lemonade pull Gemma-3-4b-it-GGUF

# Run a model (chat)
lemonade run Gemma-3-4b-it-GGUF

# Image generation
lemonade run SDXL-Turbo

# Text-to-speech
lemonade run kokoro-v1

# Speech recognition
lemonade run Whisper-Large-v3-Turbo

# Check available backends for your hardware
lemonade recipes
```
The API endpoint is http://localhost:13305/api/v1 and works directly with OpenAI-compatible client libraries; client examples are listed for Python, Node.js, Go, Rust, Java, C#, Ruby, and PHP.
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:13305/api/v1",
    api_key="lemonade"  # Required but unused
)

completion = client.chat.completions.create(
    model="Llama-3.2-1B-Instruct-Hybrid",
    messages=[{"role": "user", "content": "Hello"}]
)
```
How It Differs from Ollama
Naturally the comparison with Ollama and LM Studio comes up, and the biggest differentiator is multi-modal integration. There’s virtually no other local tool that handles text, image, and audio in a single server. Normally, running image generation, speech recognition, and an LLM locally means standing up three separate services and managing three APIs. Lemonade consolidates all of that into one daemon.
The other differentiator is AMD hardware optimization — Lemonade handles ROCm builds and NPU support for you. On NVIDIA hardware there’s honestly not much benefit (there’s no official CUDA support, though apparently you can make it work by manually swapping in a different llama.cpp version).
Ollama and LM Studio still have advantages though: the breadth of the model ecosystem, the seamless GGUF pull experience, and the sheer community size. Lemonade sits at 2.1k GitHub stars, while Ollama operates at a drastically larger scale. The split is clear: Lemonade for AMD users who want the full package, Ollama for those who want platform-agnostic simplicity.
Roadmap
| In Development | Under Consideration | Recently Completed |
|---|---|---|
| MLX support | vLLM support | macOS (beta) |
| Additional whisper.cpp backends | Custom model expansion | Image generation |
| Additional SD.cpp backends | Speech recognition & TTS | |
| App marketplace | | |
MLX support being in development is interesting. Given that Ollama has moved to MLX, if Lemonade migrates from its current macOS beta (llama.cpp Metal) to an MLX backend, it could become a serious option on Apple Silicon too. vLLM support would enable batch inference and production deployments with full OpenAI API compatibility.
Mobile apps (iOS/Android) are already available for connecting to local servers. The source code is also on GitHub.