Lemonade is AMD's open-source local AI server that manages multiple backends like llama.cpp and FastFlowLM across GPU/NPU/CPU, serving text, image, and audio generation through an OpenAI-compatible API.
Google DeepMind has released Gemma 4: four models—31B dense, 26B MoE (A4B), E4B, and E2B—with a 256K context, multimodal input, tool calling, and support for 140 languages.
SwiftLM, an Apple Silicon–only MLX inference server, provides a native Metal implementation of TurboQuant V2+V3 hybrid KV‑cache compression and NVMe SSD expert streaming.
Ollama 0.19 switches the Apple Silicon backend to MLX, achieving 1,810 tokens/s prefill and 112 tokens/s decode. NVFP4 quantization support and cache improvements landed at the same time.
Qwen3.5-35B-A3B is an SSM+Attention hybrid where only 10 of 40 layers use KV cache. Expanding ctx-size from 4096 to 65536 on llama-server added just 800MB VRAM with zero speed loss. Includes q8_0 KV quantization benchmarks and TurboQuant status.
Hypura breaks away from llama.cpp’s mmap design and streams even dense models with a three-tier NVMe placement, while TurboQuant eliminates quantization-constant overhead via a polar-coordinate transform. Includes a design comparison with Flash‑MoE and a review of scenarios where KV‑cache compression actually helps.
Flash-MoE is a C/Metal inference engine that runs Qwen3.5-397B-A17B on a MacBook Pro M3 Max at 4.36 tokens/s. With expert streaming from SSD and hand-written Metal shaders, it fits the 209GB model into a 48GB memory budget.
H Company's Holotron-12B uses a memory-efficient new design to lift PC-operation AI throughput to 8,900 tokens per second. Unsloth has released the beta of 'Studio,' a browser tool for no-code model fine-tuning.
All variants of huihui-ai's Qwen 3.5 abliterated produced garbage tokens. GLM-4.7-Flash abliterated had a broken chat template. The official version with thinking disabled turned out to be the right answer.