#Inference Optimization

11 articles

Tech · Jun 28, 2026 · 8 min

DeepSeek-V4-Pro-DSpark is not a new model but V4-Pro with a speculative-decoding head

DeepSeek-V4-Pro-DSpark isn't a new base model. It's the same 1.6T V4-Pro checkpoint plus a DSpark speculative-decoding head (~893GB). What config.json and the DeepSpec repo reveal, and why there's no speed benchmark yet.

LLM, DeepSeek, Chinese AI, MoE, Inference Optimization, Open Model, Speculative Decoding

Tech · May 14, 2026 · 29 min

oMLX 0.3.9.dev2 tested on M1 Max 64GB: SSD cache wins, VLM MTP slower

Tested oMLX 0.3.9.dev2 on M1 Max 64GB across 11 scenarios: SSD KV cache cuts Copilot prefill 88s→33s, VLM MTP slows decode 12-30%, omlx launch reaches Copilot/Codex/Claude Code.

AI, LLM, Local LLM, Apple Silicon, MLX, Inference Optimization, Codex, Experiment

Tech · May 13, 2026 (updated) · 8 min

oMLX 0.3.9.dev2 for Mac coding agents: Gemma 4 VLM MTP, DFlash, launch copilot

The oMLX 0.3.9.dev2 release notes, read from the angle of running Codex/Copilot against local LLMs on a Mac: Gemma 4 VLM MTP, DFlash, omlx launch copilot, and SSD KV cache, and what each changes for agent workflows.

AI, LLM, Local LLM, Apple Silicon, MLX, Inference Optimization, Codex

Tech · Apr 24, 2026 · 9 min

TRACER trains a surrogate from LLM classification API logs and swaps in via a parity gate

TRACER, a recent arXiv paper, reuses the input/output logs of an LLM classification endpoint as training data, then swaps in a lightweight surrogate only on regions that pass a parity gate, cutting inference cost. The surrogate absorbs 83–100% of traffic on a 77-class intent dataset and 100% on a 150-class one, while correctly refusing to deploy on an NLI task. That refusal behavior is the interesting part.

AI, LLM, Machine Learning, Paper, Inference Optimization

Tech · Apr 3, 2026 · 8 min

Running Lemonade on Strix Halo (EVO-X2): Vulkan Shared Memory Leaks and ROCm Stability

Real-world testing of AMD Lemonade v10.0.1 on Ryzen AI Max+ 395: running LLM, image generation, speech recognition, and TTS simultaneously, NPU Hybrid execution, Vulkan vs ROCm benchmarks, and the discovery of shared memory leaks.

AMD, Local LLM, Vulkan, ROCm, NPU, llama.cpp, GPU, Inference Optimization, Benchmark, Experiment

Tech · Apr 3, 2026 · 8 min

AMD's Lemonade Local AI Server Bundles GPU, NPU, and Multi-Modal Inference Under One Roof

Lemonade is AMD's open-source local AI server that manages multiple backends like llama.cpp and FastFlowLM across GPU/NPU/CPU, serving text, image, and audio generation through an OpenAI-compatible API.

AMD, Local LLM, NPU, GPU, llama.cpp, Inference Optimization, ROCm, Vulkan

Tech · Apr 2, 2026 (updated) · 13 min

SwiftLM is a Swift-based LLM inference server that integrates TurboQuant and SSD streaming into Metal shaders

SwiftLM, an Apple Silicon–only MLX inference server, provides a native Metal implementation of TurboQuant V2+V3 hybrid KV‑cache compression and NVMe SSD expert streaming.

Apple Silicon, LLM, MLX, Local LLM, Inference Optimization, KV Cache, MoE, Swift

Tech · Mar 31, 2026 · 6 min

Ollama Moves to MLX Backend, Dramatically Speeds Up Local Inference on Apple Silicon

Ollama 0.19 switches the Apple Silicon backend to MLX, achieving 1,810 tokens/s prefill and 112 tokens/s decode. NVFP4 quantization support and cache improvements landed at the same time.

Ollama, MLX, Apple Silicon, LLM, Local LLM, Inference Optimization

Tech · Mar 25, 2026 · 17 min

Hypura’s NVMe Streaming and TurboQuant’s KV Cache Quantization

Hypura breaks away from llama.cpp’s mmap design and streams even dense models with a three-tier NVMe placement, while TurboQuant eliminates quantization-constant overhead via a polar-coordinate transform. Includes a design comparison with Flash‑MoE and a review of scenarios where KV‑cache compression actually helps.

LLM, Local LLM, Quantization, Apple Silicon, Inference Optimization, KV Cache, Rust

Tech · Mar 6, 2026 · 10 min

Back-to-back releases of OpenAI GPT-5.3/5.4 and Saguaro-driven inference speedups

A summary of GPT-5.3 Instant’s hallucination reductions and safety regressions, GPT-5.4’s computer use, Tool Search, and 1M-token context, plus Saguaro’s 5× inference speedups.

LLM, OpenAI, GPT, Inference Optimization, Speculative Decoding, AI Safety, Computer Use

Tech · Feb 20, 2026 (updated) · 13 min

Accelerating LLM Inference: CDLM and Attention Matching KV Compaction

Two February 2026 papers on reducing inference cost: Together AI’s Consistency DLM (up to 14.5× faster) and MIT/Harvard’s Attention Matching KV compaction (50× compaction in seconds).

AI, LLM, Inference Optimization, KV Cache, Diffusion models