#MLX

13 articles

Tech May 14, 2026 29 min

oMLX 0.3.9.dev2 tested on M1 Max 64GB: SSD cache wins, VLM MTP slower

Tested oMLX 0.3.9.dev2 on M1 Max 64GB across 11 scenarios: SSD KV cache cuts Copilot prefill 88s→33s, VLM MTP slows decode 12-30%, omlx launch reaches Copilot/Codex/Claude Code.

AI LLM Local LLM Apple Silicon MLX Inference Optimization Codex Experiment

Tech May 13, 2026 updated 8 min

oMLX 0.3.9.dev2 for Mac coding agents: Gemma 4 VLM MTP, DFlash, launch copilot

oMLX 0.3.9.dev2 release notes from the angle of Codex/Copilot on Mac local LLMs: Gemma 4 VLM MTP, DFlash, omlx launch copilot, SSD KV cache — what each changes for agent workflows.

AI LLM Local LLM Apple Silicon MLX Inference Optimization Codex

Tech May 8, 2026 11 min

FLUX.2 Klein 9B + NSFW LoRA on M1 Max 64GB via mflux: 1m51s/512, 5m37s/1024 q4

Tested Klein 9B + 9B NSFW LoRA on M1 Max 64GB via mflux 0.17.5: 1m51s/512, 5m37s/1024 q4, 224/224 LoRA keys match, NSFW prompts uncensored, Japanese subjects work with helper tokens.

AI Image Generation FLUX Apple Silicon Mac MLX LoRA Experiment

Tech May 7, 2026 8 min

Gemma 4 MTP drafter on M1 Max 64GB: 26B A4B +13%, 31B Dense and E4B got slower

Tested Gemma 4 MTP drafter on M1 Max 64GB with mlx-vlm 0.5.0. Only the 26B A4B MoE got +13%; 31B Dense and E4B got slower. Switching between code generation and short haiku prompts flips the result.

AI LLM Google Gemma Local LLM Inference MLX Experiment

Tech May 4, 2026 updated 13 min

Can FLUX.2 Klein NSFW LoRAs actually run on an M1 Max?

Investigated whether NSFW LoRAs for FLUX.2 Klein 9B can run on M1 Max 64GB. Covers model compatibility, LoRA application paths, RunPod verification strategy, and VRAM requirements for training your own LoRA with ai-toolkit.

AI Image Generation FLUX Apple Silicon Mac MLX LoRA Experiment

Tech Apr 30, 2026 updated 12 min

FLUX.2 Klein 4B benchmarked on M1 Max with mflux vs iris.c

Hands-on benchmark of FLUX.2 Klein 4B on M1 Max 64GB using mflux (MLX) and iris.c (pure C + Metal). A counter to Pruna AI's H100-only tutorial — measuring how fast Apple Silicon actually gets there.

AI Image Generation FLUX Apple Silicon Mac MLX Experiment

Tech Apr 25, 2026 updated 11 min

Ling-flash-2.0 MXFP4 (bailing_moe) on SwiftLM + M1 Max 64GB: working config, support check, --stream-experts notes

Hands-on running inclusionAI Ling-flash-2.0 (100B / 6.1B active, MXFP4 quant, 54.7GB) on SwiftLM via mlx-swift-lm on an M1 Max 64GB. Covers bailing_moe + MXFP4 support check in mlx-swift, the startup surprise, and what --stream-experts actually saves.

Apple Silicon LLM MLX Local LLM Swift SwiftLM MoE MXFP4 Ant Group Experiment

Tech Apr 24, 2026 13 min

Running SwiftLM on M1 Max 64GB and Comparing It to Ollama and MLX-lm

A hands-on build and run of the Swift-based LLM inference server SwiftLM on an M1 Max 64GB. Covers Qwen3.6-35B-A3B and Qwen3.5-122B-A10B, with the same BST, BBS, and persona tests used in the existing Ollama and MLX-lm write-ups.

Apple Silicon LLM MLX Local LLM Swift SwiftLM MoE Experiment

Tech Apr 23, 2026 13 min

Qwen3.6-27B Dense vs Qwen3.6-35B-A3B MoE on M1 Max — MLX Was 2× Faster Than Ollama

Tried Qwen3.6-27B on both Ollama and MLX. Ollama couldn't load the VL-projector-embedded GGUF, MLX ran it at 11 tok/s. On the side, running 35B-A3B under MLX was roughly 2× faster than the Ollama GGUF. Also had both models build a BBS to gauge intent handling.

LLM Local LLM Qwen Ollama MLX Apple Silicon MoE Experiment

Tech Apr 19, 2026 13 min

Zero-copy GPU inference on Apple Silicon with WebAssembly and Metal

A three-link chain of mmap → MTLBuffer(bytesNoCopy) → Wasmtime MemoryCreator that makes a Wasm linear memory share the same physical bytes as a Metal GPU buffer. Llama 3.2 1B runs at 9ms/token on M1.

WebAssembly Metal AppleSilicon MLX Wasmtime LLM

Tech Apr 16, 2026 14 min

How Far Has AMD ROCm Come in Catching Up to CUDA?

Based on EE Times' interview with AMD AI Software VP Anush Elangovan, we assess the ROCm vs CUDA ecosystem gap. Includes hands-on experience with ROCm breaking four times on Strix Halo, plus practical guidance on choosing between NVIDIA, AMD, and Apple Silicon.

AMD NVIDIA ROCm CUDA GPU AI Infrastructure PyTorch MLX Apple Silicon

Tech Apr 2, 2026 updated 13 min

SwiftLM is a Swift-based LLM inference server that integrates TurboQuant and SSD streaming into Metal shaders

SwiftLM, an Apple Silicon–only MLX inference server, provides a native Metal implementation of TurboQuant V2+V3 hybrid KV-cache compression and NVMe SSD expert streaming.

Apple Silicon LLM MLX Local LLM Inference Optimization KV Cache MoE Swift