
Flash-MoE: Running a 397B-parameter model on a 48GB MacBook

Flash-MoE is a C/Metal inference engine that runs Qwen3.5-397B-A17B (a 397‑billion‑parameter Mixture‑of‑Experts model) on a MacBook Pro M3 Max. It achieves 4.36 tokens/second in a 48GB unified‑memory environment and can produce production‑grade outputs, including tool calls.

It’s published on GitHub with working code and detailed logs from 58 experiments. Rather than a simple “it works” note, it’s an engaging record of what helped and what didn’t.

Model Architecture

Qwen3.5-397B-A17B is an MoE (Mixture of Experts) model from Alibaba. “397B” is the total parameter count, and “A17B” is the number of parameters actually activated per token (about 17B). Instead of using all parameters every time, MoE selectively routes only the necessary experts based on the input.

Among the model’s 60 Transformer layers:

  • 45 layers: GatedDeltaNet (linear attention)
  • 15 layers: standard full attention

Each layer contains 512 experts, of which 4 are activated per token. With 4‑bit quantization the model's weights occupy 209GB, which still far exceeds the 48GB of unified memory—so the design leans heavily on the SSD.

Core Idea: SSD Expert Streaming

The core design of Flash-MoE is to “load only the experts you need from SSD on demand.”

Per-expert weight size: ~6.75MB (with 4-bit quantization)
Weights read per layer, per token: 4 × 6.75MB = 27MB
SSD read throughput (measured): 17.5GB/s

In the implementation, multiple experts are fetched concurrently via parallel pread() calls. By relying on the OS page cache, the cache hit rate naturally reaches roughly 71%, removing the need to build a custom cache.

Non‑expert weights (parts shared across all layers) total 5.5GB and are kept resident in memory, with 200MB reserved as scratch space during inference, for a total memory footprint of about 6GB. Out‑of‑memory risk is close to zero.

Metal Compute Shaders

GPU computations on the Mac are implemented with hand‑written Apple Metal shaders.

Kernel | Content
4‑bit matrix–vector product | SIMD reductions, tiling, shared input cache
2‑bit matrix–vector product | Same (faster but larger quantization error)
RMS normalization | Two‑pass (sum‑of‑squares reduction → apply)
GPU attention | Q@K^T, softmax, scores@V
MoE combine | Weighted sum of expert outputs

For the 4‑bit matrix–vector product, FMA (Fused Multiply‑Add) optimization pays off. A naïve order of operations is (nibble * scale + bias) * x, but rewriting it as fma(nibble, scale*x, bias*x) leverages the GPU’s FMA units and yields a 12% speedup.

Metal Performance Shaders (MPS) is Apple's library of tuned GPU kernels built on top of Metal—roughly what cuBLAS/cuDNN are to CUDA. Where those libraries target NVIDIA GPUs, MPS targets the integrated GPU on Apple Silicon. By writing its own Metal shaders, Flash‑MoE reaches MPS‑class performance while keeping full control over kernel behavior.

Speeding Up GatedDeltaNet

GatedDeltaNet, used in 45 of the 60 layers, is a form of linear attention with lower compute cost than standard attention. The core is a recurrent per-layer update of a state of 64 heads, each holding a 128×128 matrix.

This computation uses Apple’s Accelerate BLAS functions (cblas_sscal, cblas_sgemv, cblas_sger), delivering a 64% speedup over a scalar implementation.

Deferred GPU Execution (Pipeline Optimization)

Average per‑layer time is 4.28ms, broken down as:

Task | Time
GPU work (attention/normalization, etc.) | 1.22ms + 0.55ms
CPU work (GatedDeltaNet) | 0.013ms
SSD I/O (expert loads) | 2.41ms

Because I/O dominates, the pipeline uses deferred execution: enqueue GPU work (CMD3: expert forward) asynchronously, and keep the CPU busy preparing the next layer while the GPU runs. However, on Apple Silicon the SSD DMA and GPU compute share the memory controller, so excessive parallelism can actually slow things down. A mostly serial pipeline turned out to be optimal.

Performance

Config | Throughput | Tool use | Model size
4‑bit (FMA‑optimized) | 4.36 tok/s | Stable | 209GB
4‑bit (baseline) | 3.90 tok/s | Stable | 209GB
2‑bit | 5.74 tok/s | Unstable | 120GB

The 2‑bit quantized build is faster, but it corrupted JSON outputs. Concretely, "name" turned into \name\, which broke parsing for tool calls. The 4‑bit configuration is recommended for production use.

Discarded Optimizations (from 58 experiments)

They also documented approaches that didn’t pan out.

Approach | Result | Reason
LZ4 compression | −13% (worse) | Decompression overhead outweighed read reduction
Memory‑map (mmap) | ×0.2 (much worse) | Page‑fault overhead was high
Prefetching | −73% (much worse) | Unified‑memory bandwidth limits made it counterproductive
Predictive expert routing | 31% accuracy (ineffective) | Prediction accuracy was too low

The prefetching result is especially interesting: while prefetch helps on typical NVMe SSDs, Apple Silicon’s architecture—where SSD, GPU, CPU, and NPU share a unified memory bus—caused prefetch to compete for bandwidth with other work and backfire.

How to Run

cd metal_infer
make
./chat                                            # chat with tool-calling support
./infer --prompt "Please explain" --tokens 100    # single-shot inference
./infer --prompt "Hello" --tokens 20 --timing     # with timing measurements

Required environment: MacBook Pro M3 Max (48GB unified memory, 1TB Apple Fabric SSD) with macOS 26.2 or later. Download the Qwen3.5-397B-A17B model weights from Hugging Face in advance.

Offloading Strategies: Layer/Tensor vs. SSD Streaming

There are two broad approaches to the “model doesn’t fit in memory” problem for local LLM inference.

In the article on the BERT+Qwen OCR correction tool, I tried llama.cpp’s offload features. llama.cpp supports layer‑level CPU/GPU partitioning via --n-gpu-layers and tensor‑level partitioning via --override-tensor. Offloading FFN tensors to the CPU while keeping attention on the GPU has been reported on Reddit to improve speed by over 200%.

However, this approach assumes the model fits in CPU memory plus VRAM combined. On an RTX 3050 Ti (4GB) + 16GB main memory, the practical ceiling was 4B–9B models, and with short prompts (as in A/B testing) the benefit from GPU parallelism is small, so tensor offload made little difference.

Flash‑MoE takes a fundamentally different route. A 209GB model fits in neither CPU memory nor VRAM, so the SSD is used as “very slow but very large VRAM.”

Aspect | llama.cpp offload | Flash‑MoE SSD streaming
Target | Models that fit in memory (up to a few dozen GB) | Models that do not fit in memory (209GB)
GPU use | Distribute by layer/tensor | All GPU compute via Metal shaders
Bottleneck | CPU–GPU transfers | SSD read bandwidth (17.5GB/s)
Cache | None (all weights resident in memory) | OS page cache (71% hit rate)
MoE use | General (dense/MoE alike) | MoE‑specific (assumes 4/512 sparsity per token)

SSD streaming works for Flash‑MoE because of MoE sparsity—only 4 of 512 experts activate per token. Doing the same with a dense 397B‑parameter model would be hopeless: every token would have to re-read the full weights, and SSD bandwidth would dominate everything. Conversely, llama.cpp's layer offload also works for dense models, but only up to sizes that fit in memory.

Apple Silicon’s unified memory affects the two approaches differently. With llama.cpp offload, CPU and GPU share a single address space, so the very notion of “offload” becomes fuzzy. When I ran the OCR correction tool on Apple Silicon (M1 Max 64GB), both BERT and the LLM naturally ran on the GPU, and I didn’t have to manage VRAM as carefully as in CUDA environments. In Flash‑MoE, however, the unified memory bus is shared by SSD DMA and GPU compute, so prefetch becomes counterproductive—an Apple‑Silicon‑specific constraint. The same unified‑memory architecture shows up as a benefit in one case and a constraint in the other.