
Running FLUX.2 Klein 9B on Apple Silicon Macs

What is FLUX.2 Klein

An image generation model released by Black Forest Labs (founded by members of the Stable Diffusion team). It’s a new line in the FLUX series.

Specs

| Item | Details |
|---|---|
| Parameters | 9 billion (9B) |
| Architecture | Rectified Flow Transformer |
| VRAM requirement | ~29GB |
| Inference speed | Under 1 second on RTX 4090 |
| License | Non-commercial (FLUX Non-Commercial License) |

Where it fits in the FLUX lineup

| Model | Parameters | Characteristics |
|---|---|---|
| FLUX.1 [pro/dev] | 12B | Flagship |
| FLUX.1 [schnell] | 12B (distilled) | Fast but less diverse output |
| FLUX.2 [klein] | 9B | Lightweight, no distillation |
| FLUX.2 [klein] 4B | 4B | Even lighter |

The key thing about klein is “lightweight without distillation.” schnell achieved speed through distillation, but sacrificed output diversity. klein reduces parameter count without distillation, keeping output diversity while cutting weight.


Does it run on Apple Silicon

Short answer: it runs, but it’s not practical

Expected results on M1 Max (64GB unified memory):

| Item | Status |
|---|---|
| Memory | 64GB clears the 29GB requirement |
| MPS support | diffusers MPS support is unstable |
| FP8 quantization | Not supported |
| Estimated speed | 3-5 minutes for 1024x1024 |

A task that takes 12 seconds on an RTX 4090 takes over 3 minutes here.

Generation time comparison

| Chip | 1024x1024 generation time |
|---|---|
| RTX 4090 | 12-18 seconds |
| M4 Max | 85 seconds |
| M3 Max | 105 seconds |
| M2 Max | 145 seconds |
| M1 Max | Estimated 180-240 seconds |

Why the massive performance gap

The RTX 4090 has 24GB of VRAM, less than the model's 29GB requirement, yet with CPU offloading it is still dramatically faster than an M1 Max with 64GB of unified memory. VRAM capacity is a pass/fail thing: once you have enough (or can offload the overflow), extra capacity doesn't make generation faster. The actual speed gap comes from the factors below.

1. Memory bandwidth (the biggest factor)

| GPU/Chip | Memory bandwidth |
|---|---|
| RTX 4090 | 1,008 GB/s |
| M4 Max | 546 GB/s |
| M1 Max | 400 GB/s |

Transformer inference is memory-bandwidth-bound: the bottleneck is how fast the weights can be read from memory, not raw compute. The RTX 4090 has about 2.5x the bandwidth of the M1 Max, so bandwidth alone accounts for a 2.5x gap before any other factor.
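To see why bandwidth dominates, a back-of-the-envelope sketch: each denoising step has to read all the weights once, so a hard floor on step time is weight bytes divided by bandwidth. The step count below is an illustrative assumption (it varies by scheduler):

```python
# Lower bound on per-step time for a memory-bandwidth-bound transformer:
# every denoising step streams all the weights through memory once.

def min_step_time_s(params: float, bytes_per_param: int, bandwidth_gbs: float) -> float:
    """Seconds needed just to read the weights once."""
    weight_gb = params * bytes_per_param / 1e9
    return weight_gb / bandwidth_gbs

PARAMS = 9e9   # FLUX.2 klein: 9B parameters
STEPS = 28     # assumed step count; varies by scheduler

for chip, bw in [("RTX 4090", 1008), ("M4 Max", 546), ("M1 Max", 400)]:
    step = min_step_time_s(PARAMS, 2, bw)  # 2 bytes/param = FP16/BF16
    print(f"{chip}: {step * 1000:.0f} ms/step, >= {step * STEPS:.1f} s total")
```

These are theoretical floors (real times are several times higher once attention, activations, and API overhead enter), but the ratios between chips track the bandwidth ratios.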

2. FP8 quantization support

| Format | CUDA | MPS |
|---|---|---|
| FP16 | ✓ | ✓ |
| BF16 | ✓ | ✓ |
| FP8 | ✓ | ✗ |

FP8 quantization cuts VRAM usage in half and improves bandwidth efficiency. MPS doesn’t support FP8, so it can’t benefit from this.
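The savings are easy to quantify: halving bytes per parameter halves both the weight footprint and the bytes streamed per step. A quick sketch (weights only; the ~29GB total also includes the text encoder, VAE, and activations, so the full requirement doesn't drop exactly in half):

```python
# Weight footprint of a 9B-parameter model at different precisions.
PARAMS = 9e9

def weights_gb(bytes_per_param: float) -> float:
    """Size of the weights alone, in GB."""
    return PARAMS * bytes_per_param / 1e9

fp16 = weights_gb(2)  # FP16/BF16: 2 bytes per parameter
fp8 = weights_gb(1)   # FP8: 1 byte per parameter
print(f"FP16 weights: {fp16:.0f} GB, FP8 weights: {fp8:.0f} GB")
# FP8 also halves the bytes read per step, so a bandwidth-bound step
# gets roughly 2x faster on hardware that supports it (CUDA, not MPS).
```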

3. Handling VRAM overflow

When running a 29GB model on an RTX 4090 (24GB):

```python
pipe.enable_model_cpu_offload()  # offload layers not currently in use to system RAM
```
  • CUDA: GPU-CPU transfers are fast over PCIe 4.0 x16 (32GB/s)
  • MPS: Uses unified memory but Metal API overhead is significant
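To put numbers on the offload path: if roughly 5GB of the 29GB doesn't fit in the 4090's 24GB of VRAM, streaming those layers over PCIe 4.0 x16 costs on the order of 150 ms per pass. The overflow size is an assumption, and real offloading moves whole submodules and overlaps transfers with compute, but the order of magnitude is the point:

```python
# Rough cost of streaming offloaded layers back over the bus each pass.

def transfer_time_s(overflow_gb: float, bus_gbs: float) -> float:
    """Seconds to move the overflow across the bus once."""
    return overflow_gb / bus_gbs

overflow = 29 - 24  # GB that don't fit in 24GB of VRAM (assumed split)
pcie4_x16 = 32      # GB/s, PCIe 4.0 x16
print(f"{transfer_time_s(overflow, pcie4_x16) * 1000:.0f} ms per pass")
# ~156 ms, small next to a multi-second generation, which is why
# CPU offload barely hurts on CUDA.
```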

4. CUDA kernel optimization

NVIDIA has spent over a decade optimizing kernels for Transformers. Flash Attention, cuBLAS, TensorRT — all these kick in.

MPS is a relatively new API and its optimizations haven’t caught up yet.
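These factors compound multiplicatively. The per-factor numbers below are rough illustrative guesses on my part, not benchmarks, but they show how a roughly 15x end-to-end gap (12 seconds vs 180 seconds) can emerge from individually modest differences:

```python
# Illustrative decomposition of the RTX 4090 vs M1 Max gap.
# The FP8 and kernel factors are rough assumptions, not measurements.
factors = {
    "memory bandwidth (1008 vs 400 GB/s)": 1008 / 400,  # ~2.5x
    "FP8 on CUDA vs FP16 on MPS": 2.0,                  # half the bytes per step
    "kernel maturity (Flash Attention, cuBLAS, ...)": 3.0,
}

total = 1.0
for name, factor in factors.items():
    total *= factor
print(f"combined: ~{total:.0f}x")  # ~15x, in line with 12 s vs 180 s
```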


Options for Apple Silicon users

Option 1: mflux (MLX implementation)

An MLX-based implementation optimized for Apple Silicon.

Option 2: flux2.c (pure C implementation)

A pure C implementation by antirez (creator of Redis). Targets the 4B version.

Option 3: Use a different model

If you’re not set on the 9B model:

  • FLUX.2 Klein 4B: Half the size
  • Z-Image: 6B, lightweight, Apache 2.0 license. I wrote a comparison article too

Option 4: Cloud GPUs

Rent an RTX 4090 by the hour on RunPod, Vast.ai, Lambda Labs, etc. If you’re serious about using this, this is the realistic option.