
Running FLUX.2 Klein 9B on Apple Silicon Macs

What is FLUX.2 Klein

An image generation model released by Black Forest Labs (founded by members of the Stable Diffusion team). It’s a new line in the FLUX series.

Specs

| Item | Details |
|---|---|
| Parameters | 9 billion (9B) |
| Architecture | Rectified Flow Transformer |
| VRAM requirement | ~29GB |
| Inference speed | Under 1 second on RTX 4090 |
| License | Non-commercial (FLUX Non-Commercial License) |

Where it fits in the FLUX lineup

| Model | Parameters | Characteristics |
|---|---|---|
| FLUX.1 [pro/dev] | 12B | Flagship |
| FLUX.1 [schnell] | 12B (distilled) | Fast but less diverse output |
| FLUX.2 [klein] | 9B | Lightweight, no distillation |
| FLUX.2 [klein] 4B | 4B | Even lighter |

The key thing about klein is “lightweight without distillation.” schnell achieved speed through distillation, but sacrificed output diversity. klein reduces parameter count without distillation, keeping output diversity while cutting weight.


Does it run on Apple Silicon

Short answer: it runs, but it’s not practical

Expected results on M1 Max (64GB unified memory):

| Item | Status |
|---|---|
| Memory | 64GB clears the 29GB requirement |
| MPS support | diffusers MPS support is unstable |
| FP8 quantization | Not supported |
| Estimated speed | 3-5 minutes for 1024x1024 |

A task that takes 12 seconds on an RTX 4090 takes over 3 minutes here.

Generation time comparison

| Chip | 1024x1024 generation time |
|---|---|
| RTX 4090 | 12-18 seconds |
| M4 Max | 85 seconds |
| M3 Max | 105 seconds |
| M2 Max | 145 seconds |
| M1 Max | Estimated 180-240 seconds |

Why the massive performance gap

The RTX 4090 has 24GB of VRAM, less than the model's 29GB requirement, yet with CPU offloading it is still dramatically faster than an M1 Max with 64GB of unified memory. VRAM capacity is a pass/fail thing: once you have enough (or can offload the overflow), extra capacity doesn't make generation faster. The actual speed gap comes from the factors below.

1. Memory bandwidth (the biggest factor)

| GPU/Chip | Memory bandwidth |
|---|---|
| RTX 4090 | 1,008 GB/s |
| M4 Max | 546 GB/s |
| M1 Max | 400 GB/s |

Transformer inference is memory-bandwidth-bound: the bottleneck is how fast the weights can be read from memory, not raw compute. The RTX 4090 has about 2.5x the bandwidth of the M1 Max, so bandwidth alone accounts for a 2.5x gap before any other factor.
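To see why bandwidth dominates, a back-of-the-envelope sketch: each denoising step has to read all the weights once, so a hard floor on step time is weight bytes divided by bandwidth. The step count below is an illustrative assumption (it varies by scheduler):

```python
# Lower bound on per-step time for a memory-bandwidth-bound transformer:
# every denoising step streams all the weights through memory once.

def min_step_time_s(params: float, bytes_per_param: int, bandwidth_gbs: float) -> float:
    """Seconds needed just to read the weights once."""
    weight_gb = params * bytes_per_param / 1e9
    return weight_gb / bandwidth_gbs

PARAMS = 9e9   # FLUX.2 klein: 9B parameters
STEPS = 28     # assumed step count; varies by scheduler

for chip, bw in [("RTX 4090", 1008), ("M4 Max", 546), ("M1 Max", 400)]:
    step = min_step_time_s(PARAMS, 2, bw)  # 2 bytes/param = FP16/BF16
    print(f"{chip}: {step * 1000:.0f} ms/step, >= {step * STEPS:.1f} s total")
```

These are theoretical floors (real times are several times higher once attention, activations, and API overhead enter), but the ratios between chips track the bandwidth ratios.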

2. FP8 quantization support

| Format | CUDA | MPS |
|---|---|---|
| FP16 | ✓ | ✓ |
| BF16 | ✓ | ✓ |
| FP8 | ✓ | ✗ |

FP8 quantization cuts VRAM usage in half and improves bandwidth efficiency. MPS doesn’t support FP8, so it can’t benefit from this.
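The savings are easy to quantify: halving bytes per parameter halves both the weight footprint and the bytes streamed per step. A quick sketch (weights only; the ~29GB total also includes the text encoder, VAE, and activations, so the full requirement doesn't drop exactly in half):

```python
# Weight footprint of a 9B-parameter model at different precisions.
PARAMS = 9e9

def weights_gb(bytes_per_param: float) -> float:
    """Size of the weights alone, in GB."""
    return PARAMS * bytes_per_param / 1e9

fp16 = weights_gb(2)  # FP16/BF16: 2 bytes per parameter
fp8 = weights_gb(1)   # FP8: 1 byte per parameter
print(f"FP16 weights: {fp16:.0f} GB, FP8 weights: {fp8:.0f} GB")
# FP8 also halves the bytes read per step, so a bandwidth-bound step
# gets roughly 2x faster on hardware that supports it (CUDA, not MPS).
```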

3. Handling VRAM overflow

When running a 29GB model on an RTX 4090 (24GB):

```python
pipe.enable_model_cpu_offload()  # offload layers not currently in use to system RAM
```
  • CUDA: GPU-CPU transfers are fast over PCIe 4.0 x16 (32GB/s)
  • MPS: Uses unified memory but Metal API overhead is significant
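To put numbers on the offload path: if roughly 5GB of the 29GB doesn't fit in the 4090's 24GB of VRAM, streaming those layers over PCIe 4.0 x16 costs on the order of 150 ms per pass. The overflow size is an assumption, and real offloading moves whole submodules and overlaps transfers with compute, but the order of magnitude is the point:

```python
# Rough cost of streaming offloaded layers back over the bus each pass.

def transfer_time_s(overflow_gb: float, bus_gbs: float) -> float:
    """Seconds to move the overflow across the bus once."""
    return overflow_gb / bus_gbs

overflow = 29 - 24  # GB that don't fit in 24GB of VRAM (assumed split)
pcie4_x16 = 32      # GB/s, PCIe 4.0 x16
print(f"{transfer_time_s(overflow, pcie4_x16) * 1000:.0f} ms per pass")
# ~156 ms, small next to a multi-second generation, which is why
# CPU offload barely hurts on CUDA.
```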

4. CUDA kernel optimization

NVIDIA has spent over a decade optimizing kernels for Transformers. Flash Attention, cuBLAS, TensorRT — all these kick in.

MPS is a relatively new API and its optimizations haven’t caught up yet.
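These factors compound multiplicatively. The per-factor numbers below are rough illustrative guesses on my part, not benchmarks, but they show how a roughly 15x end-to-end gap (12 seconds vs 180 seconds) can emerge from individually modest differences:

```python
# Illustrative decomposition of the RTX 4090 vs M1 Max gap.
# The FP8 and kernel factors are rough assumptions, not measurements.
factors = {
    "memory bandwidth (1008 vs 400 GB/s)": 1008 / 400,  # ~2.5x
    "FP8 on CUDA vs FP16 on MPS": 2.0,                  # half the bytes per step
    "kernel maturity (Flash Attention, cuBLAS, ...)": 3.0,
}

total = 1.0
for name, factor in factors.items():
    total *= factor
print(f"combined: ~{total:.0f}x")  # ~15x, in line with 12 s vs 180 s
```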


Options for Apple Silicon users

Option 1: mflux (MLX implementation)

An MLX-based implementation optimized for Apple Silicon.

Option 2: flux2.c (pure C implementation)

A pure C implementation by antirez (creator of Redis). Targets the 4B version.

Option 3: Use a different model

If you’re not set on the 9B model:

  • FLUX.2 Klein 4B: Half the size
  • Z-Image: 6B, lightweight, Apache 2.0 license. I wrote a comparison article too

Option 4: Cloud GPUs

Rent an RTX 4090 by the hour on RunPod, Vast.ai, Lambda Labs, etc. If you’re serious about using this, this is the realistic option.