Running FLUX.2 Klein 9B on Apple Silicon Macs
What is FLUX.2 Klein
An image generation model released by Black Forest Labs (founded by members of the Stable Diffusion team). It’s a new line in the FLUX series.
- Official page: FLUX.2-klein-base-9B - Hugging Face
Specs
| Item | Details |
|---|---|
| Parameters | 9 billion (9B) |
| Architecture | Rectified Flow Transformer |
| VRAM requirement | ~29GB |
| Inference speed | 12-18 seconds for 1024x1024 on an RTX 4090 |
| License | Non-commercial (FLUX Non-Commercial License) |
Where it fits in the FLUX lineup
| Model | Parameters | Characteristics |
|---|---|---|
| FLUX.1 [pro/dev] | 12B | Flagship |
| FLUX.1 [schnell] | 12B (distilled) | Fast but less diverse output |
| FLUX.2 [klein] | 9B | Lightweight, no distillation |
| FLUX.2 [klein] 4B | 4B | Even lighter |
The key thing about klein is “lightweight without distillation.” schnell achieved speed through distillation, but sacrificed output diversity. klein reduces parameter count without distillation, keeping output diversity while cutting weight.
Does it run on Apple Silicon
Short answer: it runs, but it’s not practical
Expected results on M1 Max (64GB unified memory):
| Item | Status |
|---|---|
| Memory | 64GB clears the 29GB requirement |
| MPS support | diffusers MPS support is unstable |
| FP8 quantization | Not supported |
| Estimated speed | 3-5 minutes for 1024x1024 |
A task that takes 12-18 seconds on an RTX 4090 takes 3 minutes or more here.
Generation time comparison
| Chip | 1024x1024 generation time |
|---|---|
| RTX 4090 | 12-18 seconds |
| M4 Max | 85 seconds |
| M3 Max | 105 seconds |
| M2 Max | 145 seconds |
| M1 Max | Estimated 180-240 seconds |
Why the massive performance gap
The RTX 4090 has only 24GB of VRAM, less than the model's 29GB footprint, while the M1 Max has 64GB of unified memory. Yet the 4090 is dramatically faster. Memory capacity is essentially pass/fail: once the model fits (or can be made to fit via offloading), having more memory doesn't make generation any faster.
1. Memory bandwidth (the biggest factor)
| GPU/Chip | Memory bandwidth |
|---|---|
| RTX 4090 | 1,008 GB/s |
| M4 Max | 546 GB/s |
| M1 Max | 400 GB/s |
Transformer inference is memory-bandwidth-bound: the bottleneck is how fast the weights can be read from memory. The RTX 4090 has 2.5x the bandwidth of the M1 Max, which by itself accounts for roughly a 2.5x speedup; the rest of the gap comes from the factors below.
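A quick way to see the bandwidth bound: every denoising step has to stream all the weights from memory at least once. The sketch below estimates that floor; the step count (28) and BF16 weights (2 bytes per parameter) are assumptions for illustration, not official FLUX.2 Klein settings, and activations and caching are ignored.

```python
# Back-of-the-envelope lower bound for a memory-bandwidth-bound transformer:
# each denoising step must stream every weight from memory once.
# ASSUMPTIONS: BF16 weights (2 bytes/param), 28 denoising steps.

PARAMS = 9e9          # 9B parameters
BYTES_PER_PARAM = 2   # BF16
STEPS = 28            # assumed number of denoising steps

def min_seconds(bandwidth_gb_s: float) -> float:
    """Time to stream the weights once per step at the given bandwidth."""
    bytes_per_step = PARAMS * BYTES_PER_PARAM
    return bytes_per_step * STEPS / (bandwidth_gb_s * 1e9)

for name, bw in [("RTX 4090", 1008), ("M4 Max", 546), ("M1 Max", 400)]:
    print(f"{name}: >= {min_seconds(bw):.2f} s")
```

Note that these lower bounds (roughly 0.5 s vs 1.3 s) sit far below the measured times in the table above, which is exactly the point of factors 2-4: on MPS, quantization and kernel overheads push the real numbers much further from the bandwidth floor than on CUDA.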
2. FP8 quantization support
| Format | CUDA | MPS |
|---|---|---|
| FP16 | ✓ | ✓ |
| BF16 | ✓ | ✓ |
| FP8 | ✓ | ✗ |
FP8 quantization cuts VRAM usage in half and improves bandwidth efficiency. MPS doesn’t support FP8, so it can’t benefit from this.
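The arithmetic behind that claim: the weights of a 9B model take about 18GB at FP16/BF16 and 9GB at FP8 (the ~29GB total in the specs also covers activations, the text encoder, and so on). Halving the bytes per weight also halves the bytes streamed per step, so a bandwidth-bound model gets close to a 2x speedup for free.

```python
# Weight memory footprint of a 9B-parameter model at different precisions.
# Weights only; the ~29GB total requirement includes activations etc.
PARAMS = 9e9

def weight_gb(bytes_per_param: float) -> float:
    return PARAMS * bytes_per_param / 1e9

print(f"FP16/BF16: {weight_gb(2):.0f} GB")  # 18 GB
print(f"FP8:       {weight_gb(1):.0f} GB")  # 9 GB
```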
3. Handling VRAM overflow
When running a 29GB model on an RTX 4090 (24GB):
```python
pipe.enable_model_cpu_offload()  # offload layers not currently in use to system RAM
```
- CUDA: GPU-CPU transfers are fast over PCIe 4.0 x16 (32GB/s)
- MPS: Uses unified memory but Metal API overhead is significant
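To get a feel for why offloading barely hurts on CUDA, here's a rough estimate of the transfer cost for the overflowing layers, assuming the ~5GB that doesn't fit in 24GB of VRAM has to cross PCIe 4.0 x16 once per pass (the 32 GB/s figure is the nominal link bandwidth from the text above):

```python
# Rough cost of shuttling the overflowing layers over PCIe 4.0 x16.
# ASSUMPTION: a 29 GB model on a 24 GB card, so ~5 GB lives in system RAM
# and is streamed across the bus once per denoising pass.
MODEL_GB = 29
VRAM_GB = 24
PCIE_GB_S = 32  # PCIe 4.0 x16 nominal bandwidth

overflow_gb = MODEL_GB - VRAM_GB
transfer_s = overflow_gb / PCIE_GB_S
print(f"~{transfer_s:.2f} s to stream {overflow_gb} GB per pass")
```

At ~0.16 s per pass, the transfer overhead is small next to the per-step compute, which is why the 4090 stays fast even though the model doesn't fully fit in VRAM.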
4. CUDA kernel optimization
NVIDIA has spent over a decade optimizing kernels for Transformers. Flash Attention, cuBLAS, TensorRT — all these kick in.
MPS is a relatively new API and its optimizations haven’t caught up yet.
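PyTorch surfaces these fused kernels through `torch.nn.functional.scaled_dot_product_attention`, which dispatches to Flash Attention on recent CUDA GPUs but falls back to a generic math implementation elsewhere (including MPS). The API is identical either way; only the throughput differs:

```python
import torch
import torch.nn.functional as F

# scaled_dot_product_attention picks the fastest available backend:
# Flash Attention on supported CUDA GPUs, a slower generic fallback
# on CPU/MPS. Same call, very different throughput.
q = torch.randn(1, 8, 128, 64)  # (batch, heads, seq_len, head_dim)
k = torch.randn(1, 8, 128, 64)
v = torch.randn(1, 8, 128, 64)

out = F.scaled_dot_product_attention(q, k, v)
print(out.shape)  # torch.Size([1, 8, 128, 64])
```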
Options for Apple Silicon users
Option 1: mflux (MLX implementation)
An MLX-based implementation optimized for Apple Silicon.
- Repository: filipstrand/mflux
- Check whether FLUX.2 Klein is supported
Option 2: flux2.c (pure C implementation)
A pure C implementation by antirez (creator of Redis). Targets the 4B version.
- Repository: antirez/flux2.c
- Running `make mps` enables MPS acceleration
Option 3: Use a different model
If you’re not set on the 9B model:
- FLUX.2 Klein 4B: Half the size
- Z-Image: 6B, lightweight, Apache 2.0 license. I wrote a comparison article too
Option 4: Cloud GPUs
Rent an RTX 4090 by the hour on RunPod, Vast.ai, Lambda Labs, etc. If you’re serious about using this, this is the realistic option.