Can FLUX.2 Klein NSFW LoRAs actually run on an M1 Max?
In the previous experiment running FLUX.2 Klein 4B on M1 Max, NSFW prompts got softened somewhere in the pipeline.
Photorealistic outputs were deflected into crops or cloth-like textures; anime-style outputs got partway there but never all the way.
The model clearly has a safety-leaning bias baked in.
I then found NSFW LoRAs for FLUX.2 Klein 9B on Hugging Face.
The question: can the current local environment actually run them? The LoRA targets Klein 9B, so dropping it onto a 4B setup is not a safe combination.
And with mflux currently running only 4B, the model dimensions don't match even though both are in the FLUX.2 Klein family.
The available LoRA targets Klein 9B
The one I found is diroverflo/FLux_Klein_9B_NSFW.
The model card describes it as a LoRA trained specifically for FLUX.2 Klein 9B.
The relevant detail here is the branching within FLUX.2 Klein.
Black Forest Labs’ official repository lists Klein 4B, 9B, 9B KV, and Base variants.
The rough positioning: Klein 4B for realtime/local GPU, Klein Base or FLUX.2 Dev for LoRA training and flexibility.
So while “FLUX.2 Klein LoRA” sounds like one thing, 4B / 9B / Base / distilled are not necessarily cross-compatible.
RunComfy’s FLUX.2 Dev LoRA workflow treats LoRAs for FLUX.1, FLUX.2 Dev, and Klein variants as incompatible by default.
Applying a LoRA to the wrong base results in lora key not loaded, shape mismatches, or — worse — silent failure where the LoRA simply has no effect.
The local setup is 4B mflux
What’s actually been benchmarked locally: flux2-klein-4b on M1 Max 64GB.
mflux 0.17.5 generates 1024×1024 in ~30s, iris.c in ~40s.
The FastAPI wrapper I built recently also uses mflux-generate-flux2 as a subprocess rather than routing through ComfyUI for FLUX.2.
The advantage of this setup is lightness — Python + MLX, no ComfyUI workflow diffs or custom nodes to track.
But since the LoRA targets 9B, it’s not something to test within the current 4B mflux path.
Testing via mflux requires 9B first
mflux supports external LoRA application.
When I added the Lightning LoRA in the Qwen Image Edit experiment, quality visibly changed via --lora-paths.
The Klein 4B article confirmed that mflux has LoRA support while iris.c does not.
The real question isn’t “can mflux load a LoRA” but “can Klein 9B run on mflux at all within reason.”
Klein 4B is benchmarked; 9B is heavier in both memory and time.
M1 Max 64GB probably has enough memory, but the 30s/image of 4B is not the baseline to expect.
Forcing a 9B LoRA onto 4B and concluding "it doesn't work" would be poor failure isolation.
The proper sequence: first confirm flux2-klein-9b alone completes generation on mflux, then compare with/without LoRA at the same seed and prompt, adjusting --lora-scales to verify the intended effect.
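As a sketch, here is that sequence scripted against the mflux CLI. The flags mirror the mflux-generate-flux2 invocation shown near the end of this post; the --output flag and the file names are assumptions for illustration:

```python
import subprocess

def run(lora_args: list[str], out: str) -> None:
    # One generation via the mflux FLUX.2 CLI; fixed seed and prompt for A/B.
    subprocess.run(
        ["mflux-generate-flux2",
         "--model", "flux2-klein-9b",
         "--prompt", "a portrait photo of a woman",
         "--seed", "42", "--steps", "30",
         "--width", "1024", "--height", "1024",
         "--output", out,
         *lora_args],
        check=True,
    )

# Step 1: confirm bare 9B generation completes at all.
run([], "baseline.png")

# Step 2: same seed and prompt, sweeping --lora-scales to see the effect ramp.
for scale in (0.5, 0.7, 1.0):
    run(["--lora-paths", "./flux_klein_9b_nsfw.safetensors",
         "--lora-scales", str(scale)],
        f"lora_{scale}.png")
```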
ComfyUI brings a different kind of LoRA path trap
ComfyUI has templates for FLUX.2 Dev and Klein.
The official tutorial shows a workflow-loading approach that assumes the latest ComfyUI.
But FLUX.2 LoRA application isn't always just Load LoRA → KSampler.
RunComfy’s workflow applies LoRAs through a Flux2Pipeline-aware path.
Attempting to layer a generic LoRA loader onto FP8 weights can hit mul_cuda unimplemented for Float8, or LoRA key mismatches that produce garbled output.
On Mac the problems are of the same kind.
CUDA-specific errors don’t appear, but MPS lacks FP8 support, and BF16/FP16 handling can produce black images or sudden slowdowns.
As seen in the Qwen Image Edit article, ComfyUI + MPS breaks in novel ways with each model or dtype update.
mflux is the stable path locally.
If testing 9B LoRA via ComfyUI, use a dedicated workflow or nodes that explicitly declare Klein 9B LoRA support.
Verification path matrix
The LoRA can’t be added to the current Klein 4B workflow as-is.
Using it requires preparing Klein 9B and confirming bare generation works via mflux or ComfyUI first.
| Path | Verdict | Reason |
|---|---|---|
| mflux + Klein 4B | Not viable | LoRA targets 9B. Size mismatch likely produces no effect or errors |
| mflux + Klein 9B | Needs testing | LoRA path exists but 9B speed/memory untested |
| iris.c + Klein 4B | Impossible | No LoRA application path in iris.c |
| ComfyUI + Klein 9B | Needs testing | Depends on workflow. Generic LoRA loader alone is unreliable |
| RunPod + NVIDIA GPU | Cleanest isolation | 9B + LoRA load straightforwardly; avoids Mac MPS issues |
The practical next step: rather than grinding against M1 Max MPS issues, verify Klein 9B + LoRA effectiveness on RunPod or a local NVIDIA machine first.
Once the LoRA’s effect is confirmed, bring it to mflux on Apple Silicon.
Reversing the order means model compatibility, LoRA format, and MPS performance all tangle at once, making failure diagnosis impossible.
Speed impact of LoRA application
The Klein 4B article confirmed sub-30s generation at 1024×1024 with mflux.
Fast enough that iteration doesn’t feel painful.
With a LoRA loaded, mflux’s --lora-paths applies weights on-the-fly during the inference loop.
For low-rank LoRAs (rank 4–16), the added compute is marginal.
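Back-of-envelope on why low rank stays cheap, assuming a hidden size around 3072 for Klein-class DiT blocks (a guess, not a published figure):

```python
# Extra multiply-adds a LoRA adds to one adapted matmul, per token.
d_model, rank = 3072, 16                  # d_model is an assumed hidden size
base_flops = d_model * d_model            # the original weight matmul
lora_flops = rank * (d_model + d_model)   # down-projection + up-projection
print(f"rank {rank}: +{lora_flops / base_flops:.1%} FLOPs")    # ≈ +1.0%
print(f"rank 64: +{64 * 2 * d_model / base_flops:.1%} FLOPs")  # ≈ +4.2%
```

Measured overhead runs higher than the raw FLOP ratio, since weight patching, kernel launches, and memory traffic all add their share.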
When Lightning LoRA was added in the previous experiment, the slowdown was roughly 5–10%.
But that was 4B. If 9B base inference takes 60–90s, adding 5–10% puts it at 65–100s.
Clearly slower than 4B’s 30s, but under 2 minutes per image is still local-verification territory.
The concern is high-rank LoRAs (rank 64+) or stacking multiple LoRAs simultaneously.
The extra matrix multiplications become visible in generation time, and that doesn't suit a rapid-iteration workflow.
This particular NSFW LoRA doesn't declare its rank on the model card, so it's unknown until the file is inspected (a quick way to check comes later in this post).
Does anime style actually come out of Flux?
The other open question is art style.
FLUX.2 Klein is a photorealism-oriented architecture. The previous experiment showed that anime prompts produce “something anime-like” but the line quality was unmistakably AI-generated — nowhere near WAI-Anima or Illustrious-class sharpness.
If this NSFW LoRA was trained on photorealistic nude data, anime compatibility is unlikely.
LoRAs trained on photorealistic skin textures produce chimera-like outputs when forced into anime style prompts.
If a separate anime-style NSFW LoRA emerges for Klein 9B, that would be the test for anime expression.
The practical fork:
- Photorealistic NSFW → Klein 9B + this LoRA, test directly.
- Anime NSFW → switch to ComfyUI + WAI-Anima checkpoint + anime-specific NSFW LoRA.
The latter path is already working, so only the photorealistic path needs new validation.
Why RunPod over local
M1 Max 64GB should meet Klein 9B’s memory requirements.
But if inference takes 60–90s per image, tuning LoRA scales and iterating on prompts gets painful.
50 images means 80+ minutes. Not a knowledge or setup problem — pure iteration speed.
On top of that, Mac + MPS dtype issues.
BF16 FLUX models in ComfyUI on MPS tend to produce black images or FP8-unsupported errors.
You can’t tell whether the LoRA isn’t working or whether it’s an MPS issue.
RunPod with CUDA lets you confirm LoRA effectiveness cleanly, separating it from Mac porting concerns.
RTX 4090 (24GB VRAM) is the cheapest option that can probably load Klein 9B.
RunPod Community Cloud at ~$0.4/hr. One 2–3 hour session for ~$1.
Cheaper than spending half a day stuck on MPS issues.
If 24GB isn't enough, step up to A100 40GB (~$0.8/hr).
Klein 9B + LoRA verification on RunPod
RunPod has ComfyUI templates with FLUX model support — text encoders (T5-XXL and CLIP-L) often come preinstalled.
WebUI is available immediately after pod start; setup is just downloading the model and LoRA.
```bash
# Klein 9B base model
huggingface-cli download black-forest-labs/FLUX.2-Klein-9B \
  --local-dir /workspace/ComfyUI/models/unet/

# NSFW LoRA
huggingface-cli download diroverflo/FLux_Klein_9B_NSFW \
  --local-dir /workspace/ComfyUI/models/loras/
```
If the template doesn’t include encoders, that’s an extra 20GB+ download.
Check the template description for FLUX support before launching.
After downloads complete, load a Klein 9B workflow in ComfyUI. Generate one image without LoRA first.
Then generate with the same seed and prompt with LoRA enabled. Compare.
Sweep lora_scale from 0.5 → 0.7 → 1.0, checking whether NSFW expression strengthens progressively without color corruption or black outputs.
If CUDA produces correct output, the LoRA itself is functional.
Then loading the same LoRA in mflux + 9B and getting different results narrows the cause to MLX’s LoRA path or MPS dtype handling.
Conversely, if RunPod also shows weak effects, the issue is the LoRA’s rank or training data — no point trying Mac porting before finding a better LoRA or training one.
Generated images land in /workspace/ComfyUI/output/.
Retrieve via RunPod’s file browser or runpodctl send.
Checking the LoRA’s rank beforehand
Before RunPod verification, checking the LoRA file’s metadata for rank and alpha values helps set initial lora_scale.
safetensors headers typically contain network configuration.
```python
from safetensors import safe_open

# Read the header without materializing every tensor: metadata often
# carries the training config, and the first few key shapes reveal the rank.
with safe_open("flux_klein_9b_nsfw.safetensors", framework="pt") as f:
    print(f.metadata())                  # training metadata, if present
    for key in list(f.keys())[:5]:       # peek at the first few tensors
        print(key, f.get_tensor(key).shape)
```
Key names contain lora_down and lora_up; the lora_down shape is (rank, in_features).
- Rank 16 or below: scale 0.7–1.0 works cleanly.
- Rank 64+: 1.0 tends to corrupt colors; start at 0.3–0.5.
Since the model card doesn’t state the rank, checking this first prevents blind scale tuning.
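Putting that together, a small sketch that reads the rank directly. It assumes the common kohya-style lora_down key naming; files using lora_A/lora_B naming would need the filter adjusted:

```python
from safetensors import safe_open

# Infer the rank from the first lora_down tensor's leading dimension.
with safe_open("flux_klein_9b_nsfw.safetensors", framework="pt") as f:
    down_keys = [k for k in f.keys() if "lora_down" in k]
    if down_keys:
        print("rank =", f.get_tensor(down_keys[0]).shape[0])
    else:
        print("no lora_down keys found; different naming convention")
```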
Training your own Klein 9B LoRA
If the existing LoRA’s effect is weak or the art style doesn’t match, training your own is the fallback.
The two main FLUX.2 LoRA training tools are ostris’ ai-toolkit and kohya_ss’ sd-scripts.
Both have supported FLUX architecture since the FLUX.2 Dev era, and Klein 9B shares the same Transformer structure.
CUDA is mandatory for training.
Apple Silicon’s MLX has LLM LoRA training, but no image-generation LoRA training pipeline for DiT architectures like FLUX.2.
Training runs on RunPod or a local NVIDIA machine; the resulting safetensors file gets brought to mflux for inference.
VRAM
LoRA training consumes more VRAM than inference.
Reports suggest FLUX.2 Dev (12B-class) LoRA training with gradient checkpointing + adafactor barely fits RTX 4090’s 24GB.
Klein 9B is slightly smaller, but full fine-tuning is out of the question — LoRA is the only option.
| Configuration | Estimated VRAM | Notes |
|---|---|---|
| rank 16 + gradient checkpointing + adafactor | 20–22GB | Should fit RTX 4090 |
| rank 64 + gradient checkpointing | 24–28GB | A100 40GB is safe |
| QLoRA (4-bit quantized base + LoRA) | 16–20GB | Supported in ai-toolkit. Quality needs verification |
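A back-of-envelope check on the rank 16 row, where every number is a rough assumption:

```python
# Rough VRAM budget for rank-16 LoRA training on a 9B bf16 base.
params = 9e9
weights_gb = params * 2 / 1e9    # frozen base weights in bf16: ~18 GB
lora_gb = 0.2                    # rank-16 adapters, grads, adafactor state (guess)
activations_gb = 3               # with gradient checkpointing at 1024px (guess)
print(f"~{weights_gb + lora_gb + activations_gb:.0f} GB")  # ≈ 21 GB
```

That lands around 21GB, consistent with the 20–22GB estimate in the table.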
If rank 16 fits on RTX 4090, the same RunPod pod used for inference testing can also run training.
Even stepping up to A100 at ~$0.8/hr, a 2–3 hour training session costs ~$2.
ai-toolkit configuration
ai-toolkit uses YAML config files and runs with a single command.
Adapting a FLUX.2 config for Klein 9B:
```yaml
job: extension
config:
  name: "flux2_klein_9b_nsfw_lora"
  process:
    - type: sd_trainer
      training_folder: "output"
      device: cuda:0
      network:
        type: lora
        linear: 16
        linear_alpha: 16
      model:
        name_or_path: "black-forest-labs/FLUX.2-Klein-9B"
        is_flux: true
      train:
        batch_size: 1
        steps: 2000
        gradient_accumulation_steps: 1
        gradient_checkpointing: true
        optimizer: adafactor
        lr: 4e-4
      save:
        save_every_n_steps: 500
      sample:
        sample_every_n_steps: 250
        prompts:
          - "a portrait photo of a woman"
      datasets:
        - folder_path: "/data/nsfw_dataset"
          caption_ext: ".txt"
          resolution: 1024
```
save_every_n_steps saves checkpoints every 500 steps; sample_every_n_steps generates preview images for monitoring.
2000 steps is a starting point — with ~20 training images, convergence can happen around 1000.
If generated samples start looking like copies of training data, that’s overfitting; stop there.
sd-scripts offers more network architecture options (LoHA, LoKr) but has more CLI flags and steeper initial setup.
For Klein 9B-specific adjustments (distilled structure handling), ai-toolkit’s issue tracker tends to respond faster.
Without a strong preference, ai-toolkit is the quicker starting point.
Training data quality
For NSFW LoRAs, style consistency matters more than quantity.
20–30 images with consistent poses, lighting, and art style produce more stable results than 100 mixed images.
Captions: auto-tag with WD Tagger or Florence, then manually prepend a trigger word.
For example, if the trigger word is nsfw_v1, prepending "nsfw_v1, " to prompts at inference time activates the LoRA.
Without a trigger word, the LoRA’s effect spreads thinly across all prompts with no clean ON/OFF switch.
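A minimal sketch of that prepend step, using the dataset folder from the config above and the hypothetical nsfw_v1 trigger:

```python
from pathlib import Path

TRIGGER = "nsfw_v1"  # hypothetical trigger word from the example above

# Prepend the trigger to every auto-generated caption, skipping files
# that already start with it (safe to re-run).
for caption in Path("/data/nsfw_dataset").glob("*.txt"):
    text = caption.read_text().strip()
    if not text.startswith(TRIGGER):
        caption.write_text(f"{TRIGGER}, {text}\n")
```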
Mixing photorealistic and anime data produces middling results.
Same issue as the art style section above — photorealistic NSFW should train on photorealistic data only, anime NSFW on anime only.
Monitoring training progress
ai-toolkit outputs sample images at sample_every_n_steps intervals.
Two things to check: “is it broken” and “is the LoRA effect showing.”
At 250 steps, output barely differs from the base model.
Between 500–1000 steps, training data characteristics start appearing.
If poses or compositions not present in the prompt start leaking in, that’s overfitting — the checkpoint around that point is the one to use.
Loss values aren’t very reliable for FLUX LoRA training.
DiT architecture loss fluctuates more than UNet-based models; judging convergence by numbers alone can miss overfitting.
Visual inspection of sample outputs is the most trustworthy signal.
Typical overfitting symptoms: samples converge toward specific training images, faces distort, backgrounds melt, or color palette narrows to one tone.
When this happens, roll back to an earlier checkpoint.
ai-toolkit’s save_every_n_steps keeps multiple checkpoints, so pick the best one for downstream verification.
Comparing trained LoRAs
Load the safetensors into ComfyUI on RunPod for inference evaluation.
Same seed, same prompt — compare images with and without LoRA.
If there’s barely any difference, either training steps were insufficient, data quality was poor, or scale is too low.
Sweep lora_scale at 0.3, 0.5, 0.7, 1.0 and observe where NSFW expression changes.
If 1.0 produces color corruption or face distortion, scale is too high relative to rank.
Either use 0.5–0.7 for production or retrain at a lower rank.
Also test prompts that differ from training data.
Whether the effect generalizes to unseen poses and scenes indicates practical usability.
A LoRA that only works for poses in its training set requires per-prompt trial and error — not practical.
Bringing to mflux
Once LoRA effectiveness is confirmed on RunPod, pull the safetensors locally.
```bash
runpodctl receive <file-id>
```
mflux just needs --lora-paths pointing to the safetensors.
```bash
mflux-generate-flux2 \
  --model flux2-klein-9b \
  --lora-paths ./flux2_klein_9b_nsfw_lora.safetensors \
  --lora-scales 0.7 \
  --prompt "..." \
  --seed 42 \
  --steps 30 \
  --width 1024 --height 1024
```
Compare output with RunPod using the same seed and prompt.
If results match, run it on M1 Max going forward.
If results differ, the cause is limited to MLX’s LoRA application path or MPS dtype handling.
The previous Klein 4B experiment showed Lightning LoRA adding only 5–10% overhead, so measure whether 9B sees similar.
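A small timing harness for that measurement. The helper is illustrative; it reuses the CLI flags above and times the whole process, model load included:

```python
import subprocess
import time

def timed_run(extra_args: list[str]) -> float:
    # Time one full CLI invocation; model load is part of the number,
    # so compare runs like-for-like rather than reading it as pure inference.
    cmd = [
        "mflux-generate-flux2",
        "--model", "flux2-klein-9b",
        "--prompt", "a portrait photo of a woman",
        "--seed", "42", "--steps", "30",
        "--width", "1024", "--height", "1024",
    ]
    start = time.perf_counter()
    subprocess.run(cmd + extra_args, check=True)
    return time.perf_counter() - start

baseline = timed_run([])
with_lora = timed_run([
    "--lora-paths", "./flux2_klein_9b_nsfw_lora.safetensors",
    "--lora-scales", "0.7",
])
print(f"baseline {baseline:.1f}s, LoRA {with_lora:.1f}s "
      f"({with_lora / baseline - 1:+.1%})")
```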