Laxhar's SenseNova U1 LoRA trainer: bf16 on 32GB GPU, ~20GB peak VRAM

Laxhar/sensenova-u1-lora-trainer showed up on Hugging Face. It is a single-GPU LoRA trainer for SenseNova-U1-8B-MoT, and getting started takes basically one YAML config.

The README targets 32GB GPUs with bf16 training, which is not exactly “light” for an 8B model. The model card lists RTX 5090, A100 40GB, and RTX 6000 Ada as reference hardware: 32GB+ CUDA 12 cards, training on 2048px data at ~20GB peak VRAM. 24GB cards can handle sample generation, but the default bucket settings exclude them from training.

As of June 9, 2026, there are a few relevant pieces besides the Laxhar trainer: SenseNova’s official 8B-MoT model and 8-step LoRA, plus a full-parameter fine-tuning training/ directory that was recently added to the OpenSenseNova repo. The official training/ code matters, but its baseline config assumes 8x 80GB GPUs — a different world from “burn a small style LoRA on one 32GB card.”

I recently worked with AnimaLoraToolkit for WAI-Anima LoRA training, training a 2B DiT + Qwen3 TE LoRA on RunPod. This U1 trainer targets a different model entirely. It’s less about producing a LoRA that ComfyUI can load and more about outputting upstream-format safetensors that plug directly into the official examples/t2i/inference.py.

bf16 offload, not 4-bit quantization

The README and docs/SETUP.md make the weight-precision story clear. The author tried LoRA training on 4-bit nf4 and 8-bit base weights, but the gen tower output showed grid patterns, scan lines, and limb artifacts. The base weights went back to bf16.

Instead, heavy components get split between CPU and GPU. The prefix tower handles prompt and image-condition encoding; the gen tower runs the flow-matching diffusion loop. Rather than keeping both on GPU, the prefix tower loads onto GPU for a one-shot prefix-forward pass, builds a static prefix-KV cache, then moves back to CPU.

During training steps, only the gen tower stays on GPU. SETUP reports ~3.3GB for the prefix-KV cache (56 samples), and with bf16 base, bf16 LoRA, paged AdamW8bit, and partial gradient checkpointing, peak VRAM on a 32GB card lands around 20GB.

Because low-bit training degraded U1’s image generation quality, the design keeps weights at bf16 and shaves VRAM through tower offload and caching.

flowchart TD
    A["Images + captions"] --> B["Move prefix tower to GPU<br/>prefix-forward"]
    E["bf16 base weights<br/>kept on CPU"] -.->|offload| B
    B --> C["static prefix-KV cache<br/>56 samples, ~3.3GB"]
    B --> D["Move prefix tower back to CPU"]
    E -.->|gen tower| F["Gen tower pinned on GPU<br/>flow-matching training"]
    C --> F
    F --> G["LoRA attn+mlp<br/>+ partial full fine-tune"]
    G --> H["trainable_state.safetensors"]

    style F fill:#1d4ed8,color:#fff
    style C fill:#065f46,color:#fff

The prefix side runs once and stays cached; the training loop keeps only the gen tower and trainable surface on GPU. This is tower-level offloading, not base-weight quantization.
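To get a feel for how the cache figure scales, here is a back-of-the-envelope sketch based on the ~3.3GB for 56 samples that SETUP reports. It assumes the cache grows linearly with sample count and that all samples cost about the same, which will not hold exactly when buckets mix resolutions. The function names and the 8GB budget are illustrative, not part of the trainer.

```python
def per_sample_cache_gb(total_cache_gb: float, num_samples: int) -> float:
    """Average prefix-KV cache size per sample, in GB."""
    if num_samples <= 0:
        raise ValueError("num_samples must be positive")
    return total_cache_gb / num_samples


def max_cached_samples(budget_gb: float, per_sample_gb: float) -> int:
    """Samples that fit in a cache budget, assuming linear scaling (an assumption)."""
    return int(budget_gb // per_sample_gb)


# Figures quoted from SETUP: ~3.3 GB of prefix-KV cache for 56 samples.
per_sample = per_sample_cache_gb(3.3, 56)
print(f"{per_sample * 1000:.0f} MB per sample")

# Hypothetical budget, only to show how the estimate would be used.
print(max_cached_samples(8.0, per_sample), "samples in an 8 GB cache budget")
```

At roughly 59MB per sample, the cache is a small slice of the ~20GB peak for datasets in the tens of images, but it is worth re-estimating before pushing toward the upper end of the suggested dataset sizes.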

default.yaml: aimed at small style datasets

The model card calls the default config a “small-data style baseline.” configs/default.yaml combines x0 loss, uniform timestep sampling, no condition dropout, short captions, and attn+mlp LoRA, plus direct (non-LoRA) fine-tuning of the timestep/noise embedder, the gen vision bridge, and fm_head.

Data layout is straightforward: images with same-name .txt caption files. SETUP suggests 16–256 images for style transfer. Captions describe image content, and the trainer prepends a style.trigger token.
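The folder layout is easy to sanity-check before training. The sketch below is not the trainer's loader: collect_pairs and with_trigger are hypothetical helpers, the extension list is a guess, and the ", " separator between trigger and caption is an assumption about how the trainer joins them.

```python
from pathlib import Path

IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".webp"}  # assumed set of extensions


def collect_pairs(root: Path) -> list[tuple[Path, str]]:
    """Return (image_path, caption) for every image with a same-name .txt file."""
    pairs = []
    for image in sorted(root.iterdir()):
        if image.suffix.lower() not in IMAGE_EXTS:
            continue
        caption_file = image.with_suffix(".txt")
        if not caption_file.is_file():
            continue  # skip uncaptioned images rather than inventing a caption
        pairs.append((image, caption_file.read_text(encoding="utf-8").strip()))
    return pairs


def with_trigger(caption: str, trigger: str) -> str:
    """Prepend the style trigger; the ', ' separator is an assumption."""
    return f"{trigger}, {caption}" if caption else trigger
```

Running collect_pairs over the dataset folder and comparing the count against the number of images is a quick way to catch missing or misnamed caption files.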

The README also documents Parquet/Arrow shards, but positions them for datasets of roughly 10k images or more. For a first style LoRA, flat folders with .txt files are simpler. Think labels via ---think--- delimiters are supported but off by default (use_think_labels: false), because low-quality reasoning text can dominate the prefix and break style binding.

When I trained character LoRAs for WAI-Anima, mixing Danbooru tags with natural-language captions was a real problem. In the caption rework article, I rewrote captions to feed directional info to the Qwen3 TE. The U1 trainer’s approach is similar: separate the style trigger from per-image content captions instead of baking style into every caption.

This isn’t a polished GUI tool, though. The install process pins OpenSenseNova/SenseNova-U1 modeling code to commit df86ca90 and copies it into the Hugging Face snapshot. Since it uses trust_remote_code=True, the trainer verifies modeling files by sha256 before overwriting.
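The verify-then-overwrite idea can be sketched with the standard library. This is a generic illustration, not the trainer's code: replace_if_verified is a hypothetical name, and it assumes the check compares the incoming file against a pinned digest before copying.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Stream a file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def replace_if_verified(src: Path, dst: Path, expected_sha256: str) -> None:
    """Copy src over dst only when src matches the pinned hash."""
    actual = sha256_of(src)
    if actual != expected_sha256:
        raise ValueError(f"{src.name}: sha256 mismatch ({actual})")
    shutil.copyfile(src, dst)
```

With trust_remote_code=True, whatever sits in the snapshot gets executed, so failing loudly on a mismatch is the safer default than copying anyway.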

One confusing naming detail: the LoRA parser’s preset named default targets the same coverage as the official 8-step LoRA, but configs/default.yaml actually uses the attn_mlp_no_head preset. In that config, fm_head is fully fine-tuned rather than given a LoRA.

Config/Preset	Actual meaning
`configs/default.yaml`	small-data style baseline; LoRA on attn+mlp, `fm_head` etc. full fine-tuned
`lora.preset: default`	coverage name that LoRA-ifies attn+mlp+fm_head
`official_r128`	r128 coverage matching official 8-step LoRA shape

The saved trainable_state.safetensors isn’t necessarily a drop-in LoRA for any UI. On top of lora_down/lora_up/.alpha upstream LoRA keys, the default config includes plain full-fine-tuned parameters. That works fine for the official examples/t2i/inference.py pipeline, but if a ComfyUI node or another UI expects pure LoRA triples, check the key names and loader compatibility first.
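A quick way to see what a checkpoint contains is to split its key names into LoRA entries and everything else. The sketch below works on a plain list of key strings. The matching rule (substring lora_down / lora_up, suffix .alpha) and the example key names are assumptions based on the description above, not the trainer's documented naming.

```python
LORA_MARKERS = ("lora_down", "lora_up")  # plus a per-module ".alpha" entry


def split_keys(keys):
    """Partition state-dict key names into LoRA entries and plain parameters."""
    lora, plain = [], []
    for key in keys:
        is_lora = any(m in key for m in LORA_MARKERS) or key.endswith(".alpha")
        (lora if is_lora else plain).append(key)
    return lora, plain


# Hypothetical key names, for illustration only.
example_keys = [
    "blocks.0.attn.q_proj.lora_down.weight",
    "blocks.0.attn.q_proj.lora_up.weight",
    "blocks.0.attn.q_proj.alpha",
    "fm_head.weight",
]
lora_keys, plain_keys = split_keys(example_keys)
print(len(lora_keys), "LoRA entries;", len(plain_keys), "plain parameters")
```

For a real file, the key names can be listed with safe_open from the safetensors package and passed to split_keys. A non-empty plain list is the signal that a pure-LoRA loader will probably ignore or reject part of the checkpoint.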

Separate from the official training code

OpenSenseNova/SenseNova-U1 added full-parameter fine-tuning in training/ on May 21, 2026. This makes Laxhar’s README note about “upstream training code not yet released” look outdated.

The official training/README.md minimum config is heavy, though. shell/train_u1/8B.sh assumes 8x 80GB GPUs; A3B.sh assumes 16x 80GB GPUs. It handles mixed tasks — text-to-image, image editing, interleaved generation, OCR/VQA — and converts checkpoints from internevo shards to Hugging Face format.

The official sample dataset exists, but it’s a smoke test for dataloaders and loss code paths, not a training corpus. The official training code release matters, but it doesn’t directly replace the “burn a small style LoRA on one 32GB GPU” workflow. Comparing the two actually clarifies Laxhar’s niche: a single-GPU LoRA / partial-finetune tool that outputs weights in the official inference format, not a distributed full-fine-tuning system.

Stacking on top of the official 8-step LoRA

SenseNova released the official SenseNova-U1-8B-MoT-LoRA-8step-V1.0 on May 6, 2026. The base model card recommends running it at 8 steps (num_steps=8) with cfg_scale=1.0.

Laxhar’s trainer includes a config to stack a custom style LoRA on top of this official 8-step LoRA. configs/stack_8step.yaml bakes the official 8-step delta into the bf16 base at training time, skips fm_head, then trains the custom LoRA on top. At sampling time, it passes the same upstream LoRA with 8 steps, CFG 1.0, timestep shift 3.0.
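“Baking” a LoRA delta into a base weight usually means a one-time merge of the low-rank product into the dense matrix. The toy example below shows that arithmetic on nested lists. It assumes the common alpha / rank scaling convention and says nothing about how stack_8step.yaml handles dtype, module selection, or the skipped fm_head.

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def bake_lora(weight, lora_up, lora_down, alpha):
    """Return weight + (alpha / rank) * lora_up @ lora_down."""
    rank = len(lora_down)  # lora_down has shape (rank, in_features)
    scale = alpha / rank   # assumed scaling convention
    delta = matmul(lora_up, lora_down)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# Toy 2x2 weight with a rank-1 LoRA.
base = [[1.0, 0.0], [0.0, 1.0]]
up = [[1.0], [2.0]]    # shape (out_features, rank)
down = [[0.5, 0.0]]    # shape (rank, in_features)
merged = bake_lora(base, up, down, alpha=1.0)
print(merged)
```

After the merge, the custom LoRA trains against the already-modified weights, which is why the stacked result depends on the 8-step LoRA being present at inference as well.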

In my Z-Image-Turbo de-distillation article, training a LoRA on a distilled fast model broke the low-step generation path. The U1 trainer’s 8-step stack is a different implementation, but the problem space overlaps: stacking on the official 8-step LoRA changes not just the target style but also fast-inference behavior.

I haven’t compared these yet. To separate what breaks where, you’d need the same style dataset, same seed, same prompt, running both 50-step and 8-step inference on both default and 8-step-stack trained weights.
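The comparison described above is a 2x2 grid with prompt and seed held fixed. A small sketch of enumerating it follows. The dictionary fields are placeholders for whatever the inference script actually accepts, not its real argument names.

```python
from itertools import product

WEIGHTS = ("default", "stack_8step")  # which trained weights to load
STEPS = (50, 8)                       # full-step vs fast inference


def build_runs(prompt: str, seed: int):
    """One run per (trained weights, step count), holding prompt and seed fixed."""
    return [
        {"weights": w, "num_steps": s, "prompt": prompt, "seed": seed}
        for w, s in product(WEIGHTS, STEPS)
    ]


for run in build_runs("a lighthouse at dusk", seed=1234):
    print(run["weights"], run["num_steps"])
```

Keeping everything except weights and step count constant is what lets a regression be attributed to either the stacking or the step reduction rather than to sampling noise.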

A3B MoE targets: compatibility layer only

The model card has an “A3B / MoE Status” section. The trainer includes experimental grammar for MoE LoRA targets such as gen_moe_mlp, gen_moe_router, and mlp_mot_gen.experts.*.gate_proj.

This is not a stable training path. The README states that stable support targets SenseNova-U1-8B-MoT only, and A3B training depends on public MoE runtime support instantiating the target modules. The current MoE target grammar reads more like a compatibility layer for estimating LoRA size from A3B metadata.

On the SenseNova U1 model card, the Lite series has both 8B-MoT (dense backbone) and A3B-MoT (MoE backbone). Which modules a LoRA targets differs between 8B and A3B even within the same U1 family.

The official training/ code does include an A3B MoE full-parameter training surface, but that assumes 16x 80GB GPUs in a distributed setup. That doesn’t mean Laxhar’s A3B LoRA grammar runs stably on a single GPU.

RunPod with 32GB+ as the starting point

This isn’t a tool you casually try on a Mac or a 24GB VRAM card. Following the README: 32GB+ CUDA GPU, 64GB+ CPU RAM, 80GB+ disk. The Hugging Face snapshot alone is ~17GB, and checkpoints add up.
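Those minimums are simple to encode as a check to run before renting or reusing a pod. The sketch below takes the three numbers as inputs, so it does not depend on any GPU library. unmet_requirements and free_disk_gb are hypothetical helpers, and the thresholds are the README figures quoted above.

```python
import shutil

# README minimums quoted above: (label, required GB)
MINIMUMS = (("VRAM", 32), ("RAM", 64), ("disk", 80))


def unmet_requirements(vram_gb: float, ram_gb: float, disk_gb: float) -> list[str]:
    """Return a human-readable line for each resource below its minimum."""
    have = {"VRAM": vram_gb, "RAM": ram_gb, "disk": disk_gb}
    return [
        f"{name}: {have[name]:.0f} GB < {need} GB"
        for name, need in MINIMUMS
        if have[name] < need
    ]


def free_disk_gb(path: str = ".") -> float:
    """Free space on the volume holding `path`, in GB (10**9 bytes)."""
    return shutil.disk_usage(path).free / 1e9


print(unmet_requirements(vram_gb=24, ram_gb=64, disk_gb=free_disk_gb()))
```

VRAM and system RAM would come from the pod's spec sheet or from tools like nvidia-smi; only the disk figure is measured here.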

My plan is to start on RunPod with an RTX 6000 Ada or A100 40GB, running default.yaml as-is. Blackwell RTX 5090 is listed as supported hardware, but you’d need to sort out torch 2.9, CUDA runtime, cu128 wheels, and bitsandbytes compatibility. When I used AnimaLoraToolkit, xformers bumping torch to cu130 broke the Ada environment — for these trainers, pinning the Python/CUDA stack matters more than GPU generation.
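One cheap guard against silent stack drift is to compare the installed torch version string against the pin before training. pip wheels of PyTorch report versions like 2.9.0+cu128, where the part after the plus sign is the CUDA build tag. The helpers below only parse such strings; the specific pin values are examples, not something the README prescribes.

```python
from typing import Optional, Tuple


def parse_torch_build(version: str) -> Tuple[str, Optional[str]]:
    """Split '2.9.0+cu128' into ('2.9.0', 'cu128'); the tag may be absent."""
    release, _, local = version.partition("+")
    return release, (local or None)


def matches_pin(version: str, want_minor: str, want_tag: str) -> bool:
    """True when major.minor and the CUDA build tag both match the pin."""
    release, tag = parse_torch_build(version)
    return release.split(".")[:2] == want_minor.split(".") and tag == want_tag


# Example pin; in practice pass torch.__version__ as the first argument.
print(matches_pin("2.9.0+cu128", "2.9", "cu128"))
print(matches_pin("2.9.0+cu130", "2.9", "cu128"))
```

Running this check right after installing extras such as xformers would surface the kind of cu130 bump described above before it costs a training run.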

Starting with a small style dataset rather than character LoRA makes sense. The trainer’s ablation study targets small style datasets, and the default config is built around that. Evaluating character identity, limbs, and facial reproduction all at once makes it hard to tell whether issues come from the trainer or the dataset.

Pre-flight checks before running:

Item	Reason
CUDA / torch / bitsandbytes	RTX 5090 often needs cu128 wheels
`trainable_state.safetensors` keys	May contain full-FT params alongside LoRA
default vs 8-step stack comparison	Separate 50-step vs 8-step degradation
Captions and trigger	Avoid baking style redundantly into captions
24GB setup	Not a training target for default buckets — sample verification only

I haven’t run any actual training yet, so the ~20GB peak VRAM and 4/8-bit artifact claims are from the README/SETUP. Character LoRA identity, limb/face quality, and 8-step stack style-vs-speed tradeoffs remain untested.