
UltraFlux-v1 - a native 4K image generation model based on FLUX.1-dev


A model has appeared that pushes FLUX.1-dev all the way toward “native 4K generation”: W2GenAI Lab’s UltraFlux-v1. The paper was published on arXiv in November 2025, and the weights are also available on Hugging Face.

In the FLUX ecosystem, FLUX.2 Klein went in the direction of smaller models, and FLUX.1 Schnell focused on distillation for speed. UltraFlux goes in the opposite direction: “make 4096×4096 native.” I had just written about Z-Image, so I wanted to compare the two approaches.

Basic specs

| Item | Details |
| --- | --- |
| Base model | FLUX.1-dev (Black Forest Labs) |
| Parameters | about 12B equivalent (based on FLUX.1-dev) |
| Supported resolution | up to 4096×4096, multiple aspect ratios |
| Inference steps | 50 |
| Guidance scale | 4 |
| Data type | bfloat16 |
| License | Apache 2.0 |
| Paper | arXiv:2511.18050 |

Technical points

UltraFlux’s concept is data-model co-design. Instead of solving each requirement for 4K generation separately, it designs the dataset, architecture, loss function, and training curriculum as a single system.

Resonance 2D RoPE + YaRN

FLUX.1-dev’s position encoding (RoPE) was designed for 1K to 2K resolutions. If you extend it to 4K directly, position information breaks down.

UltraFlux uses Resonance RoPE to stabilize frequency components in the high-resolution range, and YaRN (Yet another RoPE extensioN) to extrapolate context length. It applies a context-extension trick originally used for LLMs to the 2D space of image generation. That keeps position information stable across square, portrait, and ultrawide aspect ratios.
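
As a rough illustration of the idea (not UltraFlux's actual code), YaRN's "NTK-by-parts" rescaling can be applied per spatial axis: dimensions that rotate many times over the training extent keep their frequencies, while slowly rotating dimensions are interpolated to cover the longer grid. All values below (train_len, target_len, alpha, beta) are hypothetical, and Resonance RoPE's additional frequency stabilization is omitted.

import torch

def yarn_scale_axis(freqs, train_len, target_len, alpha=1.0, beta=32.0):
    # freqs: RoPE angular frequencies for one spatial axis
    # Count full rotations each dimension completes over the training extent
    rotations = train_len * freqs / (2 * torch.pi)
    # gamma -> 1 for fast-rotating (high-frequency) dims: keep local detail
    # gamma -> 0 for slow-rotating (low-frequency) dims: interpolate
    gamma = ((rotations - alpha) / (beta - alpha)).clamp(0.0, 1.0)
    scale = target_len / train_len  # e.g. 4.0 when extending 1K -> 4K tokens
    return gamma * freqs + (1 - gamma) * freqs / scale

dim = 64
freqs = 1.0 / (10000.0 ** (torch.arange(0, dim, 2).float() / dim))
# Applied independently to the height and width axes of the 2D token grid
freqs_h = yarn_scale_axis(freqs, train_len=64, target_len=256)
freqs_w = yarn_scale_axis(freqs, train_len=64, target_len=256)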

VAE post-training

The standard FLUX VAE is optimized for 1K resolutions. If you compress and reconstruct 4K images with it, fine detail gets lost.

UltraFlux improves the VAE through non-adversarial post-training, without using adversarial learning or GAN loss. The goal is to improve reconstruction of fine textures and edges in 4K images while keeping training stable.
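
In spirit, that looks like a plain reconstruction-loss fine-tune of the decoder. The sketch below is my own minimal version, not the paper's recipe; the frozen-encoder setup, loss mix, and learning rate are all assumptions.

import torch
import torch.nn.functional as F
from diffusers import AutoencoderKL

# Assumption: fine-tune only the decoder on native-4K crops, pixel losses only
vae = AutoencoderKL.from_pretrained(
    "black-forest-labs/FLUX.1-dev", subfolder="vae"
).to("cuda")
vae.requires_grad_(False)
vae.decoder.requires_grad_(True)
opt = torch.optim.AdamW(vae.decoder.parameters(), lr=1e-5)

def train_step(batch):  # batch: 4K image crops scaled to [-1, 1]
    with torch.no_grad():
        latents = vae.encode(batch).latent_dist.sample()
    recon = vae.decode(latents).sample
    # Pure reconstruction terms: no discriminator, so no GAN instability
    loss = F.l1_loss(recon, batch) + 0.5 * F.mse_loss(recon, batch)
    loss.backward()
    opt.step()
    opt.zero_grad()
    return loss.item()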

SNR-Aware Huber Wavelet loss

In ordinary diffusion training, gradient balance across noise levels tends to drift. High-frequency components such as textures and edges are especially easy to underweight.

This loss rebalances gradients across frequency bands using wavelet decomposition and weights them according to SNR, or signal-to-noise ratio. Huber loss also reduces the influence of outliers. The result is better sharpness in 4K images.
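
The paper's exact formulation isn't reproduced here, but the mechanics can be sketched with a Haar-like band split, a per-band Huber loss, and min-SNR-style weighting. The hf_weight and SNR clamp gamma below are made-up values.

import torch
import torch.nn.functional as F

def split_bands(x):
    # Haar-like split: 2x average pooling as low-pass, residual as high-pass
    low = F.avg_pool2d(x, 2)
    high = x - F.interpolate(low, scale_factor=2, mode="nearest")
    return low, high

def snr_huber_wavelet_loss(pred, target, snr, hf_weight=2.0, gamma=5.0):
    p_lo, p_hi = split_bands(pred)
    t_lo, t_hi = split_bands(target)
    # Huber loss per band; textures/edges (high band) get extra weight
    lo = F.huber_loss(p_lo, t_lo, reduction="none").mean(dim=(1, 2, 3))
    hi = F.huber_loss(p_hi, t_hi, reduction="none").mean(dim=(1, 2, 3))
    per_sample = lo + hf_weight * hi
    # Min-SNR-style clamp keeps low-noise timesteps from dominating gradients
    weight = snr.clamp(max=gamma) / snr
    return (weight * per_sample).mean()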

Stage-wise Aesthetic Curriculum Learning (SACL)

The model uses curriculum learning that starts with diverse data for generalization and then concentrates on high-quality data in the high-noise training stage. Instead of mixing everything blindly, it controls the quality of what the model sees at each stage.
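
A minimal sketch of what such a schedule could look like, with entirely hypothetical thresholds; the aesthetic score stands in for the dataset's IQA metadata, and the routing of high-noise timesteps to curated data follows the paper's description only loosely.

WARMUP_STEPS = 20_000  # stage 1: everything, for generalization (made up)
HIGH_NOISE_T = 0.7     # later stages: curated data for high-noise timesteps
MIN_SCORE = 6.5        # hypothetical aesthetic/IQA cutoff for the curated pool

def pick_pool(step, t, full_pool, curated_pool):
    # t: normalized diffusion time in [0, 1] sampled for this training step;
    # curated_pool holds images whose aesthetic score >= MIN_SCORE
    if step < WARMUP_STEPS:
        return full_pool          # diverse data first
    if t >= HIGH_NOISE_T:
        return curated_pool       # high-noise steps see only high-quality data
    return full_pool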

The MultiAspect-4K-1M dataset

The training dataset was announced together with the model.

  • 1 million native 4K images
  • Bilingual captions in English and Chinese
  • VLM and IQA metadata
  • Sampling designed to balance aspect ratios

The dataset is not fully public yet, but the paper describes its structure in detail.

Position in the FLUX ecosystem

There are now enough FLUX-based models that it helps to line them up.

| Model | Direction | Parameters | Resolution | License |
| --- | --- | --- | --- | --- |
| FLUX.1 dev/pro | flagship | 12B | up to 2K | dev: non-commercial |
| FLUX.1 Schnell | distilled for speed | 12B | up to 2K | Apache 2.0 |
| FLUX.2 Klein 9B | lighter without distillation | 9B | up to 2K | non-commercial |
| FLUX.2 Klein 4B | even lighter | 4B | up to 2K | non-commercial |
| UltraFlux-v1 | 4K-focused | 12B equivalent | up to 4K | Apache 2.0 |

FLUX.2 Klein means “same quality, smaller model.” UltraFlux means “same size, higher resolution.” The directions are opposite, so the use cases are different too.

Comparison with Z-Image

Z-Image is also being watched as a challenger to FLUX, but the approach is completely different.

| | UltraFlux-v1 | Z-Image |
| --- | --- | --- |
| Base | modified FLUX.1-dev | independent design (S3-DiT) |
| Parameters | 12B equivalent | 6B |
| Max resolution | 4096×4096 | 2048×2048 |
| Minimum VRAM | 24GB+ (estimated) | 6GB with quantization |
| Negative prompt | unsupported (follows FLUX) | supported |
| LoRA compatibility | unclear for FLUX LoRAs | own ecosystem |
| Strength | 4K generation quality | lightness and parameter efficiency |

Z-Image is an efficiency play: “beat a 12B model with 6B.” UltraFlux is a quality play: “raise the resolution ceiling of a 12B model.” The model you choose depends on your environment and your use case.

Inference code

It uses its own pipeline class rather than the standard diffusers pipeline.

import torch
from ultraflux.pipeline_flux import FluxPipeline
from ultraflux.transformer_flux_visionyarn import FluxTransformer2DModel
from ultraflux.autoencoder_kl import AutoencoderKL

# Load components
local_vae = AutoencoderKL.from_pretrained(
    "Owen777/UltraFlux-v1",
    subfolder="vae",
    torch_dtype=torch.bfloat16
)
transformer = FluxTransformer2DModel.from_pretrained(
    "Owen777/UltraFlux-v1",
    subfolder="transformer",
    torch_dtype=torch.bfloat16
)

pipe = FluxPipeline.from_pretrained(
    "Owen777/UltraFlux-v1",
    vae=local_vae,
    torch_dtype=torch.bfloat16,
    transformer=transformer
)
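# Required by UltraFlux: pin a fixed time shift instead of dynamic shifting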
pipe.scheduler.config.use_dynamic_shifting = False
pipe.scheduler.config.time_shift = 4
pipe = pipe.to("cuda")

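# Generate at native 4K with the recommended 50 steps and guidance scale 4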
image = pipe(
    prompt="a cat sitting on a windowsill at sunset",
    height=4096,
    width=4096,
    guidance_scale=4,
    num_inference_steps=50,
    max_sequence_length=512,
    generator=torch.Generator("cpu").manual_seed(0)
).images[0]

image.save("output.jpeg")

The FluxTransformer2DModel here lives in UltraFlux’s visionyarn submodule, and the scheduler’s time_shift has to be set explicitly. In other words, the checkpoint is not a drop-in replacement for stock FLUX.1-dev, and the standard diffusers FluxPipeline will not work; you must use UltraFlux’s own pipeline.

Practical caveats

Heavy VRAM requirements

The full FLUX.1-dev model (12B, bf16) needs around 24GB of VRAM for the weights alone, and UltraFlux swaps in its own VAE and VisionYaRN Transformer of comparable size. On top of that, a 4096×4096 image has 16x the pixels of a 1024×1024 one, so latents and intermediate activations scale by the same factor, and attention over the longer token sequence grows faster still.

An RTX 4090 with 24GB might run it if you use enable_model_cpu_offload(), but generation speed will probably be much slower. For comfortable use, you probably want an RTX A6000 (48GB) or a cloud GPU.
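
Assuming UltraFlux’s custom pipeline keeps diffusers’ standard offload hooks (I have not verified this), the usual calls would apply; note that you skip pipe.to("cuda") when offloading.

# Instead of pipe = pipe.to("cuda") (requires the accelerate package):
pipe.enable_model_cpu_offload()         # move components to the GPU on demand
# pipe.enable_sequential_cpu_offload()  # even lower VRAM, much slower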

Apple Silicon is rough

FLUX.2 Klein, at 9B parameters and 29GB, is already impractical on Apple Silicon. UltraFlux is even less friendly: 12B parameters plus 4K resolution. An M1/M2/M3 Max with 64GB of unified memory might fit the model, but generation times would likely be unrealistic.

ComfyUI support is unconfirmed

Only diffusers-based inference scripts are published on GitHub. Native ComfyUI support had not been confirmed as of February 2026.

The ecosystem is still immature

Downloads are around 280 and likes around 169. That is modest compared with Z-Image’s initial traction. There is one Hugging Face Spaces demo, but I have not seen third-party fine-tunes or LoRAs yet.

v1.1 variant

The day after v1 was released, a v1.1 Transformer was published. It is a variant fine-tuned on high-quality synthetic images, with claimed improvements in composition and aesthetic quality. It can be used by swapping only the Transformer.
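
Swapping would look roughly like this; the repo id below is my guess, so check the model card for the actual identifier.

# Only the Transformer changes; VAE and scheduler settings stay as above
transformer_v11 = FluxTransformer2DModel.from_pretrained(
    "Owen777/UltraFlux-v1.1",   # hypothetical repo id, not confirmed
    subfolder="transformer",
    torch_dtype=torch.bfloat16,
)
pipe.transformer = transformer_v11.to("cuda")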