UltraFlux-v1 - a native 4K image generation model based on FLUX.1-dev
A model has appeared that pushes FLUX.1-dev all the way toward “native 4K generation”: W2GenAI Lab’s UltraFlux-v1. The paper was published on arXiv in November 2025, and the weights are also available on Hugging Face.
In the FLUX ecosystem, FLUX.2 Klein went in the direction of smaller models, and Schnell focused on distillation for speed. UltraFlux goes in the opposite direction: “make 4096×4096 native.” I had just written about Z-Image, so I wanted to compare the two approaches.
Basic specs
| Item | Details |
|---|---|
| Base model | FLUX.1-dev (Black Forest Labs) |
| Parameters | Based on FLUX.1-dev, about 12B equivalent |
| Supported resolution | Up to 4096×4096, with multiple aspect ratios |
| Inference steps | 50 |
| Guidance scale | 4 |
| Data type | bfloat16 |
| License | Apache 2.0 |
| Paper | arXiv:2511.18050 |
Technical points
UltraFlux’s concept is data-model co-design. Instead of solving each requirement for 4K generation separately, it designs the dataset, architecture, loss function, and training curriculum as a single system.
Resonance 2D RoPE + YaRN
FLUX.1-dev’s position encoding (RoPE) was designed for 1K to 2K resolutions. If you extend it to 4K directly, position information breaks down.
UltraFlux uses Resonance RoPE to stabilize frequency components in the high-resolution range, and YaRN (Yet another RoPE extensioN) to extrapolate context length. It applies a context-extension trick originally used for LLMs to the 2D space of image generation. That keeps position information stable across square, portrait, and ultrawide aspect ratios.
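The YaRN idea can be sketched as follows. This is a simplified illustration, not the UltraFlux implementation (which pairs it with Resonance RoPE per axis): components whose wavelength exceeds the trained position range are interpolated to cover the extended range, while short-wavelength components are left untouched so local detail survives. The dimensions and lengths below are illustrative assumptions.

```python
import math

def yarn_scaled_freqs(dim, trained_len, target_len, base=10000.0):
    """YaRN-style frequency rescaling for one axis of a 2D RoPE (sketch).

    Low-frequency components (wavelength > trained range) are slowed down
    by the extension factor; high-frequency components are kept as-is.
    """
    scale = target_len / trained_len          # e.g. 256 / 64 = 4 for 4K vs 1K latents
    out = []
    for i in range(dim // 2):
        freq = base ** (-2 * i / dim)         # standard RoPE frequency schedule
        wavelength = 2 * math.pi / freq       # period in latent positions
        out.append(freq / scale if wavelength > trained_len else freq)
    return out

# One frequency table per spatial axis; 4096 px / 8 (VAE) / 2 (patchify) = 256 tokens per side
h_freqs = yarn_scaled_freqs(dim=64, trained_len=64, target_len=256)
w_freqs = yarn_scaled_freqs(dim=64, trained_len=64, target_len=256)
```

Applying the same rule independently to the height and width axes is what makes the extension work across portrait and ultrawide aspect ratios, not just squares.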
VAE post-training
The standard FLUX VAE is optimized for 1K resolutions. If you compress and reconstruct 4K images with it, fine detail gets lost.
UltraFlux improves the VAE through non-adversarial post-training, without using adversarial learning or GAN loss. The goal is to improve reconstruction of fine textures and edges in 4K images while keeping training stable.
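In outline, a non-adversarial post-training step looks like the sketch below: reconstruction-only objectives, no discriminator. The specific loss mix (L1 + MSE) and the tiny stand-in autoencoder are my assumptions for illustration, not the released recipe.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def posttrain_step(vae, batch, optimizer):
    """One non-adversarial fine-tuning step: pure reconstruction losses,
    no GAN/discriminator term (loss weights are illustrative)."""
    recon = vae(batch)
    loss = F.l1_loss(recon, batch) + 0.1 * F.mse_loss(recon, batch)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Tiny stand-in autoencoder so the step is runnable end-to-end.
toy_vae = nn.Sequential(
    nn.Conv2d(3, 8, 3, stride=2, padding=1),
    nn.ReLU(),
    nn.ConvTranspose2d(8, 3, 4, stride=2, padding=1),
)
opt = torch.optim.Adam(toy_vae.parameters(), lr=1e-3)
```

Dropping the GAN term removes the usual generator/discriminator balancing act, which is what keeps training stable at 4K.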
SNR-Aware Huber Wavelet loss
In ordinary diffusion training, gradient balance across noise levels tends to drift. High-frequency components such as textures and edges are especially easy to underweight.
This loss rebalances gradients across frequency bands using wavelet decomposition and weights them according to SNR, or signal-to-noise ratio. Huber loss also reduces the influence of outliers. The result is better sharpness in 4K images.
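A minimal sketch of the idea, not the paper's exact formulation: a one-level Haar-style split into low/high frequency bands, Huber loss per band with the detail band upweighted, and a min-SNR-style per-sample weight. The band weights and the SNR clamp value are assumptions.

```python
import torch
import torch.nn.functional as F

def snr_huber_wavelet_loss(pred, target, snr, delta=1.0):
    """Illustrative SNR-aware Huber wavelet loss (simplified).

    pred/target: [B, C, H, W] with even H, W; snr: [B].
    """
    # Low band via 2x2 average pooling; high band is the residual detail.
    low_p, low_t = F.avg_pool2d(pred, 2), F.avg_pool2d(target, 2)
    high_p = pred - F.interpolate(low_p, scale_factor=2, mode="nearest")
    high_t = target - F.interpolate(low_t, scale_factor=2, mode="nearest")

    def huber(a, b):
        return F.huber_loss(a, b, delta=delta, reduction="none").mean(dim=(1, 2, 3))

    # Upweight the high-frequency band so textures/edges are not drowned out.
    per_sample = huber(low_p, low_t) + 2.0 * huber(high_p, high_t)
    # Min-SNR-style weighting: clamp so very low-noise steps don't dominate.
    weight = torch.clamp(snr, max=5.0) / 5.0
    return (weight * per_sample).mean()
```

The key point is that the weighting is applied per frequency band and per noise level, rather than uniformly over pixels.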
Stage-wise Aesthetic Curriculum Learning (SACL)
The model uses curriculum learning that starts with diverse data for generalization and then concentrates on high-quality data in the high-noise training stage. Instead of mixing everything blindly, it controls the quality of what the model sees at each stage.
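Conceptually the curriculum amounts to stage-dependent data selection, as in this sketch. The field name, threshold, and stage names are hypothetical, not from the paper:

```python
import random

def sacl_batch(dataset, stage, quality_key="aesthetic", threshold=6.0, k=4):
    """Stage-wise aesthetic curriculum sampling (illustrative sketch):
    the generalization stage draws from the full dataset, while the
    later refinement stage draws only from high-scoring images."""
    if stage == "generalization":
        pool = dataset
    else:  # aesthetic refinement stage
        pool = [d for d in dataset if d[quality_key] >= threshold]
    return random.sample(pool, k=min(k, len(pool)))
```

The gating happens on the data side, so the model architecture and loss stay unchanged between stages.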
The MultiAspect-4K-1M dataset
The training dataset was released together with the model.
- 1 million native 4K images
- Bilingual captions in English and Chinese
- VLM and IQA metadata
- Sampling designed to balance aspect ratios
The dataset is not fully public yet, but the paper describes its structure in detail.
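Aspect-ratio-balanced sampling can be sketched as bucketing by rounded ratio and drawing round-robin, so no single ratio dominates a batch. This is my illustration of the stated goal, not the dataset's actual sampling code:

```python
import random
from collections import defaultdict

def balanced_aspect_sample(images, n):
    """Bucket images by rounded width/height ratio, then draw round-robin
    across buckets until n images are collected (illustrative sketch)."""
    buckets = defaultdict(list)
    for img in images:
        ratio = round(img["width"] / img["height"], 1)  # e.g. 1.0, 0.6, 1.8
        buckets[ratio].append(img)
    order = sorted(buckets)
    out, i = [], 0
    while len(out) < n and any(buckets.values()):
        b = buckets[order[i % len(order)]]
        if b:
            out.append(b.pop(random.randrange(len(b))))
        i += 1
    return out
```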
Position in the FLUX ecosystem
There are now enough FLUX-based models that it helps to line them up.
| Model | Direction | Parameters | Resolution | License |
|---|---|---|---|---|
| FLUX.1 dev/pro | flagship | 12B | up to 2K | dev: non-commercial |
| FLUX.1 Schnell | distilled for speed | 12B | up to 2K | Apache 2.0 |
| FLUX.2 Klein 9B | lighter without distillation | 9B | up to 2K | non-commercial |
| FLUX.2 Klein 4B | even lighter | 4B | up to 2K | non-commercial |
| UltraFlux-v1 | 4K-focused | 12B equivalent | up to 4K | Apache 2.0 |
FLUX.2 Klein means “same quality, smaller model.” UltraFlux means “same size, higher resolution.” The directions are opposite, so the use cases are different too.
Comparison with Z-Image
Z-Image is also being watched as a challenger to FLUX, but the approach is completely different.
| | UltraFlux-v1 | Z-Image |
|---|---|---|
| Base | modified FLUX.1-dev | independent design (S3-DiT) |
| Parameters | 12B equivalent | 6B |
| Max resolution | 4096×4096 | 2048×2048 |
| Minimum VRAM | 24GB+ (estimated) | 6GB with quantization |
| Negative prompt | unsupported, following FLUX | supported |
| LoRA compatibility | unclear for FLUX LoRAs | own ecosystem |
| Strength | 4K generation quality | lightness and parameter efficiency |
Z-Image is an efficiency play: “beat a 12B model with 6B.” UltraFlux is a quality play: “raise the resolution ceiling of a 12B model.” The model you choose depends on your environment and your use case.
Inference code
It uses its own pipeline class rather than the standard diffusers pipeline.
```python
import torch
from ultraflux.pipeline_flux import FluxPipeline
from ultraflux.transformer_flux_visionyarn import FluxTransformer2DModel
from ultraflux.autoencoder_kl import AutoencoderKL

# Load the post-trained VAE and the VisionYaRN transformer
local_vae = AutoencoderKL.from_pretrained(
    "Owen777/UltraFlux-v1",
    subfolder="vae",
    torch_dtype=torch.bfloat16,
)
transformer = FluxTransformer2DModel.from_pretrained(
    "Owen777/UltraFlux-v1",
    subfolder="transformer",
    torch_dtype=torch.bfloat16,
)
pipe = FluxPipeline.from_pretrained(
    "Owen777/UltraFlux-v1",
    vae=local_vae,
    transformer=transformer,
    torch_dtype=torch.bfloat16,
)

# UltraFlux requires a fixed time shift instead of dynamic shifting
pipe.scheduler.config.use_dynamic_shifting = False
pipe.scheduler.config.time_shift = 4
pipe = pipe.to("cuda")

image = pipe(
    prompt="a cat sitting on a windowsill at sunset",
    height=4096,
    width=4096,
    guidance_scale=4,
    num_inference_steps=50,
    max_sequence_length=512,
    generator=torch.Generator("cpu").manual_seed(0),
).images[0]
image.save("output.jpeg")
```
FluxTransformer2DModel comes from the visionyarn submodule, and the scheduler's time_shift must be set by hand, so this is not a drop-in replacement for stock FLUX.1-dev: you need UltraFlux's own FluxPipeline, not the standard diffusers one.
Practical caveats
Heavy VRAM requirements
The full FLUX.1-dev model (12B, bf16) needs around 24GB of VRAM. UltraFlux adds its own VAE and VisionYaRN Transformer, so the requirement may be even higher. A 4096×4096 latent space uses 16x the memory of a 1024×1024 one.
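The 16x figure follows directly from the VAE's 8x downsampling. A quick back-of-the-envelope check, counting latents only (an estimate for intuition; activations and attention buffers dominate real VRAM usage):

```python
def latent_bytes(height, width, channels=16, downsample=8, bytes_per_value=2):
    """Rough latent footprint for the FLUX VAE: 8x spatial downsampling,
    16 latent channels, bf16 = 2 bytes per value."""
    return (height // downsample) * (width // downsample) * channels * bytes_per_value

mb_1k = latent_bytes(1024, 1024) / 2**20   # 0.5 MiB of latents
mb_4k = latent_bytes(4096, 4096) / 2**20   # 8 MiB of latents
ratio = mb_4k / mb_1k                      # 16x more latent memory -- and 16x more tokens
```

The latents themselves are small; the real cost is that the transformer attends over 16x as many tokens, which is where attention memory and compute blow up.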
An RTX 4090 with 24GB might run it if you use enable_model_cpu_offload(), but generation speed will probably be much slower. For comfortable use, you probably want an RTX A6000 (48GB) or a cloud GPU.
Apple Silicon is rough
FLUX.2 Klein, at 9B and 29GB, is already impractical on Apple Silicon, and UltraFlux is even less friendly: 12B parameters plus 4K resolution. An M1/M2/M3 Max with 64GB of unified memory might fit the weights, but generation time would almost certainly be impractical.
ComfyUI support is unconfirmed
Only diffusers-based inference scripts are published on GitHub. Native ComfyUI support had not been confirmed as of February 2026.
The ecosystem is still immature
Downloads are around 280 and likes around 169. That is modest compared with Z-Image’s initial traction. There is one Hugging Face Spaces demo, but I have not seen third-party fine-tunes or LoRAs yet.
v1.1 variant
The day after v1 was released, a v1.1 Transformer was published. It is a variant fine-tuned on high-quality synthetic images, with claimed improvements in composition and aesthetic quality. It can be used by swapping only the Transformer.
Links
- Hugging Face: Owen777/UltraFlux-v1
- GitHub: W2GenAI-Lab/UltraFlux
- arXiv: 2511.18050
- Tech report (PDF)