UltraFlux-v1 - a native 4K image generation model based on FLUX.1-dev
A model has appeared that pushes FLUX.1-dev all the way toward “native 4K generation”: W2GenAI Lab’s UltraFlux-v1. The paper was published on arXiv in November 2025, and the weights are also available on Hugging Face.
In the FLUX ecosystem, FLUX.2 Klein went in the direction of smaller models, and Schnell focused on distillation for speed. UltraFlux goes in the opposite direction: “make 4096×4096 native.” I had just written about Z-Image, so I wanted to compare the two approaches.
Basic specs
| Item | Details |
|---|---|
| Base model | FLUX.1-dev (Black Forest Labs) |
| Parameters | Based on FLUX.1-dev, about 12B equivalent |
| Supported resolution | Up to 4096×4096, with multiple aspect ratios |
| Inference steps | 50 |
| Guidance scale | 4 |
| Data type | bfloat16 |
| License | Apache 2.0 |
| Paper | arXiv:2511.18050 |
Technical points
UltraFlux’s concept is data-model co-design. Instead of solving each requirement for 4K generation separately, it designs the dataset, architecture, loss function, and training curriculum as a single system.
Resonance 2D RoPE + YaRN
FLUX.1-dev’s position encoding (RoPE) was designed for 1K to 2K resolutions. If you extend it to 4K directly, position information breaks down.
UltraFlux uses Resonance RoPE to stabilize frequency components in the high-resolution range, and YaRN (Yet another RoPE extensioN) to extrapolate context length. It applies a context-extension trick originally used for LLMs to the 2D space of image generation. That keeps position information stable across square, portrait, and ultrawide aspect ratios.
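The YaRN idea can be sketched as follows. This is a simplified illustration, not the UltraFlux implementation (which pairs it with Resonance RoPE per axis): components whose wavelength exceeds the trained position range are interpolated to cover the extended range, while short-wavelength components are left untouched so local detail survives. The dimensions and lengths below are illustrative assumptions.

```python
import math

def yarn_scaled_freqs(dim, trained_len, target_len, base=10000.0):
    """YaRN-style frequency rescaling for one axis of a 2D RoPE (sketch).

    Low-frequency components (wavelength > trained range) are slowed down
    by the extension factor; high-frequency components are kept as-is.
    """
    scale = target_len / trained_len          # e.g. 256 / 64 = 4 for 4K vs 1K latents
    out = []
    for i in range(dim // 2):
        freq = base ** (-2 * i / dim)         # standard RoPE frequency schedule
        wavelength = 2 * math.pi / freq       # period in latent positions
        out.append(freq / scale if wavelength > trained_len else freq)
    return out

# One frequency table per spatial axis; 4096 px / 8 (VAE) / 2 (patchify) = 256 tokens per side
h_freqs = yarn_scaled_freqs(dim=64, trained_len=64, target_len=256)
w_freqs = yarn_scaled_freqs(dim=64, trained_len=64, target_len=256)
```

Applying the same rule independently to the height and width axes is what makes the extension work across portrait and ultrawide aspect ratios, not just squares.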
VAE post-training
The standard FLUX VAE is optimized for 1K resolutions. If you compress and reconstruct 4K images with it, fine detail gets lost.
UltraFlux improves the VAE through non-adversarial post-training, without using adversarial learning or GAN loss. The goal is to improve reconstruction of fine textures and edges in 4K images while keeping training stable.
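In outline, a non-adversarial post-training step looks like the sketch below: reconstruction-only objectives, no discriminator. The specific loss mix (L1 + MSE) and the tiny stand-in autoencoder are my assumptions for illustration, not the released recipe.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def posttrain_step(vae, batch, optimizer):
    """One non-adversarial fine-tuning step: pure reconstruction losses,
    no GAN/discriminator term (loss weights are illustrative)."""
    recon = vae(batch)
    loss = F.l1_loss(recon, batch) + 0.1 * F.mse_loss(recon, batch)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Tiny stand-in autoencoder so the step is runnable end-to-end.
toy_vae = nn.Sequential(
    nn.Conv2d(3, 8, 3, stride=2, padding=1),
    nn.ReLU(),
    nn.ConvTranspose2d(8, 3, 4, stride=2, padding=1),
)
opt = torch.optim.Adam(toy_vae.parameters(), lr=1e-3)
```

Dropping the GAN term removes the usual generator/discriminator balancing act, which is what keeps training stable at 4K.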
SNR-Aware Huber Wavelet loss
In ordinary diffusion training, gradient balance across noise levels tends to drift. High-frequency components such as textures and edges are especially easy to underweight.
This loss rebalances gradients across frequency bands using wavelet decomposition and weights them according to SNR, or signal-to-noise ratio. Huber loss also reduces the influence of outliers. The result is better sharpness in 4K images.
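A minimal sketch of the idea, not the paper's exact formulation: a one-level Haar-style split into low/high frequency bands, Huber loss per band with the detail band upweighted, and a min-SNR-style per-sample weight. The band weights and the SNR clamp value are assumptions.

```python
import torch
import torch.nn.functional as F

def snr_huber_wavelet_loss(pred, target, snr, delta=1.0):
    """Illustrative SNR-aware Huber wavelet loss (simplified).

    pred/target: [B, C, H, W] with even H, W; snr: [B].
    """
    # Low band via 2x2 average pooling; high band is the residual detail.
    low_p, low_t = F.avg_pool2d(pred, 2), F.avg_pool2d(target, 2)
    high_p = pred - F.interpolate(low_p, scale_factor=2, mode="nearest")
    high_t = target - F.interpolate(low_t, scale_factor=2, mode="nearest")

    def huber(a, b):
        return F.huber_loss(a, b, delta=delta, reduction="none").mean(dim=(1, 2, 3))

    # Upweight the high-frequency band so textures/edges are not drowned out.
    per_sample = huber(low_p, low_t) + 2.0 * huber(high_p, high_t)
    # Min-SNR-style weighting: clamp so very low-noise steps don't dominate.
    weight = torch.clamp(snr, max=5.0) / 5.0
    return (weight * per_sample).mean()
```

The key point is that the weighting is applied per frequency band and per noise level, rather than uniformly over pixels.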
Stage-wise Aesthetic Curriculum Learning (SACL)
The model uses curriculum learning that starts with diverse data for generalization and then concentrates on high-quality data in the high-noise training stage. Instead of mixing everything blindly, it controls the quality of what the model sees at each stage.
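Conceptually the curriculum amounts to stage-dependent data selection, as in this sketch. The field name, threshold, and stage names are hypothetical, not from the paper:

```python
import random

def sacl_batch(dataset, stage, quality_key="aesthetic", threshold=6.0, k=4):
    """Stage-wise aesthetic curriculum sampling (illustrative sketch):
    the generalization stage draws from the full dataset, while the
    later refinement stage draws only from high-scoring images."""
    if stage == "generalization":
        pool = dataset
    else:  # aesthetic refinement stage
        pool = [d for d in dataset if d[quality_key] >= threshold]
    return random.sample(pool, k=min(k, len(pool)))
```

The gating happens on the data side, so the model architecture and loss stay unchanged between stages.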
The MultiAspect-4K-1M dataset
The training dataset was released together with the model.
- 1 million native 4K images
- Bilingual captions in English and Chinese
- VLM and IQA metadata
- Sampling designed to balance aspect ratios
The dataset is not fully public yet, but the paper describes its structure in detail.
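Aspect-ratio-balanced sampling can be sketched as bucketing by rounded ratio and drawing round-robin, so no single ratio dominates a batch. This is my illustration of the stated goal, not the dataset's actual sampling code:

```python
import random
from collections import defaultdict

def balanced_aspect_sample(images, n):
    """Bucket images by rounded width/height ratio, then draw round-robin
    across buckets until n images are collected (illustrative sketch)."""
    buckets = defaultdict(list)
    for img in images:
        ratio = round(img["width"] / img["height"], 1)  # e.g. 1.0, 0.6, 1.8
        buckets[ratio].append(img)
    order = sorted(buckets)
    out, i = [], 0
    while len(out) < n and any(buckets.values()):
        b = buckets[order[i % len(order)]]
        if b:
            out.append(b.pop(random.randrange(len(b))))
        i += 1
    return out
```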
Position in the FLUX ecosystem
There are now enough FLUX-based models that it helps to line them up.
| Model | Direction | Parameters | Resolution | License |
|---|---|---|---|---|
| FLUX.1 dev/pro | flagship | 12B | up to 2K | dev: non-commercial |
| FLUX.1 Schnell | distilled for speed | 12B | up to 2K | Apache 2.0 |
| FLUX.2 Klein 9B | lighter without distillation | 9B | up to 2K | non-commercial |
| FLUX.2 Klein 4B | even lighter | 4B | up to 2K | non-commercial |
| UltraFlux-v1 | 4K-focused | 12B equivalent | up to 4K | Apache 2.0 |
FLUX.2 Klein means “same quality, smaller model.” UltraFlux means “same size, higher resolution.” The directions are opposite, so the use cases are different too.
Comparison with Z-Image
Z-Image is also being watched as a challenger to FLUX, but the approach is completely different.
| | UltraFlux-v1 | Z-Image |
|---|---|---|
| Base | modified FLUX.1-dev | independent design (S3-DiT) |
| Parameters | 12B equivalent | 6B |
| Max resolution | 4096×4096 | 2048×2048 |
| Minimum VRAM | 24GB+ (estimated) | 6GB with quantization |
| Negative prompt | unsupported, following FLUX | supported |
| LoRA compatibility | unclear for FLUX LoRAs | own ecosystem |
| Strength | 4K generation quality | lightness and parameter efficiency |
Z-Image is an efficiency play: “beat a 12B model with 6B.” UltraFlux is a quality play: “raise the resolution ceiling of a 12B model.” The model you choose depends on your environment and your use case.
Inference code
It uses its own pipeline class rather than the standard diffusers pipeline.
```python
import torch
from ultraflux.pipeline_flux import FluxPipeline
from ultraflux.transformer_flux_visionyarn import FluxTransformer2DModel
from ultraflux.autoencoder_kl import AutoencoderKL

# Load the post-trained VAE and the VisionYaRN transformer
local_vae = AutoencoderKL.from_pretrained(
    "Owen777/UltraFlux-v1",
    subfolder="vae",
    torch_dtype=torch.bfloat16,
)
transformer = FluxTransformer2DModel.from_pretrained(
    "Owen777/UltraFlux-v1",
    subfolder="transformer",
    torch_dtype=torch.bfloat16,
)
pipe = FluxPipeline.from_pretrained(
    "Owen777/UltraFlux-v1",
    vae=local_vae,
    transformer=transformer,
    torch_dtype=torch.bfloat16,
)

# UltraFlux requires a fixed time shift instead of dynamic shifting
pipe.scheduler.config.use_dynamic_shifting = False
pipe.scheduler.config.time_shift = 4
pipe = pipe.to("cuda")

image = pipe(
    prompt="a cat sitting on a windowsill at sunset",
    height=4096,
    width=4096,
    guidance_scale=4,
    num_inference_steps=50,
    max_sequence_length=512,
    generator=torch.Generator("cpu").manual_seed(0),
).images[0]
image.save("output.jpeg")
```
FluxTransformer2DModel comes from the visionyarn submodule, and the scheduler's time_shift must be set by hand, so this is not a drop-in replacement for stock FLUX.1-dev: you need UltraFlux's own FluxPipeline, not the standard diffusers one.
Practical caveats
Heavy VRAM requirements
The full FLUX.1-dev model (12B, bf16) needs around 24GB of VRAM. UltraFlux adds its own VAE and VisionYaRN Transformer, so the requirement may be even higher. A 4096×4096 latent space uses 16x the memory of a 1024×1024 one.
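The 16x figure follows directly from the VAE's 8x downsampling. A quick back-of-the-envelope check, counting latents only (an estimate for intuition; activations and attention buffers dominate real VRAM usage):

```python
def latent_bytes(height, width, channels=16, downsample=8, bytes_per_value=2):
    """Rough latent footprint for the FLUX VAE: 8x spatial downsampling,
    16 latent channels, bf16 = 2 bytes per value."""
    return (height // downsample) * (width // downsample) * channels * bytes_per_value

mb_1k = latent_bytes(1024, 1024) / 2**20   # 0.5 MiB of latents
mb_4k = latent_bytes(4096, 4096) / 2**20   # 8 MiB of latents
ratio = mb_4k / mb_1k                      # 16x more latent memory -- and 16x more tokens
```

The latents themselves are small; the real cost is that the transformer attends over 16x as many tokens, which is where attention memory and compute blow up.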
An RTX 4090 with 24GB might run it if you use enable_model_cpu_offload(), but generation speed will probably be much slower. For comfortable use, you probably want an RTX A6000 (48GB) or a cloud GPU.
Apple Silicon is rough
FLUX.2 Klein, at 9B and 29GB, is already impractical on Apple Silicon, and UltraFlux is even less friendly: 12B parameters plus 4K resolution. An M1/M2/M3 Max with 64GB of unified memory might fit the weights, but generation time would almost certainly be impractical.
ComfyUI support is unconfirmed
Only diffusers-based inference scripts are published on GitHub. Native ComfyUI support had not been confirmed as of February 2026.
The ecosystem is still immature
Downloads are around 280 and likes around 169. That is modest compared with Z-Image’s initial traction. There is one Hugging Face Spaces demo, but I have not seen third-party fine-tunes or LoRAs yet.
v1.1 variant
The day after v1 was released, a v1.1 Transformer was published. It is a variant fine-tuned on high-quality synthetic images, with claimed improvements in composition and aesthetic quality. It can be used by swapping only the Transformer.
Links
- Hugging Face: Owen777/UltraFlux-v1
- GitHub: W2GenAI-Lab/UltraFlux
- arXiv: 2511.18050
- Tech report (PDF)