How to get ComfyUI running on Blackwell GPUs, and what can go wrong
If you buy a Blackwell-generation GPU such as an RTX 5090 or RTX PRO 6000 and try to run ComfyUI, you may hit `CUDA error: no kernel image is available for execution on the device`. The flood of Reddit and GitHub Discussions threads about this error is not evidence of broken hardware; the stable PyTorch releases simply do not yet support Blackwell's Compute Capability, sm_120.
I ran into the same issue with the RTX PRO 6000 in the Qwen-Image-Layered article, where ComfyUI broke during VAE decoding and I switched to diffusers. At the time I chalked it up to “waiting for ComfyUI support,” but the environment actually works if you move to PyTorch Nightly and remove xformers.
Why ComfyUI fails on Blackwell
sm_120 and CUDA build mismatch
CUDA distributes GPU code in two forms: binary cubin objects and assembly-style PTX. A cubin is tied to a specific Compute Capability, so an sm_90 build for Hopper will not run on sm_120 Blackwell hardware. If PTX is included, the driver can JIT-compile it, but the stable PyTorch releases (2.5 to 2.7) still lack the sm_120 PTX path or block it through an architecture check.
That leads to one of two symptoms:
- immediate death with `CUDA error: no kernel image is available for execution on the device`
- CPU fallback after GPU execution is abandoned, which is far too slow to use
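A quick way to tell which case you are in is to ask PyTorch itself which architectures the installed wheel was compiled for. A minimal diagnostic sketch; run it inside the ComfyUI virtual environment (all calls are standard `torch.cuda` API and degrade gracefully on a CPU-only build):

```python
# Quick diagnostic: does the installed PyTorch wheel know about this GPU?
import torch

print("torch:", torch.__version__, "| CUDA build:", torch.version.cuda)
print("CUDA available:", torch.cuda.is_available())

if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability(0)
    print(f"Compute Capability: sm_{major}{minor}")
    # Blackwell needs sm_120 (or a compute_120 PTX entry) in this list;
    # stable wheels typically stop around sm_90.
    print("compiled for:", torch.cuda.get_arch_list())
```

If `sm_120` (or `compute_120`) is missing from the printed list, no amount of ComfyUI configuration will help; the fix is a different PyTorch build.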
Blackwell hardware specs
RTX PRO 6000 Blackwell Workstation Edition:
| Item | Spec |
|---|---|
| VRAM | 96 GB GDDR7 |
| Memory bus | 512-bit |
| Bandwidth | about 1.8 TB/s |
| Tensor cores | 5th generation |
| Native precision | FP4 support |
| TDP | 600W (300W in Max-Q) |
The consumer RTX 5090 / 5080 / 5070 uses the same sm_120 architecture, so the compatibility issue is shared.
The xformers trap
The most annoying issue on Blackwell is what I call the xformers trap. Even after you build a correct PyTorch Nightly environment, GPU support can suddenly disappear.
What happens
```mermaid
graph TD
    A[Build Blackwell environment with<br/>PyTorch Nightly cu130] --> B[Install custom nodes]
    B --> C[xformers appears in requirements.txt]
    C --> D[pip dependency resolver runs]
    D --> E[xformers requires stable PyTorch]
    E --> F[PyTorch Nightly gets uninstalled]
    F --> G[pip downgrades to stable PyTorch]
    G --> H[sm_120 compatibility is lost]
    H --> I[environment collapses]
```
As of early 2026, the prebuilt xformers wheels on PyPI only support up to RTX 40-series (Ada Lovelace, sm_89). If a custom node pulls xformers in as a dependency, pip can silently uninstall Nightly and downgrade the environment.
How to deal with it
- always add `--disable-xformers` when starting
- manually remove xformers from custom-node `requirements.txt`
- use `--disable-all-custom-nodes` and `--whitelist-custom-nodes` when isolating problems
- append `.disabled` to a problematic node directory to quarantine it
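The manual `requirements.txt` audit can be scripted. This helper is my own sketch, not part of ComfyUI; it assumes the standard `custom_nodes/<node>/requirements.txt` layout and only reports, never edits:

```python
# Flag any custom node whose requirements.txt would pull in xformers.
from pathlib import Path

def find_xformers_requirements(custom_nodes_dir):
    """Return (node_name, offending_line) pairs that would pull in xformers."""
    hits = []
    for req in sorted(Path(custom_nodes_dir).glob("*/requirements.txt")):
        for line in req.read_text(encoding="utf-8").splitlines():
            spec = line.split("#", 1)[0].strip()  # drop inline comments
            if spec.lower().startswith("xformers"):
                hits.append((req.parent.name, spec))
    return hits

if __name__ == "__main__":
    for node, spec in find_xformers_requirements("custom_nodes"):
        print(f"{node}: comment out or delete '{spec}'")
```

Running it after every custom-node install catches the trap before pip's resolver does.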
Attention options
If you remove xformers, you still need an attention backend for Cross-Attention and Self-Attention. On Blackwell there are three realistic choices.
| Attention mechanism | Blackwell compatibility | Setup | Notes |
|---|---|---|---|
| PyTorch SDPA | fully compatible | `--use-pytorch-cross-attention` | easiest and stable, slightly slower than Sage |
| SageAttention | needs a custom wheel | use the KJNodes Patch node | about 30-35% faster, but build and version requirements are strict |
| FlashAttention-4 | compatible on Linux | `pip install flash-attn-4` (Linux) | theoretically fastest, but hard to install on native Windows |
PyTorch SDPA
torch.nn.functional.scaled_dot_product_attention is built into PyTorch, so it works on sm_120 without extra libraries. Just add --use-pytorch-cross-attention. With torch.compile, performance gets close to specialized libraries.
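A minimal sketch of the path that `--use-pytorch-cross-attention` selects, using the public `scaled_dot_product_attention` API (shapes are illustrative: batch, heads, sequence, head dimension):

```python
# SDPA in one call vs. a naive reference implementation.
import torch
import torch.nn.functional as F

q = torch.randn(1, 8, 64, 64)
k = torch.randn(1, 8, 64, 64)
v = torch.randn(1, 8, 64, 64)

# PyTorch dispatches to the fastest kernel available on the current device
# (flash, memory-efficient, or the math fallback).
out = F.scaled_dot_product_attention(q, k, v)

# Naive reference for comparison.
scale = q.shape[-1] ** -0.5
ref = torch.softmax((q @ k.transpose(-2, -1)) * scale, dim=-1) @ v
print(torch.allclose(out, ref, atol=1e-5))
```

Because the kernel choice happens inside this one call, no Blackwell-specific library needs to exist for it to work; that is exactly why it is the safe default on sm_120.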
SageAttention
SageAttention 2.2.0 / 3.0 applies INT8 quantization to Query and Key while keeping Value at 16-bit precision. It can be 30-35% faster than SDPA, but Windows setup is painful.
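To illustrate the quantization scheme (this is my own sketch of symmetric per-tensor INT8 round-tripping, not the actual SageAttention CUDA kernel):

```python
# Symmetric per-tensor INT8 quantization, as applied to Q and K;
# V stays in 16-bit precision in SageAttention.
import torch

def quantize_int8(x: torch.Tensor):
    """Symmetric per-tensor INT8 quantization; returns (int8 values, scale)."""
    scale = x.abs().max() / 127.0
    q = torch.clamp(torch.round(x / scale), -127, 127).to(torch.int8)
    return q, scale

x = torch.randn(64, 64)
x_int8, scale = quantize_int8(x)
x_restored = x_int8.float() * scale

# Rounding error is bounded by half a quantization step.
max_err = (x - x_restored).abs().max().item()
print(f"max abs error: {max_err:.5f} (step = {scale.item():.5f})")
```

The speedup comes from doing the `QK^T` matmul in INT8 on Tensor Cores; the bounded rounding error is why quality usually survives, and why fragile models like Wan or Qwen can still misbehave.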
Windows installation pain
- SageAttention depends on Triton, and the standard Triton build is Linux-only. Windows needs the `triton-windows` fork.
- A source build with CUDA 13.0 and MSVC usually fails because of namespace conflicts in the PyTorch headers (`MSVC C2872: 'std' ambiguous symbol`).
- You have to use prebuilt wheels, but if the Nightly build date is even one day off, you get `DLL load failed`.
Global enablement can produce black images
Turning on --use-sage-attention globally can produce black outputs in some models such as Wan 2.1 and Qwen.
Use KJNodes at the node level
The safer way is to use ComfyUI-KJNodes’ Patch Sage Attention KJ node.
- Remove `--use-sage-attention` from startup, so ComfyUI uses default PyTorch attention
- Insert the Patch Sage Attention KJ node right after model loading
- Set the backend to `sageattn_qk_int8_pv_fp16_cuda`
You may still see `Using xformers attention in VAE` in the console. That is only a VAE encode/decode fallback notice, not a sign that the environment regressed.
NVFP4 quantization cuts VRAM dramatically
One of Blackwell’s headline features is native 4-bit floating point support, called NVFP4. Unlike ordinary post-training quantization, it is a hardware-accelerated precision format designed for AI workloads.
VRAM and throughput
For a 14B model:
| Precision | VRAM use |
|---|---|
| FP16 | about 28 GB |
| FP8 | about 14 GB |
| NVFP4 | about 6.8 to 7.5 GB |
Inference is 200-300% faster than FP16. In real tests, FLUX.1-dev image generation dropped from over 40 seconds in BF16 to about 12 seconds.
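The table's numbers follow from simple arithmetic: weight memory is parameter count times bits per parameter. NVFP4 lands slightly above the ideal 7 GB in practice because per-block scale factors add metadata. A quick check:

```python
# Back-of-the-envelope weight memory for a 14B-parameter model.
def weight_gb(params_billion: float, bits_per_param: int) -> float:
    """Weight memory in GB, ignoring activations and framework overhead."""
    # 1e9 params * bits / 8 bytes-per-bit-group / 1e9 bytes-per-GB
    return params_billion * bits_per_param / 8

for name, bits in [("FP16", 16), ("FP8", 8), ("NVFP4", 4)]:
    print(f"{name}: {weight_gb(14, bits):.1f} GB")
# prints FP16: 28.0 GB, FP8: 14.0 GB, NVFP4: 7.0 GB
```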
The comfy-kitchen backend
The standard ComfyUI model loader does not know how to hand 4-bit weights directly to Tensor Cores. To use NVFP4, you need comfy-kitchen, a library from Comfy-Org.
```shell
pip install comfy-kitchen[cublas]
```
CUDA 13.0 (cu130) is required.
How to verify NVFP4 is actually working
If comfy-kitchen is missing or fails to initialize, you will see:
- `weight_dtype` stays at `default`, forcing FP16 and using about 28 GB, which often leads to OOM
- `weight_dtype` becomes `fp8_e4m3fn`, which upcasts the 4-bit weights to 8-bit and uses about 14 GB, so you lose the NVFP4 benefit
When it is working correctly:
- the console shows `dequantize_nvfp4` and `quantize_nvfp4`
- you do not see `manual cast: torch.float16` warnings during model load
- `TF32 enabled, cuDNN benchmark active` appears
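These checks can be automated by grepping the console output. A hypothetical helper; the marker strings are the ones listed above, the function itself is my own sketch:

```python
# Scan a captured ComfyUI console log for NVFP4 health markers.
NVFP4_MARKERS = ("dequantize_nvfp4", "quantize_nvfp4")
FALLBACK_MARKER = "manual cast: torch.float16"

def nvfp4_active(log_text: str) -> bool:
    """True if the NVFP4 kernels ran and no FP16 fallback was logged."""
    has_nvfp4 = any(m in log_text for m in NVFP4_MARKERS)
    return has_nvfp4 and FALLBACK_MARKER not in log_text

good_log = "loading model... dequantize_nvfp4 selected... done"
bad_log = "loading model... manual cast: torch.float16 ... done"
print(nvfp4_active(good_log), nvfp4_active(bad_log))  # → True False
```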
Driver stability
On RTX PRO 6000 or 5090, heavy workloads that saturate VRAM, such as high-resolution video generation, can blank the display and crash the driver. In some cases a normal Windows reboot is not enough and you need a physical power-cycle.
Recommended drivers
| Branch | Version | Notes |
|---|---|---|
| Production Branch R580 | 582.16 | certified stable for RTX PRO 6000; clean install recommended for multi-GPU systems |
| Production Branch R595 | 595.71 (March 2, 2026) | fixes crash issues in Blackwell Max-Q multi-GPU setups |
Use DDU (Display Driver Uninstaller) to clean out old drivers before installing.
On Proxmox GPU passthrough
If you are passing the GPU through with VFIO on Linux, host crashes can happen. The usual workaround is setting the `vfio_iommu_type1` module option `allow_unsafe_interrupts=1`. If the vBIOS is old, update it through NVIDIA support.
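A sketch of how to persist that module option on a typical Linux host; the file name under `/etc/modprobe.d/` is my own choice, the option itself is the standard `vfio_iommu_type1` parameter:

```shell
# Persist the allow_unsafe_interrupts workaround across reboots.
echo "options vfio_iommu_type1 allow_unsafe_interrupts=1" | \
    sudo tee /etc/modprobe.d/vfio-iommu.conf

# Apply immediately if the module is already loaded:
# echo 1 | sudo tee /sys/module/vfio_iommu_type1/parameters/allow_unsafe_interrupts
```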
ComfyUI VRAM flags
| Flag | Effect |
|---|---|
| `--highvram` | keep the model in VRAM; on 96 GB cards this is basically required |
| `--async-offload` | asynchronously move idle tensors back to system RAM |
| `--pin-shared-memory` | lock RAM pages to avoid disk-swap latency |
| `--disable-cuda-malloc` | use PyTorch's native allocator; after Blackwell patches this can be more stable |
| `--fast-fp16` | optimize FP16 math |
Deployment choices
Docker on Linux / WSL2
If isolation matters most, Docker is the best option. ChiefNakor/comfyui-blackwell-docker provides a Blackwell-optimized image.
The benefits are strong:
- automated dependency resolution for cu130 PyTorch
- precompiled SageAttention
- comfy-kitchen integrated
- if a custom node tries to pull xformers, the host environment stays clean
- if it breaks, you just throw away the container and rebuild
Native Windows
If you want to avoid WSL2 overhead, hiroki-abe-58/ComfyUI-Win-Blackwell is a useful reference.
Requirements:
- pin the Python version to match the SageAttention wheel, usually 3.11 or 3.13
- explicitly exclude xformers and use SDPA / triton-windows instead
- manually audit dependencies when adding custom nodes
Setup steps
Driver and base environment
- Clean old drivers with DDU
- Clean-install the Production Branch driver (582.16 or 595.71)
- Install CUDA Toolkit 13.0
- Create a Python 3.11 or 3.13 virtual environment
PyTorch and ComfyUI
```shell
# Install PyTorch Nightly (cu130)
pip install --pre torch torchvision torchaudio \
  --index-url https://download.pytorch.org/whl/nightly/cu130

# Clone ComfyUI
git clone https://github.com/comfyanonymous/ComfyUI.git
```
Remove xformers from ComfyUI itself and from any custom-node requirements.txt.
Acceleration libraries
```shell
# comfy-kitchen for NVFP4
pip install comfy-kitchen[cublas]

# triton-windows for Windows
pip install triton-windows

# SageAttention wheel for sm_120
pip install sageattention-2.2.0+cu130.torch2.11-cp311-cp311-win_amd64.whl

# KJNodes for node-level SageAttention
cd custom_nodes
git clone https://github.com/kijai/ComfyUI-KJNodes.git
```
Startup flags
```bat
python main.py ^
  --disable-xformers ^
  --use-pytorch-cross-attention ^
  --highvram ^
  --async-offload ^
  --pin-shared-memory
```
Build the workflow
- Load an NVFP4-quantized checkpoint such as Wan 2.2 or FLUX.2
- Connect the model output to the Patch Sage Attention KJ node
- Set the backend to `sageattn_qk_int8_pv_fp16_cuda`
- Feed the patched model into KSampler