
How to get ComfyUI running on Blackwell GPUs, and what can go wrong

Ikesan

If you buy a Blackwell-generation GPU such as an RTX 5090 or RTX PRO 6000 and try to run ComfyUI, you may hit CUDA error: no kernel image is available for execution on the device. The flood of Reddit and GitHub Discussions posts is not caused by broken hardware; the stable PyTorch release just does not yet support Blackwell’s Compute Capability sm_120.

I ran into the same issue with the RTX PRO 6000 in the Qwen-Image-Layered article, where ComfyUI broke during VAE decoding and I switched to diffusers. At the time I chalked it up to “waiting for ComfyUI support,” but the environment actually works if you move to PyTorch Nightly and remove xformers.

Why ComfyUI fails on Blackwell

sm_120 and CUDA build mismatch

CUDA distributes GPU code in two forms: architecture-specific cubin binaries and PTX, a portable assembly-like intermediate representation. A cubin is tied to a specific Compute Capability, so an sm_90 build for Hopper will not run on sm_120 Blackwell hardware. If PTX is included, the driver can JIT-compile it for a newer GPU, but the stable PyTorch releases (2.5 to 2.7) still lack an sm_120 PTX path or block it through an architecture check.

That leads to one of two symptoms:

  • immediate death with CUDA error: no kernel image is available for execution on the device
  • CPU fallback after GPU execution is abandoned, which is far too slow to use
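Before launching ComfyUI, you can check from Python whether the installed PyTorch build actually ships kernels for your GPU. A minimal diagnostic sketch:

```python
import torch

if torch.cuda.is_available():
    # Blackwell reports compute capability (12, 0), i.e. sm_120
    major, minor = torch.cuda.get_device_capability(0)
    arch = f"sm_{major}{minor}"
    # Architectures this PyTorch build was compiled for
    compiled = torch.cuda.get_arch_list()
    print(f"GPU: {torch.cuda.get_device_name(0)} ({arch})")
    print(f"Build supports: {compiled}")
    if arch not in compiled:
        print("WARNING: no kernels for this GPU -> expect "
              "'no kernel image is available' errors")
else:
    print("CUDA is not available in this build")
```

On a stable 2.5-2.7 build paired with a Blackwell card, sm_120 will be missing from the list; the cu130 Nightly includes it.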

Blackwell hardware specs

RTX PRO 6000 Blackwell Workstation Edition:

| Item | Spec |
| --- | --- |
| VRAM | 96 GB GDDR7 |
| Memory bus | 512-bit |
| Bandwidth | about 1.8 TB/s |
| Tensor cores | 5th generation |
| Native precision | FP4 support |
| TDP | 600W (300W in Max-Q) |

The consumer RTX 5090 / 5080 / 5070 uses the same sm_120 architecture, so the compatibility issue is shared.

The xformers trap

The most annoying issue on Blackwell is what I call the xformers trap. Even after you build a correct PyTorch Nightly environment, GPU support can suddenly disappear.

What happens

graph TD
    A[Build Blackwell environment with<br/>PyTorch Nightly cu130] --> B[Install custom nodes]
    B --> C[xformers appears in requirements.txt]
    C --> D[pip dependency resolver runs]
    D --> E[xformers requires stable PyTorch]
    E --> F[PyTorch Nightly gets uninstalled]
    F --> G[pip downgrades to stable PyTorch]
    G --> H[sm_120 compatibility is lost]
    H --> I[environment collapses]

As of early 2026, the prebuilt xformers wheels on PyPI only support up to RTX 40-series (Ada Lovelace, sm_89). If a custom node pulls xformers in as a dependency, pip can silently uninstall Nightly and downgrade the environment.

How to deal with it

  • always add --disable-xformers when starting
  • manually remove xformers from custom-node requirements.txt
  • use --disable-all-custom-nodes and --whitelist-custom-nodes when isolating problems
  • append .disabled to a problematic node directory to quarantine it
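The manual audit in the second bullet can be scripted. The sketch below is my own helper, not part of ComfyUI; the function name and default path are assumptions. It comments out any xformers line it finds in custom-node requirements files rather than deleting it, so the change stays visible when the node is later updated:

```python
from pathlib import Path

def neutralize_xformers(custom_nodes: Path) -> list[Path]:
    """Comment out xformers lines in each node's requirements.txt."""
    touched = []
    for req in custom_nodes.glob("*/requirements.txt"):
        lines = req.read_text().splitlines()
        fixed = [
            f"# {line}  # disabled: pulls stable torch onto Blackwell"
            if line.strip().lower().startswith("xformers") else line
            for line in lines
        ]
        if fixed != lines:
            req.write_text("\n".join(fixed) + "\n")
            touched.append(req)
    return touched

if __name__ == "__main__":
    # Default ComfyUI layout; adjust the path to your install
    for path in neutralize_xformers(Path("ComfyUI/custom_nodes")):
        print(f"patched {path}")
```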

Attention options

If you remove xformers, you still need an attention backend for Cross-Attention and Self-Attention. On Blackwell there are three realistic choices.

| Attention mechanism | Blackwell compatibility | Setup | Notes |
| --- | --- | --- | --- |
| PyTorch SDPA | fully compatible | --use-pytorch-cross-attention | easiest and stable, slightly slower than Sage |
| SageAttention | needs a custom wheel | use the KJNodes Patch node | about 30-35% faster, but build and version requirements are strict |
| FlashAttention-4 | compatible on Linux | pip install flash-attn-4 (Linux) | theoretically fastest, but hard to install on native Windows |

PyTorch SDPA

torch.nn.functional.scaled_dot_product_attention is built into PyTorch, so it works on sm_120 without extra libraries. Just add --use-pytorch-cross-attention. With torch.compile, performance gets close to specialized libraries.
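SDPA needs no extra packages, so you can confirm it works outside ComfyUI with a few lines (the shapes here are arbitrary demo values):

```python
import torch
import torch.nn.functional as F

# (batch, heads, sequence length, head dim) - arbitrary demo shapes
q = torch.randn(2, 8, 128, 64)
k = torch.randn(2, 8, 128, 64)
v = torch.randn(2, 8, 128, 64)

# PyTorch dispatches to the best available kernel (flash, memory-efficient,
# or plain math) for the current device; on sm_120 this is the path that
# --use-pytorch-cross-attention selects.
out = F.scaled_dot_product_attention(q, k, v)
print(out.shape)  # torch.Size([2, 8, 128, 64])
```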

SageAttention

SageAttention 2.2.0 / 3.0 applies INT8 quantization to Query and Key while keeping Value at 16-bit precision. It can be 30-35% faster than SDPA, but Windows setup is painful.

Windows installation pain

  1. SageAttention depends on Triton, and the standard Triton build is Linux-only. Windows needs the triton-windows fork.
  2. A source build with CUDA 13.0 and MSVC usually fails because of namespace conflicts in the PyTorch headers (MSVC C2872: 'std' ambiguous symbol).
  3. You have to use prebuilt wheels, but if the Nightly build date is even one day off, you get DLL load failed.
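Because the wheel must match the exact Nightly build, print the installed version strings first and compare them against the tags in the wheel's filename (the version shown in the comment is only an example value):

```python
import torch

# The dev date and +cuXXX suffix must match what the SageAttention wheel
# was compiled against, or importing it fails with "DLL load failed"
print(torch.__version__)   # e.g. 2.11.0.dev20260115+cu130 (example)
print(torch.version.cuda)  # e.g. 13.0 (None on a CPU-only build)
```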

Global enablement can produce black images

Turning on --use-sage-attention globally can produce black outputs in some models such as Wan 2.1 and Qwen.

Use KJNodes at the node level

The safer way is to use ComfyUI-KJNodes’ Patch Sage Attention KJ node.

  1. Remove --use-sage-attention from startup, so ComfyUI uses default PyTorch attention
  2. Insert the Patch Sage Attention KJ node right after model loading
  3. Set the backend to sageattn_qk_int8_pv_fp16_cuda

You may still see Using xformers attention in VAE in the console. That is only a VAE encode/decode fallback notice, not a sign that the environment regressed.

NVFP4 quantization cuts VRAM dramatically

One of Blackwell’s headline features is native 4-bit floating point support, called NVFP4. Unlike ordinary post-training quantization, it is a hardware-accelerated precision format designed for AI workloads.

VRAM and throughput

For a 14B model:

| Precision | VRAM use |
| --- | --- |
| FP16 | about 28 GB |
| FP8 | about 14 GB |
| NVFP4 | about 6.8 to 7.5 GB |

Inference is 200-300% faster than FP16. In real tests, FLUX.1-dev image generation dropped from over 40 seconds in BF16 to about 12 seconds.
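The table's figures follow almost directly from parameter count times bits per weight. A back-of-the-envelope check, counting weights only (real NVFP4 checkpoints land slightly above the ideal 7 GB because of per-block scale factors and metadata):

```python
params = 14e9  # 14B-parameter model

for name, bits in [("FP16", 16), ("FP8", 8), ("NVFP4", 4)]:
    gigabytes = params * bits / 8 / 1e9  # bits -> bytes -> GB
    print(f"{name}: ~{gigabytes:.0f} GB")
# FP16: ~28 GB
# FP8: ~14 GB
# NVFP4: ~7 GB
```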

The comfy-kitchen backend

The standard ComfyUI model loader does not know how to hand 4-bit weights directly to Tensor Cores. To use NVFP4, you need comfy-kitchen, a library from Comfy-Org.

pip install "comfy-kitchen[cublas]"

CUDA 13.0 (cu130) is required.

How to verify NVFP4 is actually working

If comfy-kitchen is missing or fails to initialize, you will see:

  • weight_dtype stays at default, forcing FP16 and using about 28 GB, which often leads to OOM
  • weight_dtype becomes fp8_e4m3fn, which upcasts the 4-bit weights to 8-bit and uses about 14 GB, so you lose the NVFP4 benefit

When it is working correctly:

  • the console shows dequantize_nvfp4 and quantize_nvfp4
  • you do not see manual cast: torch.float16 warnings during model load
  • TF32 enabled, cuDNN benchmark active appears

Driver stability

On RTX PRO 6000 or 5090, heavy workloads that saturate VRAM, such as high-resolution video generation, can blank the display and crash the driver. In some cases a normal Windows reboot is not enough and you need a physical power-cycle.

| Branch | Version | Notes |
| --- | --- | --- |
| Production Branch R580 | 582.16 | certified stable for RTX PRO 6000; clean install recommended for multi-GPU systems |
| Production Branch R595 | 595.71 (March 2, 2026) | fixes crash issues in Blackwell Max-Q multi-GPU setups |

Use DDU (Display Driver Uninstaller) to clean out old drivers before installing.

On Proxmox GPU passthrough

If you are passing the GPU through with VFIO on Linux, host crashes can happen. The workaround is vfio_iommu_type1 allow_unsafe_interrupts=1. If the vBIOS is old, update it through NVIDIA support.

ComfyUI VRAM flags

| Flag | Effect |
| --- | --- |
| --highvram | keep the model in VRAM; on 96 GB cards this is basically required |
| --async-offload | asynchronously move idle tensors back to system RAM |
| --pin-shared-memory | lock RAM pages to avoid disk-swap latency |
| --disable-cuda-malloc | use PyTorch's native allocator; after Blackwell patches this can be more stable |
| --fast-fp16 | optimize FP16 math |

Deployment choices

Docker on Linux / WSL2

If isolation matters most, Docker is the best option. ChiefNakor/comfyui-blackwell-docker provides a Blackwell-optimized image.

The benefits are strong:

  • automated dependency resolution for cu130 PyTorch
  • precompiled SageAttention
  • comfy-kitchen integrated
  • if a custom node tries to pull xformers, the host environment stays clean
  • if it breaks, you just throw away the container and rebuild

Native Windows

If you want to avoid WSL2 overhead, hiroki-abe-58/ComfyUI-Win-Blackwell is a useful reference.

Requirements:

  • pin the Python version to match the SageAttention wheel, usually 3.11 or 3.13
  • explicitly exclude xformers and use SDPA / triton-windows instead
  • manually audit dependencies when adding custom nodes

Setup steps

Driver and base environment

  1. Clean old drivers with DDU
  2. Clean-install the Production Branch driver (582.16 or 595.71)
  3. Install CUDA Toolkit 13.0
  4. Create a Python 3.11 or 3.13 virtual environment

PyTorch and ComfyUI

# Install PyTorch Nightly (cu130)
pip install --pre torch torchvision torchaudio \
  --index-url https://download.pytorch.org/whl/nightly/cu130

# Clone ComfyUI
git clone https://github.com/comfyanonymous/ComfyUI.git

Remove xformers from ComfyUI itself and from any custom-node requirements.txt.

Acceleration libraries

# comfy-kitchen for NVFP4
pip install "comfy-kitchen[cublas]"

# triton-windows for Windows
pip install triton-windows

# SageAttention wheel for sm_120
pip install sageattention-2.2.0+cu130.torch2.11-cp311-cp311-win_amd64.whl

# KJNodes for node-level SageAttention
cd custom_nodes
git clone https://github.com/kijai/ComfyUI-KJNodes.git

Startup flags

python main.py ^
  --disable-xformers ^
  --use-pytorch-cross-attention ^
  --highvram ^
  --async-offload ^
  --pin-shared-memory

Build the workflow

  1. Load an NVFP4-quantized checkpoint such as Wan 2.2 or FLUX.2
  2. Connect the model output to the Patch Sage Attention KJ node
  3. Set the backend to sageattn_qk_int8_pv_fp16_cuda
  4. Feed the patched model into KSampler