How to get ComfyUI running on Blackwell GPUs, and what can go wrong
If you buy a Blackwell-generation GPU such as an RTX 5090 or RTX PRO 6000 and try to run ComfyUI, you may hit `CUDA error: no kernel image is available for execution on the device`. The flood of Reddit and GitHub Discussions threads about this error is not evidence of broken hardware; the stable PyTorch releases simply do not yet support Blackwell's Compute Capability, sm_120.
I ran into the same issue with the RTX PRO 6000 in the Qwen-Image-Layered article, where ComfyUI broke during VAE decoding and I switched to diffusers. At the time I chalked it up to “waiting for ComfyUI support,” but the environment actually works if you move to PyTorch Nightly and remove xformers.
Why ComfyUI fails on Blackwell
sm_120 and CUDA build mismatch
CUDA distributes GPU code in two forms: binary cubin objects and assembly-style PTX. A cubin is tied to a specific Compute Capability, so an sm_90 build for Hopper will not run on sm_120 Blackwell hardware. If PTX is included, the driver can JIT-compile it, but the stable PyTorch releases (2.5 to 2.7) still lack the sm_120 PTX path or block it through an architecture check.
That leads to one of two symptoms:
- immediate death with `CUDA error: no kernel image is available for execution on the device`
- CPU fallback after GPU execution is abandoned, which is far too slow to use
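A quick way to tell which case you are in is to ask PyTorch itself which architectures the installed wheel was compiled for. A minimal diagnostic sketch; run it inside the ComfyUI virtual environment (all calls are standard `torch.cuda` API and degrade gracefully on a CPU-only build):

```python
# Quick diagnostic: does the installed PyTorch wheel know about this GPU?
import torch

print("torch:", torch.__version__, "| CUDA build:", torch.version.cuda)
print("CUDA available:", torch.cuda.is_available())

if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability(0)
    print(f"Compute Capability: sm_{major}{minor}")
    # Blackwell needs sm_120 (or a compute_120 PTX entry) in this list;
    # stable wheels typically stop around sm_90.
    print("compiled for:", torch.cuda.get_arch_list())
```

If `sm_120` (or `compute_120`) is missing from the printed list, no amount of ComfyUI configuration will help; the fix is a different PyTorch build.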
Blackwell hardware specs
RTX PRO 6000 Blackwell Workstation Edition:
| Item | Spec |
|---|---|
| VRAM | 96 GB GDDR7 |
| Memory bus | 512-bit |
| Bandwidth | about 1.8 TB/s |
| Tensor cores | 5th generation |
| Native precision | FP4 support |
| TDP | 600W (300W in Max-Q) |
The consumer RTX 5090 / 5080 / 5070 uses the same sm_120 architecture, so the compatibility issue is shared.
The xformers trap
The most annoying issue on Blackwell is what I call the xformers trap. Even after you build a correct PyTorch Nightly environment, GPU support can suddenly disappear.
What happens
```mermaid
graph TD
    A[Build Blackwell environment with<br/>PyTorch Nightly cu130] --> B[Install custom nodes]
    B --> C[xformers appears in requirements.txt]
    C --> D[pip dependency resolver runs]
    D --> E[xformers requires stable PyTorch]
    E --> F[PyTorch Nightly gets uninstalled]
    F --> G[pip downgrades to stable PyTorch]
    G --> H[sm_120 compatibility is lost]
    H --> I[environment collapses]
```
As of early 2026, the prebuilt xformers wheels on PyPI only support up to RTX 40-series (Ada Lovelace, sm_89). If a custom node pulls xformers in as a dependency, pip can silently uninstall Nightly and downgrade the environment.
How to deal with it
- always add `--disable-xformers` when starting
- manually remove xformers from custom-node `requirements.txt`
- use `--disable-all-custom-nodes` and `--whitelist-custom-nodes` when isolating problems
- append `.disabled` to a problematic node directory to quarantine it
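The manual `requirements.txt` audit can be scripted. This helper is my own sketch, not part of ComfyUI; it assumes the standard `custom_nodes/<node>/requirements.txt` layout and only reports, never edits:

```python
# Flag any custom node whose requirements.txt would pull in xformers.
from pathlib import Path

def find_xformers_requirements(custom_nodes_dir):
    """Return (node_name, offending_line) pairs that would pull in xformers."""
    hits = []
    for req in sorted(Path(custom_nodes_dir).glob("*/requirements.txt")):
        for line in req.read_text(encoding="utf-8").splitlines():
            spec = line.split("#", 1)[0].strip()  # drop inline comments
            if spec.lower().startswith("xformers"):
                hits.append((req.parent.name, spec))
    return hits

if __name__ == "__main__":
    for node, spec in find_xformers_requirements("custom_nodes"):
        print(f"{node}: comment out or delete '{spec}'")
```

Running it after every custom-node install catches the trap before pip's resolver does.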
Attention options
If you remove xformers, you still need an attention backend for Cross-Attention and Self-Attention. On Blackwell there are three realistic choices.
| Attention mechanism | Blackwell compatibility | Setup | Notes |
|---|---|---|---|
| PyTorch SDPA | fully compatible | `--use-pytorch-cross-attention` | easiest and stable, slightly slower than Sage |
| SageAttention | needs a custom wheel | use the KJNodes Patch node | about 30-35% faster, but build and version requirements are strict |
| FlashAttention-4 | compatible on Linux | `pip install flash-attn-4` (Linux) | theoretically fastest, but hard to install on native Windows |
PyTorch SDPA
torch.nn.functional.scaled_dot_product_attention is built into PyTorch, so it works on sm_120 without extra libraries. Just add --use-pytorch-cross-attention. With torch.compile, performance gets close to specialized libraries.
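A minimal sketch of the path that `--use-pytorch-cross-attention` selects, using the public `scaled_dot_product_attention` API (shapes are illustrative: batch, heads, sequence, head dimension):

```python
# SDPA in one call vs. a naive reference implementation.
import torch
import torch.nn.functional as F

q = torch.randn(1, 8, 64, 64)
k = torch.randn(1, 8, 64, 64)
v = torch.randn(1, 8, 64, 64)

# PyTorch dispatches to the fastest kernel available on the current device
# (flash, memory-efficient, or the math fallback).
out = F.scaled_dot_product_attention(q, k, v)

# Naive reference for comparison.
scale = q.shape[-1] ** -0.5
ref = torch.softmax((q @ k.transpose(-2, -1)) * scale, dim=-1) @ v
print(torch.allclose(out, ref, atol=1e-5))
```

Because the kernel choice happens inside this one call, no Blackwell-specific library needs to exist for it to work; that is exactly why it is the safe default on sm_120.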
SageAttention
SageAttention 2.2.0 / 3.0 applies INT8 quantization to Query and Key while keeping Value at 16-bit precision. It can be 30-35% faster than SDPA, but Windows setup is painful.
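To illustrate the quantization scheme (this is my own sketch of symmetric per-tensor INT8 round-tripping, not the actual SageAttention CUDA kernel):

```python
# Symmetric per-tensor INT8 quantization, as applied to Q and K;
# V stays in 16-bit precision in SageAttention.
import torch

def quantize_int8(x: torch.Tensor):
    """Symmetric per-tensor INT8 quantization; returns (int8 values, scale)."""
    scale = x.abs().max() / 127.0
    q = torch.clamp(torch.round(x / scale), -127, 127).to(torch.int8)
    return q, scale

x = torch.randn(64, 64)
x_int8, scale = quantize_int8(x)
x_restored = x_int8.float() * scale

# Rounding error is bounded by half a quantization step.
max_err = (x - x_restored).abs().max().item()
print(f"max abs error: {max_err:.5f} (step = {scale.item():.5f})")
```

The speedup comes from doing the `QK^T` matmul in INT8 on Tensor Cores; the bounded rounding error is why quality usually survives, and why fragile models like Wan or Qwen can still misbehave.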
Windows installation pain
- SageAttention depends on Triton, and the standard Triton build is Linux-only. Windows needs the `triton-windows` fork.
- A source build with CUDA 13.0 and MSVC usually fails because of namespace conflicts in the PyTorch headers (`MSVC C2872: 'std' ambiguous symbol`).
- You have to use prebuilt wheels, but if the Nightly build date is even one day off, you get `DLL load failed`.
Global enablement can produce black images
Turning on --use-sage-attention globally can produce black outputs in some models such as Wan 2.1 and Qwen.
Use KJNodes at the node level
The safer way is to use ComfyUI-KJNodes’ Patch Sage Attention KJ node.
- Remove `--use-sage-attention` from startup, so ComfyUI uses default PyTorch attention
- Insert the Patch Sage Attention KJ node right after model loading
- Set the backend to `sageattn_qk_int8_pv_fp16_cuda`
You may still see `Using xformers attention in VAE` in the console. That is only a VAE encode/decode fallback notice, not a sign that the environment regressed.
NVFP4 quantization cuts VRAM dramatically
One of Blackwell’s headline features is native 4-bit floating point support, called NVFP4. Unlike ordinary post-training quantization, it is a hardware-accelerated precision format designed for AI workloads.
VRAM and throughput
For a 14B model:
| Precision | VRAM use |
|---|---|
| FP16 | about 28 GB |
| FP8 | about 14 GB |
| NVFP4 | about 6.8 to 7.5 GB |
Inference is 200-300% faster than FP16. In real tests, FLUX.1-dev image generation dropped from over 40 seconds in BF16 to about 12 seconds.
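The table's numbers follow from simple arithmetic: weight memory is parameter count times bits per parameter. NVFP4 lands slightly above the ideal 7 GB in practice because per-block scale factors add metadata. A quick check:

```python
# Back-of-the-envelope weight memory for a 14B-parameter model.
def weight_gb(params_billion: float, bits_per_param: int) -> float:
    """Weight memory in GB, ignoring activations and framework overhead."""
    # 1e9 params * bits / 8 bytes-per-bit-group / 1e9 bytes-per-GB
    return params_billion * bits_per_param / 8

for name, bits in [("FP16", 16), ("FP8", 8), ("NVFP4", 4)]:
    print(f"{name}: {weight_gb(14, bits):.1f} GB")
# prints FP16: 28.0 GB, FP8: 14.0 GB, NVFP4: 7.0 GB
```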
The comfy-kitchen backend
The standard ComfyUI model loader does not know how to hand 4-bit weights directly to Tensor Cores. To use NVFP4, you need comfy-kitchen, a library from Comfy-Org.
```shell
pip install comfy-kitchen[cublas]
```
CUDA 13.0 (cu130) is required.
How to verify NVFP4 is actually working
If comfy-kitchen is missing or fails to initialize, you will see:
- `weight_dtype` stays at `default`, forcing FP16 and using about 28 GB, which often leads to OOM
- `weight_dtype` becomes `fp8_e4m3fn`, which upcasts the 4-bit weights to 8-bit and uses about 14 GB, so you lose the NVFP4 benefit
When it is working correctly:
- the console shows `dequantize_nvfp4` and `quantize_nvfp4`
- you do not see `manual cast: torch.float16` warnings during model load
- `TF32 enabled, cuDNN benchmark active` appears
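These checks can be automated by grepping the console output. A hypothetical helper; the marker strings are the ones listed above, the function itself is my own sketch:

```python
# Scan a captured ComfyUI console log for NVFP4 health markers.
NVFP4_MARKERS = ("dequantize_nvfp4", "quantize_nvfp4")
FALLBACK_MARKER = "manual cast: torch.float16"

def nvfp4_active(log_text: str) -> bool:
    """True if the NVFP4 kernels ran and no FP16 fallback was logged."""
    has_nvfp4 = any(m in log_text for m in NVFP4_MARKERS)
    return has_nvfp4 and FALLBACK_MARKER not in log_text

good_log = "loading model... dequantize_nvfp4 selected... done"
bad_log = "loading model... manual cast: torch.float16 ... done"
print(nvfp4_active(good_log), nvfp4_active(bad_log))  # → True False
```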
Driver stability
On RTX PRO 6000 or 5090, heavy workloads that saturate VRAM, such as high-resolution video generation, can blank the display and crash the driver. In some cases a normal Windows reboot is not enough and you need a physical power-cycle.
Recommended drivers
| Branch | Version | Notes |
|---|---|---|
| Production Branch R580 | 582.16 | certified stable for RTX PRO 6000; clean install recommended for multi-GPU systems |
| Production Branch R595 | 595.71 (March 2, 2026) | fixes crash issues in Blackwell Max-Q multi-GPU setups |
Use DDU (Display Driver Uninstaller) to clean out old drivers before installing.
On Proxmox GPU passthrough
If you are passing the GPU through with VFIO on Linux, host crashes can happen. The usual workaround is setting the `vfio_iommu_type1` module option `allow_unsafe_interrupts=1`. If the vBIOS is old, update it through NVIDIA support.
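A sketch of how to persist that module option on a typical Linux host; the file name under `/etc/modprobe.d/` is my own choice, the option itself is the standard `vfio_iommu_type1` parameter:

```shell
# Persist the allow_unsafe_interrupts workaround across reboots.
echo "options vfio_iommu_type1 allow_unsafe_interrupts=1" | \
    sudo tee /etc/modprobe.d/vfio-iommu.conf

# Apply immediately if the module is already loaded:
# echo 1 | sudo tee /sys/module/vfio_iommu_type1/parameters/allow_unsafe_interrupts
```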
ComfyUI VRAM flags
| Flag | Effect |
|---|---|
| `--highvram` | keep the model in VRAM; on 96 GB cards this is basically required |
| `--async-offload` | asynchronously move idle tensors back to system RAM |
| `--pin-shared-memory` | lock RAM pages to avoid disk-swap latency |
| `--disable-cuda-malloc` | use PyTorch's native allocator; after Blackwell patches this can be more stable |
| `--fast-fp16` | optimize FP16 math |
Deployment choices
Docker on Linux / WSL2
If isolation matters most, Docker is the best option. ChiefNakor/comfyui-blackwell-docker provides a Blackwell-optimized image.
The benefits are strong:
- automated dependency resolution for cu130 PyTorch
- precompiled SageAttention
- comfy-kitchen integrated
- if a custom node tries to pull xformers, the host environment stays clean
- if it breaks, you just throw away the container and rebuild
Native Windows
If you want to avoid WSL2 overhead, hiroki-abe-58/ComfyUI-Win-Blackwell is a useful reference.
Requirements:
- pin the Python version to match the SageAttention wheel, usually 3.11 or 3.13
- explicitly exclude xformers and use SDPA / triton-windows instead
- manually audit dependencies when adding custom nodes
Setup steps
Driver and base environment
- Clean old drivers with DDU
- Clean-install the Production Branch driver (582.16 or 595.71)
- Install CUDA Toolkit 13.0
- Create a Python 3.11 or 3.13 virtual environment
PyTorch and ComfyUI
```shell
# Install PyTorch Nightly (cu130)
pip install --pre torch torchvision torchaudio \
  --index-url https://download.pytorch.org/whl/nightly/cu130

# Clone ComfyUI
git clone https://github.com/comfyanonymous/ComfyUI.git
```
Remove xformers from ComfyUI itself and from any custom-node requirements.txt.
Acceleration libraries
```shell
# comfy-kitchen for NVFP4
pip install comfy-kitchen[cublas]

# triton-windows for Windows
pip install triton-windows

# SageAttention wheel for sm_120
pip install sageattention-2.2.0+cu130.torch2.11-cp311-cp311-win_amd64.whl

# KJNodes for node-level SageAttention
cd custom_nodes
git clone https://github.com/kijai/ComfyUI-KJNodes.git
```
Startup flags
```bat
python main.py ^
  --disable-xformers ^
  --use-pytorch-cross-attention ^
  --highvram ^
  --async-offload ^
  --pin-shared-memory
```
Build the workflow
- Load an NVFP4-quantized checkpoint such as Wan 2.2 or FLUX.2
- Connect the model output to the Patch Sage Attention KJ node
- Set the backend to `sageattn_qk_int8_pv_fp16_cuda`
- Feed the patched model into KSampler