
Running ComfyUI + WAI-Illustrious on an RTX 4060 Laptop (8GB VRAM)

I tried running WAI-Illustrious (SDXL-based) on my laptop, which has an RTX 4060 Laptop GPU with 8GB of VRAM. I expected 8GB to be cutting it close, but 1024x1024 generation works fine without the --lowvram flag, even with a LoRA loaded.

Environment

Item | Spec
GPU | NVIDIA GeForce RTX 4060 Laptop (VRAM 8GB)
OS | Windows 11 Home
Driver | 576.02
CUDA | 12.9
ComfyUI | v0.15.0 (portable, CUDA 12.8 build)
Model | WAI-Illustrious SDXL v16.0 (6.5GB)

Setup

Download the NVIDIA CUDA 12.8 build (ComfyUI_windows_portable_nvidia_cu128.7z) from ComfyUI Releases. The 12.8 build runs fine with a driver that reports CUDA 12.9.

Extract to C:\works\ComfyUI_windows_portable\. Avoid spaces or non-ASCII characters in the path.

Download the v16.0 checkpoint (waiIllustriousSDXL_v160.safetensors, 6.5GB) from Civitai and place it in ComfyUI\models\checkpoints\.

run_nvidia_gpu.bat won’t work unless the current directory is ComfyUI_windows_portable\ (it uses relative paths internally). Either double-click it from Explorer or cd into the directory first.
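If you would rather start ComfyUI from a script, the working-directory requirement can be made explicit. The following is a minimal sketch, assuming the .bat file runs the embedded interpreter as python_embeded\python.exe -s ComfyUI\main.py --windows-standalone-build (check your own copy of the .bat, since this is an assumption). The helper names build_launch and launch are hypothetical.

```python
import subprocess
from pathlib import Path

# Install location taken from the extraction step; adjust to your own path.
PORTABLE_DIR = Path(r"C:\works\ComfyUI_windows_portable")


def build_launch(portable_dir, extra_args=()):
    """Return (command, cwd) mirroring what run_nvidia_gpu.bat is assumed to run."""
    portable_dir = Path(portable_dir)
    command = [
        str(portable_dir / "python_embeded" / "python.exe"),
        "-s",
        str(Path("ComfyUI") / "main.py"),  # kept relative on purpose, like the .bat
        "--windows-standalone-build",
        *extra_args,
    ]
    return command, portable_dir


def launch(portable_dir=PORTABLE_DIR, extra_args=()):
    """Start ComfyUI with the portable folder as the working directory."""
    command, cwd = build_launch(portable_dir, extra_args)
    # Passing cwd is what lets the relative ComfyUI/main.py path resolve.
    return subprocess.Popen(command, cwd=cwd)
```

Calling launch() from any directory should then behave like double-clicking the .bat, and extra flags such as --lowvram can be passed through extra_args.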

Workflow

Basic setup: Checkpoint Loader -> CLIP Text Encoder -> KSampler -> VAE Decode -> Save.

Basic workflow
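The same graph can also be written in ComfyUI's API (JSON) format and queued over HTTP. This is a sketch rather than an export of the workflow pictured: the node IDs, the filename_prefix value, and the helper names are arbitrary choices, and the server address assumes ComfyUI's default 127.0.0.1:8188. An Empty Latent Image node is included because KSampler needs a latent input, even though the chain above leaves it implicit.

```python
import json
import urllib.request


def build_basic_workflow(width=1024, height=1024, seed=0):
    """Return the basic graph as an API-format dict.

    Links are [source_node_id, output_index]; the checkpoint loader
    outputs MODEL (0), CLIP (1) and VAE (2).
    """
    return {
        "1": {"class_type": "CheckpointLoaderSimple",
              "inputs": {"ckpt_name": "waiIllustriousSDXL_v160.safetensors"}},
        "2": {"class_type": "CLIPTextEncode",  # positive prompt
              "inputs": {"clip": ["1", 1],
                         "text": "1girl, general, masterpiece, best quality, amazing quality,"}},
        "3": {"class_type": "CLIPTextEncode",  # negative prompt
              "inputs": {"clip": ["1", 1],
                         "text": "bad quality, worst quality, worst detail, sketch, censor,"}},
        "4": {"class_type": "EmptyLatentImage",
              "inputs": {"width": width, "height": height, "batch_size": 1}},
        "5": {"class_type": "KSampler",
              "inputs": {"model": ["1", 0], "positive": ["2", 0], "negative": ["3", 0],
                         "latent_image": ["4", 0], "seed": seed, "steps": 20, "cfg": 5.0,
                         "sampler_name": "euler_ancestral", "scheduler": "karras",
                         "denoise": 1.0}},
        "6": {"class_type": "VAEDecode",
              "inputs": {"samples": ["5", 0], "vae": ["1", 2]}},
        "7": {"class_type": "SaveImage",
              "inputs": {"images": ["6", 0], "filename_prefix": "wai"}},
    }


def queue_prompt(workflow, host="127.0.0.1:8188"):
    """POST the graph to a running ComfyUI server (default address assumed)."""
    body = json.dumps({"prompt": workflow}).encode("utf-8")
    request = urllib.request.Request(
        f"http://{host}/prompt", data=body,
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())
```

With ComfyUI running, queue_prompt(build_basic_workflow(seed=42)) should queue one 1024x1024 generation using the settings from the benchmark below.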

When using a LoRA, insert a LoRA Loader between the Checkpoint Loader and CLIP Encoder. This applies the LoRA to both the MODEL and CLIP outputs.

LoRA workflow
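In API-format terms, inserting the LoRA Loader means rewiring every link that pointed at the checkpoint's MODEL (output 0) or CLIP (output 1) so that it points at the LoRA Loader instead, while the VAE link (output 2) stays on the checkpoint. Here is a rough sketch of that rewiring. The function name insert_lora, the node ID "lora", and the file name my_character.safetensors are hypothetical, and the demo graph is trimmed to the links that matter.

```python
def insert_lora(workflow, loader_id, lora_name, strength=1.0, lora_id="lora"):
    """Return a copy of an API-format graph with a LoraLoader spliced in."""
    patched = {
        node_id: {"class_type": node["class_type"], "inputs": dict(node["inputs"])}
        for node_id, node in workflow.items()
    }
    for node in patched.values():
        for key, value in list(node["inputs"].items()):
            # Links look like [source_node_id, output_index]; 0 = MODEL, 1 = CLIP.
            is_link = isinstance(value, list) and len(value) == 2
            if is_link and value[0] == loader_id and value[1] in (0, 1):
                node["inputs"][key] = [lora_id, value[1]]
    # The LoRA Loader itself reads MODEL and CLIP from the checkpoint.
    patched[lora_id] = {
        "class_type": "LoraLoader",
        "inputs": {"model": [loader_id, 0], "clip": [loader_id, 1],
                   "lora_name": lora_name,
                   "strength_model": strength, "strength_clip": strength},
    }
    return patched


# Trimmed demo graph: only the links relevant to the rewiring are shown.
graph = {
    "1": {"class_type": "CheckpointLoaderSimple",
          "inputs": {"ckpt_name": "waiIllustriousSDXL_v160.safetensors"}},
    "2": {"class_type": "CLIPTextEncode", "inputs": {"clip": ["1", 1], "text": "1girl"}},
    "5": {"class_type": "KSampler", "inputs": {"model": ["1", 0]}},
    "6": {"class_type": "VAEDecode", "inputs": {"vae": ["1", 2]}},
}
with_lora = insert_lora(graph, "1", "my_character.safetensors")
```

After the call, the text encoder and KSampler read from the LoRA Loader, which is why the LoRA affects both the MODEL and CLIP paths.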

Benchmark

Common settings: euler_ancestral / Karras scheduler / CFG 5 / 20 steps / denoise 1.0

Positive: 1girl, general, masterpiece, best quality, amazing quality,

Negative: bad quality, worst quality, worst detail, sketch, censor,

Checkpoint Only

Resolution | it/s | KSampler | Total | VRAM Peak
512x512 | 5.90 | 3s | 4.90s | ~4.9GB
768x768 | 2.45 | 8s | 11.24s | ~4.9GB
1024x1024 | 1.47 | 13s | 15.81s | ~5.6GB

At every resolution, ComfyUI logged full load: True, meaning the entire model was loaded into VRAM. --lowvram was not needed.
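As a sanity check, the KSampler column should be roughly the step count divided by it/s, and whatever remains of the total is model loading, text encoding, VAE decode, and saving. The sketch below redoes that arithmetic from the table values. The split into "sampling" and "other" is my own framing, not something ComfyUI reports.

```python
STEPS = 20  # from the common settings

# resolution -> (it/s, total seconds), copied from the table
RUNS = {
    "512x512": (5.90, 4.90),
    "768x768": (2.45, 11.24),
    "1024x1024": (1.47, 15.81),
}


def sampler_seconds(it_per_s, steps=STEPS):
    """Time spent in KSampler, assuming a steady iteration rate."""
    return steps / it_per_s


for resolution, (it_per_s, total_s) in RUNS.items():
    sampling = sampler_seconds(it_per_s)
    other = total_s - sampling  # everything outside the sampling loop
    print(f"{resolution}: sampling {sampling:.1f}s, other {other:.1f}s")
```

The computed sampling times (3.4s, 8.2s, and 13.6s) line up with the 3s, 8s, and 13s in the table, so sampling dominates the total at 1024x1024.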

1024x1024:

1024x1024 result 1

1024x1024 result 2

768x768:

768x768 result

512x512:

512x512 result

With LoRA

Resolution | it/s | Total | Without LoRA
512x512 | 5.83 | 5.98s | 4.90s
768x768 | 2.82 | 7.99s | 11.24s
1024x1024 | 1.63 | 15.62s | 15.81s

The first generation with the LoRA took 10.37s at 512x512 because the LoRA had to be loaded, but once it was cached the total dropped to 5.98s. KSampler it/s is essentially the same as without the LoRA, so the overhead is minimal.

LoRA 512x512

LoRA 768x768

LoRA 1024x1024

Character LoRA Results

Generated with a custom character LoRA. The character’s features are captured well.

20 steps:

Character LoRA 20 steps

25 steps:

Character LoRA 25 steps

25 steps produces more stable linework and fewer finger artifacts.

Comparison with M1 Max

For reference, the same 1024x1024 generation takes about 40 seconds on an M1 Max (MPS backend). At 15.81s, the RTX 4060 Laptop is roughly 2.5x faster, likely thanks to CUDA optimizations.

How VRAM Is Used

At idle (right after launching ComfyUI), VRAM usage is 91MiB. The model is loaded into VRAM only when the first generation request comes in.
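One way to take readings like this is to poll nvidia-smi, which ships with the NVIDIA driver. A minimal sketch follows, with hypothetical helper names; it reports used and total memory in MiB for each GPU.

```python
import subprocess

# Ask nvidia-smi for used/total memory as bare CSV numbers (MiB).
QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_memory(csv_text):
    """Parse 'used, total' pairs (MiB), one line per GPU."""
    rows = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field) for field in line.split(","))
        rows.append((used, total))
    return rows


def read_vram():
    """Run nvidia-smi and return [(used_mib, total_mib), ...]."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_memory(result.stdout)
```

Calling read_vram() before the first generation and again during sampling should show the jump from the idle figure to the loaded model.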

The biggest VRAM consumer during generation is the KSampler. It runs the UNet forward pass at each step, so the SDXL UNet (~4.9GB) occupies VRAM throughout. CLIP text encoding finishes instantly, and VAE decode only runs once.

Auto-Unload After Generation

After generation completes, ComfyUI’s memory manager moves the UNet weights from VRAM to system RAM. For 1024x1024, about 3GB is freed and only 1.8GB remains in VRAM.

Unloaded partially: 3056.88 MB freed, 1840.21 MB remains loaded
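Adding the two numbers in that log line gives the size of what was resident before the unload. The sketch below parses the line with a regular expression written against this exact message, so treat the pattern as an assumption if your ComfyUI version words it differently.

```python
import re

# Pattern matched against the log line quoted above.
LOG_PATTERN = re.compile(
    r"Unloaded partially: ([\d.]+) MB freed, ([\d.]+) MB remains loaded"
)


def parse_unload(line):
    """Return freed, remaining and combined MB, or None if the line does not match."""
    match = LOG_PATTERN.search(line)
    if match is None:
        return None
    freed = float(match.group(1))
    remaining = float(match.group(2))
    return {"freed_mb": freed, "remaining_mb": remaining,
            "total_mb": freed + remaining}


info = parse_unload(
    "Unloaded partially: 3056.88 MB freed, 1840.21 MB remains loaded"
)
print(f"{info['total_mb']:.2f} MB was resident before the unload")
```

The combined figure is about 4897 MB, which matches the ~4.9GB quoted for the SDXL UNet.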

This is intentional behavior, and it’s a lifesaver on 8GB. Here’s why:

  • Reserves VRAM for the next operation: Frees up space for VAE decode, LoRA patching, etc.
  • Prevents VRAM fragmentation: Keeping weights loaded continuously can fragment memory, making it impossible to allocate large contiguous blocks
  • WDDM considerations: Laptop GPUs share VRAM with display output, so hogging too much can destabilize the system

The eviction target is system RAM, not disk. On the next generation, it only needs a RAM-to-VRAM transfer, which is faster than the initial disk-to-VRAM load. You can keep models resident in VRAM with --highvram, but that’s not recommended on 8GB. (--disable-smart-memory does the opposite: it makes ComfyUI offload to system RAM more aggressively.)

Is 8GB Enough?

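WAI-Illustrious ran comfortably both standalone and with LoRA. I initially assumed “the checkpoint is 6.5GB, so 8GB will be tight,” but peak usage topped out at 5.6GB in practice. The 6.5GB file bundles the UNet, the text encoders, and the VAE, and only the UNet (~4.9GB) has to stay resident during sampling. fp16 inference and ComfyUI’s memory management make the difference.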
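A back-of-the-envelope estimate shows why fp16 matters here. The sketch assumes the SDXL UNet has roughly 2.6 billion parameters, which is the approximate figure from the SDXL paper and not something measured in this article.

```python
UNET_PARAMS = 2.6e9  # assumption: approximate SDXL UNet parameter count
BYTES_PER_PARAM = {"fp16": 2, "fp32": 4}


def weights_gib(params, dtype):
    """Size of the weights alone, in GiB, for the given precision."""
    return params * BYTES_PER_PARAM[dtype] / 2**30


for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {weights_gib(UNET_PARAMS, dtype):.1f} GiB")
```

At fp16 the weights come to about 4.8 GiB, in line with the ~4.9GB observed for the UNet, while fp32 would need about 9.7 GiB and would not fit in 8GB at all.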

Cases where you might run out of VRAM:

  • ControlNet: ControlNet models add 1-2GB on top
  • Multiple LoRAs: Each LoRA is small, but they add up
  • High-res upscaling: 2048x2048 and above can cause VAE decode spikes

In those cases, launching with --lowvram can prevent OOM, but it swaps UNet parts between CPU and GPU so it’s slower.
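To get a feel for how close those cases come to the limit, you can add the extra cost to the measured 1024x1024 peak. This is a rough sketch using only the figures in this article. It treats the full 8GB as usable, which is optimistic, since the display and other processes also take some VRAM.

```python
TOTAL_VRAM_GB = 8.0
MEASURED_PEAK_GB = 5.6  # 1024x1024 peak from the benchmark table


def headroom_gb(extra_gb, peak_gb=MEASURED_PEAK_GB, total_gb=TOTAL_VRAM_GB):
    """VRAM left over after adding extra_gb on top of the measured peak."""
    return total_gb - (peak_gb + extra_gb)


# ControlNet range (1-2GB) taken from the list above.
for label, extra in [("ControlNet, low estimate", 1.0),
                     ("ControlNet, high estimate", 2.0)]:
    print(f"{label}: {headroom_gb(extra):.1f}GB headroom")
```

A 2GB ControlNet leaves only about 0.4GB of headroom on paper, which is where --lowvram starts to look like a reasonable trade.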


I used to run a workflow on the M1 Max where I’d generate at a smaller size, lock the seed, upscale with Hires. fix, then fix faces with FaceDetailer. With the RTX 4060 Laptop, generating 1024x1024 directly takes about 16 seconds, versus about 40 seconds on the M1 Max. The CUDA environment also gets full access to ComfyUI custom nodes and the latest optimizations without waiting for MPS backend support.

The price difference between the two machines is about 100,000 yen (~$670). For image generation specifically, a Windows laptop with an RTX 4060 Laptop GPU is hard to beat on value.