# Running WAN 2.2 in ComfyUI on an RTX 4060 (8GB VRAM)
In the Mac article, I ran WAN 2.2 on an M1 Max (64GB) and a 2-second video took 82 minutes. This time, I'm trying Windows + RTX 4060 (8GB VRAM).
The end result: the 14B Rapid distilled model with --lowvram offloading produced this in 111 seconds.
Getting there took three failed attempts with the 5B model. Here's the record.
## Environment
- GPU: NVIDIA GeForce RTX 4060 (8GB VRAM)
- RAM: 32GB
- OS: Windows 11
- ComfyUI: portable version
## Model Selection
WAN 2.2 has multiple models:
| Model | Parameters | Use | FP16 size |
|---|---|---|---|
| TI2V 5B | 5B | Text+image → video | 9.3GB |
| T2V 14B | 14B | Text → video | 26.6GB |
| I2V 14B | 14B | Image → video | 26.6GB |
The RTX 4060's 8GB VRAM can't fit the 14B model even in fp8 (over 13GB), and the 5B model is 9.3GB in fp16, which doesn't fit either.
Kijai distributes fp8-quantized versions, which bring the 5B model down to 5.28GB. That fits in 8GB.
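As a sanity check, weight size is roughly parameter count times bytes per parameter. The sketch below (plain Python, no ComfyUI needed) reproduces the rough numbers above; real checkpoints differ slightly because of metadata and layers kept in higher precision, which is why the actual fp8 5B file is 5.28GB rather than exactly 5GB.

```python
# Rough weight-size estimate: parameters × bytes per parameter.
def weights_gb(params_billion: float, bytes_per_param: float) -> float:
    # 1e9 parameters per "billion" cancels with 1e9 bytes per GB
    return params_billion * bytes_per_param

for precision, bytes_pp in [("fp16", 2.0), ("fp8", 1.0)]:
    for name, params in [("TI2V 5B", 5.0), ("T2V/I2V 14B", 14.0)]:
        size = weights_gb(params, bytes_pp)
        fits = "fits" if size < 8 else "does not fit"
        print(f"{name} {precision}: ~{size:.0f}GB, {fits} in 8GB VRAM")
```

Even the fp8 5B fit is tight once activations and the VAE need room alongside the weights, which is where --lowvram comes in.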
## Files to Download
| File | Size | Location |
|---|---|---|
| Wan2_2-TI2V-5B_fp8_e4m3fn_scaled_KJ.safetensors | 5.28GB | models/diffusion_models/ |
| umt5_xxl_fp8_e4m3fn_scaled.safetensors | 6.3GB | models/text_encoders/ |
| wan2.2_vae.safetensors | 1.4GB | models/vae/ |
About 13GB total. The text encoder is a hefty 6.3GB, but ComfyUI loads and unloads models sequentially, so not everything needs to sit in VRAM at once.
## Launching ComfyUI
The standard run_nvidia_gpu.bat won't have enough VRAM. Launch with the --lowvram option instead:

```bat
cd C:\works\ComfyUI_windows_portable
.\python_embeded\python.exe -s ComfyUI\main.py --lowvram
```
--lowvram loads models into VRAM only when needed and offloads to CPU RAM when done. Slower generation but better than stopping with OOM. 32GB main memory is plenty for offloading.
## Workflow
Based on the official WAN 2.2 5B I2V workflow, adjusted for RTX 4060.
### Node Structure

```
LoadImage → Wan22ImageToVideoLatent → KSampler → VAEDecode → SaveWEBM
                    ↑                    ↑
                VAELoader   UNETLoader → ModelSamplingSD3
                                         ↑
                        CLIPLoader → CLIPTextEncode (positive/negative)
```
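If you want to drive this graph without the UI, ComfyUI exposes an HTTP API: export the workflow with "Save (API Format)" and POST it to the /prompt endpoint. A minimal sketch, assuming ComfyUI is running locally on its default port 8188; the filename in the usage comment is a placeholder for your own export.

```python
import json
import urllib.request

def build_payload(workflow: dict, client_id: str = "wan22-test") -> bytes:
    """Wrap an API-format node graph in the JSON body /prompt expects."""
    return json.dumps({"prompt": workflow, "client_id": client_id}).encode("utf-8")

def queue_prompt(workflow: dict, host: str = "127.0.0.1", port: int = 8188) -> dict:
    """POST the workflow to a running ComfyUI instance and return its response."""
    req = urllib.request.Request(
        f"http://{host}:{port}/prompt",
        data=build_payload(workflow),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Usage (with ComfyUI running; "wan22_5b_i2v_api.json" is a placeholder
# name for your own "Save (API Format)" export):
#   with open("wan22_5b_i2v_api.json", encoding="utf-8") as f:
#       queue_prompt(json.load(f))
```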
### Prompt
For I2V, the prompt describes the motion, not the input image. The image is already passed to the model, so the prompt specifies "what to make happen."
Negative:

```
ugly, blurry, low quality, worst quality, static, deformed, disfigured,
extra fingers, bad hands, bad face, watermark, text, subtitle
```
## Trial and Error
### Attempt 1: 480x480 / 30 steps / CFG 5 / Image Description Prompt
Initially wrote the input image description as the positive prompt.

```
an anime girl with brown hair in a side ponytail, wearing a white school shirt
with a red tie, winking and making a peace sign with both hands, gentle wind
blowing her hair, cheerful expression
```
| Item | Value |
|---|---|
| Resolution | 480 x 480 |
| Steps | 30 |
| CFG | 5 |
```
loaded completely; 5619.19 MB usable, 4856.42 MB loaded, full load: True
30/30 [00:57<00:00, 1.93s/it]
Prompt executed in 94.93 seconds
```
The model loaded fully into VRAM (full load: True) and generation took 95 seconds, but the shape was distorted and there was barely any motion.
### Attempt 2: 768x480 / 50 steps / CFG 3.5 / Image Description Prompt
Raised resolution to the recommended 480p size of 768x480, increased steps to 50, lowered CFG to 3.5.
| Item | Value |
|---|---|
| Resolution | 768 x 480 |
| Steps | 50 |
| CFG | 3.5 |
```
loaded completely; 5418.30 MB usable, 4856.42 MB loaded, full load: True
50/50 [02:32<00:00, 3.04s/it]
loaded partially; 774.88 MB usable, 612.87 MB loaded, 731.20 MB offloaded
Prompt executed in 178.52 seconds
```
The diffusion model loaded fully, but the VAE decode ran out of memory and had to be partially offloaded (731MB offloaded). The result was even worse than attempt 1: the 768x480 latent was too large for the VAE to decode properly.
### Attempt 3: 480x480 / 50 steps / CFG 5 / Motion Instruction Prompt
Back to 480x480 to ensure a full VAE load. Changed the prompt to motion instructions.

```
gentle wind blowing hair, the girl blinks and smiles, slight head tilt,
hair sways naturally
```
| Item | Value |
|---|---|
| Resolution | 480 x 480 |
| Steps | 50 |
| CFG | 5 |
```
loaded completely; 5579.19 MB usable, 4856.42 MB loaded, full load: True
50/50 [01:35<00:00, 1.90s/it]
loaded completely; 1790.47 MB usable, 1344.09 MB loaded, full load: True
Prompt executed in 113.93 seconds
```
Both the diffusion model and the VAE loaded fully, and it completed in 114 seconds. Even with the motion-instruction prompt, quality was still rough.
Three attempts with 5B model, all rough. The parameters and prompts aren’t the issue — 5B fp8 at 480x480 is simply the quality ceiling.
## Retrying with the 14B Rapid Model
Phr00t/WAN2.2-14B-Rapid-AllInOne is a distilled, accelerated version of WAN 2.2 14B with text encoder and VAE integrated into a single all-in-one file. It’s 22GB.
22GB obviously won’t fit in 8GB VRAM, but ComfyUI’s --lowvram handles offloading to main memory. Inference runs by shuttling between VRAM and main memory.
### Workflow
A different node structure from the 5B one. AllInOne means a single CheckpointLoaderSimple outputs the model, text encoder, and VAE all at once.

```
CheckpointLoaderSimple → MODEL → ModelSamplingSD3 → KSampler → VAEDecode → SaveWEBM
                       → CLIP  → CLIPTextEncode (positive/negative)
                       → VAE   → WanImageToVideo
```
### Parameters
Being distilled, the model needs very few steps.
| Item | Value |
|---|---|
| Resolution | 480 x 480 |
| Frames | 33 |
| Steps | 4 |
| CFG | 1 |
| Sampler | euler_ancestral |
| Scheduler | beta |
Place in models/checkpoints/ (not diffusion_models/).
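For reference, here is how the settings from the table land on the KSampler node in an API-format export. This is a sketch: the node IDs and wiring indices are placeholders, and the input names are those of the stock KSampler node, so check everything against your own "Save (API Format)" export.

```python
# KSampler entry of an API-format ComfyUI workflow with the Rapid settings.
# Node IDs ("3", "1", "6", "7", "8") and the [node_id, output_index] wiring
# below are placeholders, not values from a real export.
ksampler = {
    "3": {
        "class_type": "KSampler",
        "inputs": {
            "model": ["1", 0],         # from ModelSamplingSD3
            "positive": ["6", 0],      # positive conditioning
            "negative": ["7", 0],      # negative conditioning
            "latent_image": ["8", 0],  # latent from WanImageToVideo
            "seed": 42,
            "steps": 4,                # distilled model: 4 steps is enough
            "cfg": 1.0,                # CFG 1 effectively turns guidance off
            "sampler_name": "euler_ancestral",
            "scheduler": "beta",
            "denoise": 1.0,
        },
    }
}
print(ksampler["3"]["inputs"]["steps"], ksampler["3"]["inputs"]["sampler_name"])
```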
### Result
```
Requested to load WAN21
loaded partially; 4906.73 MB usable, 4569.21 MB loaded, 11067.11 MB offloaded
4/4 [00:45<00:00, 11.46s/it]
Prompt executed in 111.41 seconds
```
4569MB of the 14B model in VRAM, 11067MB (~11GB) offloaded to main memory. 4 steps in 45 seconds, 111 seconds total.
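A quick check on those log numbers (simple arithmetic; the rest of the 22GB file, presumably the bundled text encoder and VAE, is handled in separate load passes and doesn't appear in this line):

```python
# Break down the --lowvram split reported in the log.
loaded_mb = 4569.21      # diffusion model weights resident in VRAM
offloaded_mb = 11067.11  # weights parked in system RAM
total_mb = loaded_mb + offloaded_mb
print(f"diffusion model total: {total_mb / 1024:.1f}GB")  # ~15.3GB
print(f"resident in VRAM: {loaded_mb / total_mb:.0%}")    # ~29%
```

So only about a third of the weights ever sit on the GPU at once, and the 11.46s/it step time is the price of shuttling the rest across the PCIe bus.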
Completely different from the 5B results. No shape distortion, hair sways and mouth moves. The animation is subtle, but the input character is preserved properly.
## Results Comparison
| Attempt | Model | Settings | Time | Quality |
|---|---|---|---|---|
| 1 | 5B fp8 | 480x480 / 30 steps / CFG 5 | 95s | Distorted + no motion |
| 2 | 5B fp8 | 768x480 / 50 steps / CFG 3.5 | 179s | VAE offload made it worse |
| 3 | 5B fp8 | 480x480 / 50 steps / CFG 5 | 114s | Distorted + little motion |
| 4 | 14B Rapid | 480x480 / 4 steps / CFG 1 | 111s | Shape preserved, hair + mouth movement |
4 steps with 14B Rapid is both faster and higher quality than 30-50 steps with 5B. The power of distillation.
## Different Input Image and Prompt
Swapped to a full-body, front-facing image in a bunny girl outfit and changed the prompt.

```
the girl hops in place like a bunny, bouncing up and down playfully,
bunny ear headband bouncing, cheerful jumping motion
```
I hoped for bunny-style hopping, but that much movement didn't come through; detailed prompt instructions are harder to land with a 4-step distilled model. Still, when the input image changes, the output's whole vibe changes with it. Image selection matters a lot.