# Running WAN 2.2 in ComfyUI on an RTX 4060 (8GB VRAM)
In the Mac article, I ran WAN 2.2 on an M1 Max (64GB) and a 2-second video took 82 minutes. This time, I'm trying Windows + RTX 4060 (8GB VRAM).
The end result: the 14B Rapid distilled model with --lowvram offloading produced this in 111 seconds.
Getting there took three failed attempts with the 5B model. Here's the record.
## Environment
- GPU: NVIDIA GeForce RTX 4060 (8GB VRAM)
- RAM: 32GB
- OS: Windows 11
- ComfyUI: portable version
## Model Selection
WAN 2.2 has multiple models:
| Model | Parameters | Use | FP16 size |
|---|---|---|---|
| TI2V 5B | 5B | Text+image → video | 9.3GB |
| T2V 14B | 14B | Text → video | 26.6GB |
| I2V 14B | 14B | Image → video | 26.6GB |
The RTX 4060's 8GB VRAM can't fit the 14B model even in fp8 (over 13GB), and the 5B model is 9.3GB in fp16, which doesn't fit either.
Kijai distributes fp8-quantized versions, which bring the 5B model down to 5.28GB. That fits in 8GB.
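As a sanity check, weight size is roughly parameter count times bytes per parameter. The sketch below (plain Python, no ComfyUI needed) reproduces the rough numbers above; real checkpoints differ slightly because of metadata and layers kept in higher precision, which is why the actual fp8 5B file is 5.28GB rather than exactly 5GB.

```python
# Rough weight-size estimate: parameters × bytes per parameter.
def weights_gb(params_billion: float, bytes_per_param: float) -> float:
    # 1e9 parameters per "billion" cancels with 1e9 bytes per GB
    return params_billion * bytes_per_param

for precision, bytes_pp in [("fp16", 2.0), ("fp8", 1.0)]:
    for name, params in [("TI2V 5B", 5.0), ("T2V/I2V 14B", 14.0)]:
        size = weights_gb(params, bytes_pp)
        fits = "fits" if size < 8 else "does not fit"
        print(f"{name} {precision}: ~{size:.0f}GB, {fits} in 8GB VRAM")
```

Even the fp8 5B fit is tight once activations and the VAE need room alongside the weights, which is where --lowvram comes in.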
## Files to Download
| File | Size | Location |
|---|---|---|
| Wan2_2-TI2V-5B_fp8_e4m3fn_scaled_KJ.safetensors | 5.28GB | models/diffusion_models/ |
| umt5_xxl_fp8_e4m3fn_scaled.safetensors | 6.3GB | models/text_encoders/ |
| wan2.2_vae.safetensors | 1.4GB | models/vae/ |
About 13GB total. The text encoder is a hefty 6.3GB, but ComfyUI loads and unloads models sequentially, so not everything needs to sit in VRAM at once.
## Launching ComfyUI
The standard run_nvidia_gpu.bat won't have enough VRAM. Launch with the --lowvram option instead:

```bat
cd C:\works\ComfyUI_windows_portable
.\python_embeded\python.exe -s ComfyUI\main.py --lowvram
```
--lowvram loads models into VRAM only when needed and offloads to CPU RAM when done. Slower generation but better than stopping with OOM. 32GB main memory is plenty for offloading.
## Workflow
Based on the official WAN 2.2 5B I2V workflow, adjusted for RTX 4060.
### Node Structure

```
LoadImage → Wan22ImageToVideoLatent → KSampler → VAEDecode → SaveWEBM
                    ↑                    ↑
                VAELoader   UNETLoader → ModelSamplingSD3
                                         ↑
                        CLIPLoader → CLIPTextEncode (positive/negative)
```
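If you want to drive this graph without the UI, ComfyUI exposes an HTTP API: export the workflow with "Save (API Format)" and POST it to the /prompt endpoint. A minimal sketch, assuming ComfyUI is running locally on its default port 8188; the filename in the usage comment is a placeholder for your own export.

```python
import json
import urllib.request

def build_payload(workflow: dict, client_id: str = "wan22-test") -> bytes:
    """Wrap an API-format node graph in the JSON body /prompt expects."""
    return json.dumps({"prompt": workflow, "client_id": client_id}).encode("utf-8")

def queue_prompt(workflow: dict, host: str = "127.0.0.1", port: int = 8188) -> dict:
    """POST the workflow to a running ComfyUI instance and return its response."""
    req = urllib.request.Request(
        f"http://{host}:{port}/prompt",
        data=build_payload(workflow),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Usage (with ComfyUI running; "wan22_5b_i2v_api.json" is a placeholder
# name for your own "Save (API Format)" export):
#   with open("wan22_5b_i2v_api.json", encoding="utf-8") as f:
#       queue_prompt(json.load(f))
```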
### Prompt
For I2V, the prompt describes the motion, not the input image. The image is already passed to the model, so the prompt specifies "what to make happen."
Negative:

```
ugly, blurry, low quality, worst quality, static, deformed, disfigured,
extra fingers, bad hands, bad face, watermark, text, subtitle
```
## Trial and Error
### Attempt 1: 480x480 / 30 steps / CFG 5 / Image Description Prompt
Initially wrote the input image description as the positive prompt.

```
an anime girl with brown hair in a side ponytail, wearing a white school shirt
with a red tie, winking and making a peace sign with both hands, gentle wind
blowing her hair, cheerful expression
```
| Item | Value |
|---|---|
| Resolution | 480 x 480 |
| Steps | 30 |
| CFG | 5 |
```
loaded completely; 5619.19 MB usable, 4856.42 MB loaded, full load: True
30/30 [00:57<00:00, 1.93s/it]
Prompt executed in 94.93 seconds
```
The model loaded fully into VRAM (full load: True) and generation took 95 seconds, but the shape was distorted and there was barely any motion.
### Attempt 2: 768x480 / 50 steps / CFG 3.5 / Image Description Prompt
Raised resolution to the recommended 480p size of 768x480, increased steps to 50, lowered CFG to 3.5.
| Item | Value |
|---|---|
| Resolution | 768 x 480 |
| Steps | 50 |
| CFG | 3.5 |
```
loaded completely; 5418.30 MB usable, 4856.42 MB loaded, full load: True
50/50 [02:32<00:00, 3.04s/it]
loaded partially; 774.88 MB usable, 612.87 MB loaded, 731.20 MB offloaded
Prompt executed in 178.52 seconds
```
The diffusion model loaded fully, but the VAE decode ran out of memory and had to be partially offloaded (731MB offloaded). The result was even worse than attempt 1: the 768x480 latent was too large for the VAE to decode properly.
### Attempt 3: 480x480 / 50 steps / CFG 5 / Motion Instruction Prompt
Back to 480x480 to ensure a full VAE load. Changed the prompt to motion instructions.

```
gentle wind blowing hair, the girl blinks and smiles, slight head tilt,
hair sways naturally
```
| Item | Value |
|---|---|
| Resolution | 480 x 480 |
| Steps | 50 |
| CFG | 5 |
```
loaded completely; 5579.19 MB usable, 4856.42 MB loaded, full load: True
50/50 [01:35<00:00, 1.90s/it]
loaded completely; 1790.47 MB usable, 1344.09 MB loaded, full load: True
Prompt executed in 113.93 seconds
```
Both the diffusion model and the VAE loaded fully, and it completed in 114 seconds. Even with the motion-instruction prompt, quality was still rough.
Three attempts with 5B model, all rough. The parameters and prompts aren’t the issue — 5B fp8 at 480x480 is simply the quality ceiling.
## Retrying with the 14B Rapid Model
Phr00t/WAN2.2-14B-Rapid-AllInOne is a distilled, accelerated version of WAN 2.2 14B with text encoder and VAE integrated into a single all-in-one file. It’s 22GB.
22GB obviously won’t fit in 8GB VRAM, but ComfyUI’s --lowvram handles offloading to main memory. Inference runs by shuttling between VRAM and main memory.
### Workflow
A different node structure from the 5B one. AllInOne means a single CheckpointLoaderSimple outputs the model, text encoder, and VAE all at once.

```
CheckpointLoaderSimple → MODEL → ModelSamplingSD3 → KSampler → VAEDecode → SaveWEBM
                       → CLIP  → CLIPTextEncode (positive/negative)
                       → VAE   → WanImageToVideo
```
### Parameters
Being distilled, the model needs very few steps.
| Item | Value |
|---|---|
| Resolution | 480 x 480 |
| Frames | 33 |
| Steps | 4 |
| CFG | 1 |
| Sampler | euler_ancestral |
| Scheduler | beta |
Place in models/checkpoints/ (not diffusion_models/).
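For reference, here is how the settings from the table land on the KSampler node in an API-format export. This is a sketch: the node IDs and wiring indices are placeholders, and the input names are those of the stock KSampler node, so check everything against your own "Save (API Format)" export.

```python
# KSampler entry of an API-format ComfyUI workflow with the Rapid settings.
# Node IDs ("3", "1", "6", "7", "8") and the [node_id, output_index] wiring
# below are placeholders, not values from a real export.
ksampler = {
    "3": {
        "class_type": "KSampler",
        "inputs": {
            "model": ["1", 0],         # from ModelSamplingSD3
            "positive": ["6", 0],      # positive conditioning
            "negative": ["7", 0],      # negative conditioning
            "latent_image": ["8", 0],  # latent from WanImageToVideo
            "seed": 42,
            "steps": 4,                # distilled model: 4 steps is enough
            "cfg": 1.0,                # CFG 1 effectively turns guidance off
            "sampler_name": "euler_ancestral",
            "scheduler": "beta",
            "denoise": 1.0,
        },
    }
}
print(ksampler["3"]["inputs"]["steps"], ksampler["3"]["inputs"]["sampler_name"])
```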
### Result
```
Requested to load WAN21
loaded partially; 4906.73 MB usable, 4569.21 MB loaded, 11067.11 MB offloaded
4/4 [00:45<00:00, 11.46s/it]
Prompt executed in 111.41 seconds
```
4569MB of the 14B model in VRAM, 11067MB (~11GB) offloaded to main memory. 4 steps in 45 seconds, 111 seconds total.
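A quick check on those log numbers (simple arithmetic; the rest of the 22GB file, presumably the bundled text encoder and VAE, is handled in separate load passes and doesn't appear in this line):

```python
# Break down the --lowvram split reported in the log.
loaded_mb = 4569.21      # diffusion model weights resident in VRAM
offloaded_mb = 11067.11  # weights parked in system RAM
total_mb = loaded_mb + offloaded_mb
print(f"diffusion model total: {total_mb / 1024:.1f}GB")  # ~15.3GB
print(f"resident in VRAM: {loaded_mb / total_mb:.0%}")    # ~29%
```

So only about a third of the weights ever sit on the GPU at once, and the 11.46s/it step time is the price of shuttling the rest across the PCIe bus.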
Completely different from the 5B results. No shape distortion, hair sways and mouth moves. The animation is subtle, but the input character is preserved properly.
## Results Comparison
| Attempt | Model | Settings | Time | Quality |
|---|---|---|---|---|
| 1 | 5B fp8 | 480x480 / 30 steps / CFG 5 | 95s | Distorted + no motion |
| 2 | 5B fp8 | 768x480 / 50 steps / CFG 3.5 | 179s | VAE offload made it worse |
| 3 | 5B fp8 | 480x480 / 50 steps / CFG 5 | 114s | Distorted + little motion |
| 4 | 14B Rapid | 480x480 / 4 steps / CFG 1 | 111s | Shape preserved, hair + mouth movement |
4 steps with 14B Rapid is both faster and higher quality than 30-50 steps with 5B. The power of distillation.
## Different Input Image and Prompt
Swapped to a full-body, front-facing image in a bunny girl outfit and changed the prompt.

```
the girl hops in place like a bunny, bouncing up and down playfully,
bunny ear headband bouncing, cheerful jumping motion
```
I hoped for bunny-style hopping, but that much movement didn't come through; detailed prompt instructions are harder to land with a 4-step distilled model. Still, when the input image changes, the output's whole vibe changes with it. Image selection matters a lot.