InfiniteTalk: Audio-Driven Lip Sync Built on Wan 2.1
InfiniteTalk, an audio-driven lip-sync model built on Wan 2.1, has been published as an official ComfyUI workflow. Given a still image and an audio file, it can generate a video where the mouth moves in sync with the speech.
What InfiniteTalk is
It is a native ComfyUI workflow, published at amao2001/ganloss-latent-space and built on models repackaged by Comfy-Org.
Processing flow:
- Input a still image, typically a face image
- Input an audio file such as WAV
- Analyze the audio with wav2vec2 to extract mouth-movement timing
- Generate the video with Wan 2.1 I2V
- Output the finished clip
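The steps above can be sketched as a simple function pipeline. Note that the function names below are illustrative placeholders, not actual ComfyUI node names; the real workflow wires these stages together as graph nodes.

```python
# Illustrative sketch of the workflow's data flow. The real pipeline runs
# as a ComfyUI node graph; these functions are placeholders for its stages.

def extract_speech_features(wav_path: str) -> str:
    # wav2vec2 encodes the waveform into per-frame speech features,
    # which carry the timing of the mouth movements
    return f"wav2vec2({wav_path})"

def generate_lipsync_video(image_path: str, features: str) -> str:
    # Wan 2.1 I2V, patched with the InfiniteTalk weights, conditions
    # the generated frames on the audio features
    return f"wan21_i2v({image_path}, {features})"

def infinitetalk(image_path: str, wav_path: str) -> str:
    features = extract_speech_features(wav_path)
    return generate_lipsync_video(image_path, features)

print(infinitetalk("face.png", "speech.wav"))
```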
The “Infinite” in the name refers to its ability to handle long audio input. Note that this is offline batch generation, not real-time synthesis.
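A common way long-audio models stay within a fixed generation window is to split the feature sequence into overlapping chunks and generate each chunk in turn. The sketch below is a generic illustration of that idea, not InfiniteTalk's actual scheduling code, and the window and overlap sizes are made up:

```python
def chunk_windows(n_frames: int, window: int, overlap: int) -> list[tuple[int, int]]:
    """Split n_frames into overlapping (start, end) windows.

    Illustrative only: the real model's window/overlap sizes differ.
    """
    step = window - overlap
    chunks = []
    start = 0
    while True:
        end = min(start + window, n_frames)
        chunks.append((start, end))
        if end == n_frames:
            break
        start += step
    return chunks

# e.g. 250 feature frames, 81-frame windows, 16-frame overlap
print(chunk_windows(250, 81, 16))
```

Each window overlaps the previous one so that motion stays continuous across chunk boundaries; total length is then limited only by the audio, not by the model's context window.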
Workflow variants
Two versions are available:
- single: for one character
- multi: for multiple characters, with a mask marking each character's region
In the multi version, you open the first frame in the mask editor and paint each character area separately.
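Conceptually, each painted mask is just a binary map telling the model which pixels belong to which speaker, so each audio track only drives its own region. A toy illustration (in practice the masks are painted by hand in the mask editor, not generated like this):

```python
# Toy 8x8 "frame": each mask is 1 inside one character's region, 0 elsewhere.
W, H = 8, 8

def rect_mask(x0: int, y0: int, x1: int, y1: int) -> list[list[int]]:
    """Binary mask that is 1 inside the rectangle [x0, x1) x [y0, y1)."""
    return [[1 if x0 <= x < x1 and y0 <= y < y1 else 0
             for x in range(W)] for y in range(H)]

mask_a = rect_mask(0, 0, 4, 8)   # left-hand character
mask_b = rect_mask(4, 0, 8, 8)   # right-hand character

# Each audio track animates only the pixels where its mask is 1,
# so the two regions should not overlap.
overlap = sum(a & b for ra, rb in zip(mask_a, mask_b)
              for a, b in zip(ra, rb))
print(overlap)  # 0
```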
Required models
| Directory | Model |
|---|---|
| diffusion_models | Wan2_1-I2V-14B-480p_fp8_e4m3fn_scaled_KJ.safetensors |
| text_encoders | umt5_xxl_fp8_e4m3fn_scaled.safetensors |
| model_patches | wan2.1_infiniteTalk_single_fp16.safetensors |
| model_patches | wan2.1_infiniteTalk_multi_fp16.safetensors |
| audio_encoders | wav2vec2-chinese-base_fp16.safetensors |
| vae | Wan2_1_VAE_bf16.safetensors |
| loras | lightx2v_I2V_14B_480p_cfg_step_distill_rank64_bf16.safetensors |
The models are available from Hugging Face:
- Kijai/WanVideo_comfy_fp8_scaled - diffusion models
- Comfy-Org/Wan_2.1_ComfyUI_repackaged - text encoders and model patches
- Kijai/wav2vec2_safetensors - audio encoders
- Kijai/WanVideo_comfy - VAE and LoRAs
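If you prefer scripted downloads to grabbing files in the browser, `huggingface_hub` can fetch each file into the matching ComfyUI models subdirectory. The repo-to-file mapping below mirrors the table above; be aware that some repos keep files in subfolders, so check each repo's file listing before running a real download:

```python
# Sketch: plan (and optionally run) the downloads from the table above.
# Filenames mirror the article's table; verify paths in each repo first.

MODELS = {
    "diffusion_models": [
        ("Kijai/WanVideo_comfy_fp8_scaled",
         "Wan2_1-I2V-14B-480p_fp8_e4m3fn_scaled_KJ.safetensors"),
    ],
    "text_encoders": [
        ("Comfy-Org/Wan_2.1_ComfyUI_repackaged",
         "umt5_xxl_fp8_e4m3fn_scaled.safetensors"),
    ],
    "model_patches": [
        ("Comfy-Org/Wan_2.1_ComfyUI_repackaged",
         "wan2.1_infiniteTalk_single_fp16.safetensors"),
        ("Comfy-Org/Wan_2.1_ComfyUI_repackaged",
         "wan2.1_infiniteTalk_multi_fp16.safetensors"),
    ],
    "audio_encoders": [
        ("Kijai/wav2vec2_safetensors",
         "wav2vec2-chinese-base_fp16.safetensors"),
    ],
    "vae": [
        ("Kijai/WanVideo_comfy", "Wan2_1_VAE_bf16.safetensors"),
    ],
    "loras": [
        ("Kijai/WanVideo_comfy",
         "lightx2v_I2V_14B_480p_cfg_step_distill_rank64_bf16.safetensors"),
    ],
}

def download_all(comfy_root: str, dry_run: bool = True) -> list[str]:
    """Return the target paths; actually download only when dry_run=False."""
    targets = []
    for subdir, files in MODELS.items():
        for repo_id, filename in files:
            targets.append(f"{comfy_root}/models/{subdir}/{filename}")
            if not dry_run:
                from huggingface_hub import hf_hub_download
                hf_hub_download(repo_id=repo_id, filename=filename,
                                local_dir=f"{comfy_root}/models/{subdir}")
    return targets

for path in download_all("ComfyUI", dry_run=True):
    print(path)
```

With `dry_run=True` this only prints the planned paths; flip it to `False` to pull the weights, which total tens of gigabytes.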
Inference is accelerated with the lightx2v distillation LoRA, which (going by the filename) is a rank-64 CFG step-distillation LoRA that cuts the number of sampling steps needed.
How it differs from MOVA and Vidu Q3
There are now several models that deal with both audio and video, but they point in different directions.
| Model | Input | Output | Use case |
|---|---|---|---|
| InfiniteTalk | Still image + audio | Lip-sync video | Lip-syncing to existing speech |
| MOVA | Text or image | Video + audio | Generating both video and audio together |
| Vidu Q3 | Text or image | Video + audio | Generating both video and audio together |
- InfiniteTalk: audio to video, specialized for lip sync
- MOVA / Vidu Q3: prompt to video plus audio, generated together
InfiniteTalk fits cases where the audio already exists and you want to make a character speak, such as narration, voice drama, dubbing, or VTuber-style talking clips.
MOVA and Vidu Q3 are better when you want to generate both the footage and the audio from scratch.
Impression
In my January overview of I2V models, I wrote that accurate synchronization between dialogue and mouth motion was still a weak point for video generation models. InfiniteTalk is a specialized answer to that problem, improving lip-sync accuracy by inserting an explicit audio-analysis stage.
Because it is based on Wan 2.1, it should run in a local environment, and its release as an official ComfyUI workflow lowers the barrier to adoption.
The actual generation quality and VRAM requirements still need hands-on testing, but if you want local audio-driven lip sync generation, this looks like one of the first options to try.