InfiniteTalk: Audio-Driven Lip Sync Built on Wan 2.1
InfiniteTalk, an audio-driven lip-sync model built on Wan 2.1, has been published as an official ComfyUI workflow. Given a still image and an audio file, it can generate a video where the mouth moves in sync with the speech.
What InfiniteTalk is
It is a native ComfyUI workflow, published at amao2001/ganloss-latent-space and built on models repackaged by Comfy-Org.
Processing flow:
- Input a still image, typically a face image
- Input an audio file such as WAV
- Analyze the audio with wav2vec2 to extract mouth-movement timing
- Generate the video with Wan 2.1 I2V
- Output the finished clip
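The steps above can be sketched as a simple function pipeline. Note that the function names below are illustrative placeholders, not actual ComfyUI node names; the real workflow wires these stages together as graph nodes.

```python
# Illustrative sketch of the workflow's data flow. The real pipeline runs
# as a ComfyUI node graph; these functions are placeholders for its stages.

def extract_speech_features(wav_path: str) -> str:
    # wav2vec2 encodes the waveform into per-frame speech features,
    # which carry the timing of the mouth movements
    return f"wav2vec2({wav_path})"

def generate_lipsync_video(image_path: str, features: str) -> str:
    # Wan 2.1 I2V, patched with the InfiniteTalk weights, conditions
    # the generated frames on the audio features
    return f"wan21_i2v({image_path}, {features})"

def infinitetalk(image_path: str, wav_path: str) -> str:
    features = extract_speech_features(wav_path)
    return generate_lipsync_video(image_path, features)

print(infinitetalk("face.png", "speech.wav"))
```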
The “Infinite” in the name refers to its ability to handle long audio input. Note that this is offline batch generation, not real-time synthesis.
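A common way long-audio models stay within a fixed generation window is to split the feature sequence into overlapping chunks and generate each chunk in turn. The sketch below is a generic illustration of that idea, not InfiniteTalk's actual scheduling code, and the window and overlap sizes are made up:

```python
def chunk_windows(n_frames: int, window: int, overlap: int) -> list[tuple[int, int]]:
    """Split n_frames into overlapping (start, end) windows.

    Illustrative only: the real model's window/overlap sizes differ.
    """
    step = window - overlap
    chunks = []
    start = 0
    while True:
        end = min(start + window, n_frames)
        chunks.append((start, end))
        if end == n_frames:
            break
        start += step
    return chunks

# e.g. 250 feature frames, 81-frame windows, 16-frame overlap
print(chunk_windows(250, 81, 16))
```

Each window overlaps the previous one so that motion stays continuous across chunk boundaries; total length is then limited only by the audio, not by the model's context window.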
Workflow variants
Two versions are available:
- single: for one character
- multi: for multiple characters, with a mask marking each character's region
In the multi version, you open the first frame in the mask editor and paint each character area separately.
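Conceptually, each painted mask is just a binary map telling the model which pixels belong to which speaker, so each audio track only drives its own region. A toy illustration (in practice the masks are painted by hand in the mask editor, not generated like this):

```python
# Toy 8x8 "frame": each mask is 1 inside one character's region, 0 elsewhere.
W, H = 8, 8

def rect_mask(x0: int, y0: int, x1: int, y1: int) -> list[list[int]]:
    """Binary mask that is 1 inside the rectangle [x0, x1) x [y0, y1)."""
    return [[1 if x0 <= x < x1 and y0 <= y < y1 else 0
             for x in range(W)] for y in range(H)]

mask_a = rect_mask(0, 0, 4, 8)   # left-hand character
mask_b = rect_mask(4, 0, 8, 8)   # right-hand character

# Each audio track animates only the pixels where its mask is 1,
# so the two regions should not overlap.
overlap = sum(a & b for ra, rb in zip(mask_a, mask_b)
              for a, b in zip(ra, rb))
print(overlap)  # 0
```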
Required models
| Directory | Model |
|---|---|
| diffusion_models | Wan2_1-I2V-14B-480p_fp8_e4m3fn_scaled_KJ.safetensors |
| text_encoders | umt5_xxl_fp8_e4m3fn_scaled.safetensors |
| model_patches | wan2.1_infiniteTalk_single_fp16.safetensors |
| model_patches | wan2.1_infiniteTalk_multi_fp16.safetensors |
| audio_encoders | wav2vec2-chinese-base_fp16.safetensors |
| vae | Wan2_1_VAE_bf16.safetensors |
| loras | lightx2v_I2V_14B_480p_cfg_step_distill_rank64_bf16.safetensors |
The models are available from Hugging Face:
- Kijai/WanVideo_comfy_fp8_scaled - diffusion models
- Comfy-Org/Wan_2.1_ComfyUI_repackaged - text encoders and model patches
- Kijai/wav2vec2_safetensors - audio encoders
- Kijai/WanVideo_comfy - VAE and LoRAs
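If you prefer scripted downloads to grabbing files in the browser, `huggingface_hub` can fetch each file into the matching ComfyUI models subdirectory. The repo-to-file mapping below mirrors the table above; be aware that some repos keep files in subfolders, so check each repo's file listing before running a real download:

```python
# Sketch: plan (and optionally run) the downloads from the table above.
# Filenames mirror the article's table; verify paths in each repo first.

MODELS = {
    "diffusion_models": [
        ("Kijai/WanVideo_comfy_fp8_scaled",
         "Wan2_1-I2V-14B-480p_fp8_e4m3fn_scaled_KJ.safetensors"),
    ],
    "text_encoders": [
        ("Comfy-Org/Wan_2.1_ComfyUI_repackaged",
         "umt5_xxl_fp8_e4m3fn_scaled.safetensors"),
    ],
    "model_patches": [
        ("Comfy-Org/Wan_2.1_ComfyUI_repackaged",
         "wan2.1_infiniteTalk_single_fp16.safetensors"),
        ("Comfy-Org/Wan_2.1_ComfyUI_repackaged",
         "wan2.1_infiniteTalk_multi_fp16.safetensors"),
    ],
    "audio_encoders": [
        ("Kijai/wav2vec2_safetensors",
         "wav2vec2-chinese-base_fp16.safetensors"),
    ],
    "vae": [
        ("Kijai/WanVideo_comfy", "Wan2_1_VAE_bf16.safetensors"),
    ],
    "loras": [
        ("Kijai/WanVideo_comfy",
         "lightx2v_I2V_14B_480p_cfg_step_distill_rank64_bf16.safetensors"),
    ],
}

def download_all(comfy_root: str, dry_run: bool = True) -> list[str]:
    """Return the target paths; actually download only when dry_run=False."""
    targets = []
    for subdir, files in MODELS.items():
        for repo_id, filename in files:
            targets.append(f"{comfy_root}/models/{subdir}/{filename}")
            if not dry_run:
                from huggingface_hub import hf_hub_download
                hf_hub_download(repo_id=repo_id, filename=filename,
                                local_dir=f"{comfy_root}/models/{subdir}")
    return targets

for path in download_all("ComfyUI", dry_run=True):
    print(path)
```

With `dry_run=True` this only prints the planned paths; flip it to `False` to pull the weights, which total tens of gigabytes.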
Inference is accelerated with the lightx2v distillation LoRA, which (going by the filename) is a rank-64 CFG step-distillation LoRA that cuts the number of sampling steps needed.
How it differs from MOVA and Vidu Q3
There are now several models that deal with both audio and video, but they point in different directions.
| Model | Input | Output | Use case |
|---|---|---|---|
| InfiniteTalk | Still image + audio | Lip-sync video | Lip-syncing to existing speech |
| MOVA | Text or image | Video + audio | Generating both video and audio together |
| Vidu Q3 | Text or image | Video + audio | Generating both video and audio together |
- InfiniteTalk: audio to video, specialized for lip sync
- MOVA / Vidu Q3: prompt to video plus audio, generated together
InfiniteTalk fits cases where the audio already exists and you want to make a character speak, such as narration, voice drama, dubbing, or VTuber-style talking clips.
MOVA and Vidu Q3 are better when you want to generate both the footage and the audio from scratch.
Impression
In my January overview of I2V models, I wrote that accurate synchronization between dialogue and mouth motion was still a weak point for video generation models. InfiniteTalk is a specialized answer to that problem, improving lip-sync accuracy by inserting an explicit audio-analysis stage.
Because it is based on Wan 2.1, it should run in a local environment, and its release as an official ComfyUI workflow lowers the barrier to adoption.
The actual generation quality and VRAM requirements still need hands-on testing, but if you want local audio-driven lip sync generation, this looks like one of the first options to try.