
InfiniteTalk: Audio-Driven Lip Sync Built on Wan 2.1

Ikesan

InfiniteTalk, an audio-driven lip-sync model built on Wan 2.1, has been published as an official ComfyUI workflow. Given a still image and an audio file, it can generate a video where the mouth moves in sync with the speech.

What InfiniteTalk is

It is a native ComfyUI workflow, published at amao2001/ganloss-latent-space, that uses models repackaged by Comfy-Org.

Processing flow:

  1. Input a still image, typically a face image
  2. Input an audio file such as WAV
  3. Analyze the audio with wav2vec2 to extract mouth-movement timing
  4. Generate the video with Wan 2.1 I2V
  5. Output the finished clip

The “Infinite” part means it can handle long audio input. This is offline batch generation, not real-time synthesis.
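The audio-analysis step can be illustrated with a small sketch. InfiniteTalk's internal chunking logic isn't documented here, so the function below is a hypothetical illustration only: it maps a 16 kHz audio stream (wav2vec2's standard input rate) onto video-frame-sized sample windows at an assumed 25 fps output rate.

```python
def audio_to_frame_windows(num_samples, sample_rate=16000, fps=25):
    """Map raw audio samples to per-video-frame sample windows.

    sample_rate=16000 matches wav2vec2's expected input rate; fps=25 is
    an assumed output frame rate, not a documented InfiniteTalk constant.
    """
    samples_per_frame = sample_rate // fps  # 640 samples per frame at these defaults
    num_frames = num_samples // samples_per_frame
    return [(i * samples_per_frame, (i + 1) * samples_per_frame)
            for i in range(num_frames)]
```

Under these assumptions, ten seconds of 16 kHz audio (160,000 samples) yields 250 windows, one per output frame, which is what lets arbitrarily long audio be processed window by window rather than all at once.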

Workflow variants

Two versions are available:

  • single: for one character
  • multi: for multiple characters, with masks specifying each region

In the multi version, you open the first frame in the mask editor and paint each character area separately.
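In code terms, the multi workflow expects the painted regions to be disjoint, so each pixel drives at most one character. The helper below is a hypothetical sketch, not a ComfyUI node, that checks a list of binary masks for overlap:

```python
def masks_overlap(masks):
    """Return True if any pixel is claimed by more than one character mask.

    Each mask is a 2D list of 0/1 values; all masks share the same dimensions.
    """
    if not masks:
        return False
    height, width = len(masks[0]), len(masks[0][0])
    for y in range(height):
        for x in range(width):
            # A pixel painted in two or more masks would make the
            # character assignment ambiguous.
            if sum(mask[y][x] for mask in masks) > 1:
                return True
    return False
```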

Required models

| Directory | Model |
| --- | --- |
| diffusion_models | Wan2_1-I2V-14B-480p_fp8_e4m3fn_scaled_KJ.safetensors |
| text_encoders | umt5_xxl_fp8_e4m3fn_scaled.safetensors |
| model_patches | wan2.1_infiniteTalk_single_fp16.safetensors |
| model_patches | wan2.1_infiniteTalk_multi_fp16.safetensors |
| audio_encoders | wav2vec2-chinese-base_fp16.safetensors |
| vae | Wan2_1_VAE_bf16.safetensors |
| loras | lightx2v_I2V_14B_480p_cfg_step_distill_rank64_bf16.safetensors |
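Before running the workflow, it can help to confirm every file landed in the right place. The script below is a small convenience sketch, not part of the workflow itself; it assumes ComfyUI's standard `models/` directory layout under a root you supply:

```python
from pathlib import Path

# Expected files per ComfyUI models subdirectory, from the table above.
REQUIRED = {
    "diffusion_models": ["Wan2_1-I2V-14B-480p_fp8_e4m3fn_scaled_KJ.safetensors"],
    "text_encoders": ["umt5_xxl_fp8_e4m3fn_scaled.safetensors"],
    "model_patches": ["wan2.1_infiniteTalk_single_fp16.safetensors",
                      "wan2.1_infiniteTalk_multi_fp16.safetensors"],
    "audio_encoders": ["wav2vec2-chinese-base_fp16.safetensors"],
    "vae": ["Wan2_1_VAE_bf16.safetensors"],
    "loras": ["lightx2v_I2V_14B_480p_cfg_step_distill_rank64_bf16.safetensors"],
}

def missing_models(comfy_root):
    """Return required model files not found under <comfy_root>/models/."""
    root = Path(comfy_root) / "models"
    return [f"{sub}/{name}"
            for sub, names in REQUIRED.items()
            for name in names
            if not (root / sub / name).exists()]
```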

The models are available from Hugging Face.

Inference is accelerated with the lightx2v distillation LoRA.

How it differs from MOVA and Vidu Q3

There are now several models that deal with both audio and video, but they point in different directions.

| Model | Input | Output | Use case |
| --- | --- | --- | --- |
| InfiniteTalk | Still image + audio | Lip-sync video | Lip-syncing to existing speech |
| MOVA | Text or image | Video + audio | Generating both video and audio together |
| Vidu Q3 | Text or image | Video + audio | Generating both video and audio together |

  • InfiniteTalk: audio to video, specialized for lip sync
  • MOVA / Vidu Q3: prompt to video plus audio, generated together

InfiniteTalk fits cases where the audio already exists and you want to make a character speak, such as narration, voice drama, dubbing, or VTuber-style talking clips.

MOVA and Vidu Q3 are better when you want to generate both the footage and the audio from scratch.

Impression

In my January overview of i2v models, I wrote that accurate synchronization between dialogue and mouth motion was still a weak point for video generation models. InfiniteTalk is a specialized answer to that problem, improving lip-sync accuracy by inserting an explicit audio-analysis stage.

Because it is based on Wan 2.1, it should also be runnable in a local environment. The fact that it has been packaged as an official ComfyUI workflow also lowers the barrier to adoption.

The actual generation quality and VRAM requirements still need hands-on testing, but if you want local audio-driven lip sync generation, this looks like one of the first options to try.