MOVA: the first open-source model that generates video and audio together
Closed video-generation models such as Sora 2, Veo 3, Kling, and Google Flow are increasingly able to produce output with audio. Synchronization quality is uneven, though: you still see examples where lip sync is off or ambient sound drifts out of time with the picture.
On the open-source side, models like Wan 2.x and HunyuanVideo are excellent, but they still require audio to be generated separately and then composited later.
MOVA (MOSS Video and Audio) from the OpenMOSS team is the first open-source model that generates video and audio together in a single inference pass. It finally makes local video-plus-audio generation possible without depending on closed models.
Overview
- Developer: OpenMOSS team (Shanghai Innovation Institute / Fudan University / MOSI Intelligence)
- License: Apache 2.0
- Models: 720p and 360p versions are available
- Tasks: Text-to-Video-Audio (T2VA), Image-to-Video-Audio (IT2VA)
- Output: up to 720p, 8 seconds
The weights and code are available on Hugging Face.
Architecture
MOVA uses an asymmetric dual-tower design. A pre-trained video tower and a pre-trained audio tower are fused with bidirectional cross-attention.
- Total parameters: 32B
- Active at inference: 18B (Mixture-of-Experts)
The MoE setup follows the same general idea as Wan 2.2: keep inference cost under control while maintaining quality. Unlike a cascaded pipeline that generates video and audio separately and then merges them, MOVA generates both in one pass, which avoids error accumulation.
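The bidirectional cross-attention fusion described above can be sketched in toy form: each tower's tokens attend over the other tower's tokens, so video features see audio context and vice versa. This is a minimal numpy illustration with hypothetical token counts and hidden size, not actual MOVA code (which also uses separate key/value projections, multiple heads, and learned weights).

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context, d):
    # Scaled dot-product attention: one tower's tokens (queries)
    # attend over the other tower's tokens (context).
    # Simplification: keys and values share the context matrix.
    scores = queries @ context.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ context

d = 64                             # hypothetical hidden size
video = np.random.randn(120, d)    # e.g. 120 video tokens
audio = np.random.randn(200, d)    # e.g. 200 audio tokens

# Bidirectional: video attends to audio, audio attends to video,
# each with a residual connection back to its own tower.
video_fused = video + cross_attention(video, audio, d)
audio_fused = audio + cross_attention(audio, video, d)

print(video_fused.shape, audio_fused.shape)  # (120, 64) (200, 64)
```

Because both streams are fused at every step inside one denoising pass, timing errors cannot accumulate between separate generation stages the way they do in a cascaded pipeline.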
Notable features
- Multilingual lip sync: claims state-of-the-art synchronization of mouth movement and dialogue
- Ambient sound effects: automatically generates sound effects based on the scene
- LoRA fine-tuning: training scripts are also released, so the model can be adapted for specific use cases
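The appeal of LoRA fine-tuning at this scale is the parameter math: instead of updating a full weight matrix, you train a low-rank update on top of the frozen base. A numpy sketch with hypothetical layer sizes (the released training scripts will define the real ranks and target layers):

```python
import numpy as np

d_out, d_in, r = 1024, 1024, 16    # hypothetical layer size and LoRA rank
alpha = 32.0                       # LoRA scaling factor

W = np.random.randn(d_out, d_in) * 0.02   # frozen pretrained weight
A = np.random.randn(r, d_in) * 0.01       # trainable down-projection
B = np.zeros((d_out, r))                  # trainable up-projection (zero init)

# Effective weight: low-rank update added onto the frozen base.
# With B initialized to zero, training starts exactly at the base model.
W_eff = W + (alpha / r) * (B @ A)

# Trainable parameters shrink from d_out*d_in to r*(d_in + d_out).
full = d_out * d_in
lora = r * (d_in + d_out)
print(f"trainable fraction: {lora / full:.1%}")  # 3.1%
```

At rank 16 on a 1024x1024 layer, you train about 3% of the layer's parameters, which is what makes adapting a 32B model feasible on consumer hardware.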
Comparison with Vidu Q3
Vidu Q3, announced by ShengShu Technology in January 2026, also supports simultaneous video and audio generation. Here is a comparison.
| Item | MOVA-720p | Vidu Q3 |
|---|---|---|
| License | Apache 2.0 | Closed (API) |
| Parameters | 32B (18B active) | Not disclosed |
| Max duration | 8 seconds | 16 seconds |
| Resolution | 720p | Not disclosed |
| Local execution | Yes | No |
| Fine-tuning | LoRA supported | No |
Vidu Q3 has the longer duration and scored highly on the Artificial Analysis benchmark, ranking first in China and second globally. The downside is that it is API-only and cannot run locally.
MOVA loses on duration, but it stands out because it can run entirely locally and can be fine-tuned. That makes experimentation much easier when you care about cost and privacy.
Where it fits locally
At present, local video models such as Wan 2.x, LTX-2, and HunyuanVideo do not generate audio. You have to produce the audio separately with TTS or Foley models and then mux everything together with ffmpeg.
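That two-step workflow typically ends in an ffmpeg mux like the one sketched below: stream-copy the silent video, encode the separate audio track to AAC, and cut at the shorter stream. The filenames are placeholders; the command itself is standard ffmpeg usage.

```python
import subprocess

def build_mux_cmd(video_path: str, audio_path: str, out_path: str) -> list[str]:
    """Build an ffmpeg command that muxes a silent video with a
    separately generated audio track."""
    return [
        "ffmpeg", "-y",
        "-i", video_path,   # silent clip from the video model
        "-i", audio_path,   # TTS / Foley output
        "-c:v", "copy",     # no video re-encode
        "-c:a", "aac",      # encode audio to AAC for MP4
        "-shortest",        # stop at the shorter of the two streams
        out_path,
    ]

cmd = build_mux_cmd("clip.mp4", "speech.wav", "combined.mp4")
# subprocess.run(cmd, check=True)  # uncomment to actually run ffmpeg
print(" ".join(cmd))
```

The `-shortest` flag matters in practice: separately generated audio rarely matches the clip length exactly, and without it ffmpeg pads the output to the longer stream.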
MOVA is the first open-source model that fills that gap.
That said, the VRAM requirements are not yet verified. Given the 32B scale with 18B active parameters, it will likely be heavier than Wan 2.x and closer to Open-Sora 2.0 in memory requirements. Whether it can run on an RTX 4090 with 24GB is still an open question.
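A back-of-envelope estimate shows why the 24GB question is tight. These numbers assume bf16 weights and count weights only (activations, the KV cache, and the VAE add more on top); note also that with MoE, all 32B parameters usually have to be resident even though only 18B are active per step.

```python
def weight_vram_gb(params_billion: float, bytes_per_param: float) -> float:
    # Memory to hold the weights alone; runtime overhead comes on top.
    return params_billion * 1e9 * bytes_per_param / 2**30

total = weight_vram_gb(32, 2)    # all 32B params in bf16
active = weight_vram_gb(18, 2)   # the 18B active params in bf16
q4 = weight_vram_gb(32, 0.5)     # full model at 4-bit quantization
print(f"{total:.0f} GB, {active:.0f} GB, {q4:.0f} GB")  # 60 GB, 34 GB, 15 GB
```

So even the active subset alone exceeds 24GB at bf16; fitting an RTX 4090 would likely require aggressive quantization or CPU offloading, as it did for other large local video models.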
Take
Vidu Q3 already proved that video and audio can be generated together, but it was closed, so it was not something you could easily test locally. With MOVA released under Apache 2.0, experiments with simultaneous video and audio generation are now possible on your own machine.
The 8-second duration is short, but it is still enough for short clips and social-media material. If LoRA can tune character identity and voice quality, the use cases will grow.
The big unknowns are the VRAM requirement and actual generation quality. I would like to try it once ComfyUI nodes appear.