MOVA: the first open-source model that generates video and audio together
Closed video-generation models such as Sora 2, Veo 3, Kling, and Google Flow are increasingly able to produce output with audio. Synchronization quality is uneven, though: you still see examples where lip sync is off or ambient sound drifts out of time with the picture.
On the open-source side, models like Wan 2.x and HunyuanVideo are excellent, but they still require audio to be generated separately and then composited later.
MOVA (MOSS Video and Audio) from the OpenMOSS team is the first open-source model that generates video and audio together in a single inference pass. It finally makes local video-plus-audio generation possible without depending on closed models.
Overview
- Developer: OpenMOSS team (Shanghai Innovation Institute / Fudan University / MOSI Intelligence)
- License: Apache 2.0
- Models: 720p and 360p versions are available
- Tasks: Text-to-Video-Audio (T2VA), Image-to-Video-Audio (IT2VA)
- Output: up to 720p, 8 seconds
The weights and code are available on Hugging Face.
Architecture
MOVA uses an asymmetric dual-tower design. A pre-trained video tower and a pre-trained audio tower are fused with bidirectional cross-attention.
- Total parameters: 32B
- Active at inference: 18B (Mixture-of-Experts)
The MoE setup follows the same general idea as Wan 2.2: keep inference cost under control while maintaining quality. Unlike a cascaded pipeline that generates video and audio separately and then merges them, MOVA generates both in one pass, which avoids error accumulation.
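The bidirectional cross-attention fusion described above can be sketched in toy form: each tower's tokens attend over the other tower's tokens, so video features see audio context and vice versa. This is a minimal numpy illustration with hypothetical token counts and hidden size, not actual MOVA code (which also uses separate key/value projections, multiple heads, and learned weights).

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context, d):
    # Scaled dot-product attention: one tower's tokens (queries)
    # attend over the other tower's tokens (context).
    # Simplification: keys and values share the context matrix.
    scores = queries @ context.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ context

d = 64                             # hypothetical hidden size
video = np.random.randn(120, d)    # e.g. 120 video tokens
audio = np.random.randn(200, d)    # e.g. 200 audio tokens

# Bidirectional: video attends to audio, audio attends to video,
# each with a residual connection back to its own tower.
video_fused = video + cross_attention(video, audio, d)
audio_fused = audio + cross_attention(audio, video, d)

print(video_fused.shape, audio_fused.shape)  # (120, 64) (200, 64)
```

Because both streams are fused at every step inside one denoising pass, timing errors cannot accumulate between separate generation stages the way they do in a cascaded pipeline.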
Notable features
- Multilingual lip sync: claims state-of-the-art synchronization of mouth movement and dialogue
- Ambient sound effects: automatically generates sound effects based on the scene
- LoRA fine-tuning: training scripts are also released, so the model can be adapted for specific use cases
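The appeal of LoRA fine-tuning at this scale is the parameter math: instead of updating a full weight matrix, you train a low-rank update on top of the frozen base. A numpy sketch with hypothetical layer sizes (the released training scripts will define the real ranks and target layers):

```python
import numpy as np

d_out, d_in, r = 1024, 1024, 16    # hypothetical layer size and LoRA rank
alpha = 32.0                       # LoRA scaling factor

W = np.random.randn(d_out, d_in) * 0.02   # frozen pretrained weight
A = np.random.randn(r, d_in) * 0.01       # trainable down-projection
B = np.zeros((d_out, r))                  # trainable up-projection (zero init)

# Effective weight: low-rank update added onto the frozen base.
# With B initialized to zero, training starts exactly at the base model.
W_eff = W + (alpha / r) * (B @ A)

# Trainable parameters shrink from d_out*d_in to r*(d_in + d_out).
full = d_out * d_in
lora = r * (d_in + d_out)
print(f"trainable fraction: {lora / full:.1%}")  # 3.1%
```

At rank 16 on a 1024x1024 layer, you train about 3% of the layer's parameters, which is what makes adapting a 32B model feasible on consumer hardware.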
Comparison with Vidu Q3
Vidu Q3, announced by ShengShu Technology in January 2026, also supports simultaneous video and audio generation. Here is a comparison.
| Item | MOVA-720p | Vidu Q3 |
|---|---|---|
| License | Apache 2.0 | Closed (API) |
| Parameters | 32B (18B active) | Not disclosed |
| Max duration | 8 seconds | 16 seconds |
| Resolution | 720p | Not disclosed |
| Local execution | Yes | No |
| Fine-tuning | LoRA supported | No |
Vidu Q3 has the longer duration and scored highly on the Artificial Analysis benchmark, ranking first in China and second globally. The downside is that it is API-only and cannot run locally.
MOVA loses on duration, but it stands out because it can run entirely locally and can be fine-tuned. That makes experimentation much easier when you care about cost and privacy.
Where it fits locally
At present, local video models such as Wan 2.x, LTX-2, and HunyuanVideo do not generate audio. You have to produce the audio separately with TTS or Foley models and then mux everything together with ffmpeg.
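That two-step workflow typically ends in an ffmpeg mux like the one sketched below: stream-copy the silent video, encode the separate audio track to AAC, and cut at the shorter stream. The filenames are placeholders; the command itself is standard ffmpeg usage.

```python
import subprocess

def build_mux_cmd(video_path: str, audio_path: str, out_path: str) -> list[str]:
    """Build an ffmpeg command that muxes a silent video with a
    separately generated audio track."""
    return [
        "ffmpeg", "-y",
        "-i", video_path,   # silent clip from the video model
        "-i", audio_path,   # TTS / Foley output
        "-c:v", "copy",     # no video re-encode
        "-c:a", "aac",      # encode audio to AAC for MP4
        "-shortest",        # stop at the shorter of the two streams
        out_path,
    ]

cmd = build_mux_cmd("clip.mp4", "speech.wav", "combined.mp4")
# subprocess.run(cmd, check=True)  # uncomment to actually run ffmpeg
print(" ".join(cmd))
```

The `-shortest` flag matters in practice: separately generated audio rarely matches the clip length exactly, and without it ffmpeg pads the output to the longer stream.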
MOVA is the first open-source model that fills that gap.
That said, the VRAM requirements are not yet verified. Given the 32B scale with 18B active parameters, it will likely be heavier than Wan 2.x and closer to Open-Sora 2.0 in memory requirements. Whether it can run on an RTX 4090 with 24GB is still an open question.
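A back-of-envelope estimate shows why the 24GB question is tight. These numbers assume bf16 weights and count weights only (activations, the KV cache, and the VAE add more on top); note also that with MoE, all 32B parameters usually have to be resident even though only 18B are active per step.

```python
def weight_vram_gb(params_billion: float, bytes_per_param: float) -> float:
    # Memory to hold the weights alone; runtime overhead comes on top.
    return params_billion * 1e9 * bytes_per_param / 2**30

total = weight_vram_gb(32, 2)    # all 32B params in bf16
active = weight_vram_gb(18, 2)   # the 18B active params in bf16
q4 = weight_vram_gb(32, 0.5)     # full model at 4-bit quantization
print(f"{total:.0f} GB, {active:.0f} GB, {q4:.0f} GB")  # 60 GB, 34 GB, 15 GB
```

So even the active subset alone exceeds 24GB at bf16; fitting an RTX 4090 would likely require aggressive quantization or CPU offloading, as it did for other large local video models.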
Take
Vidu Q3 already proved that video and audio can be generated together, but it was closed, so it was not something you could easily test locally. With MOVA released under Apache 2.0, experiments with simultaneous video and audio generation are now possible on your own machine.
The 8-second duration is short, but it is still enough for short clips and social-media material. If LoRA can tune character identity and voice quality, the use cases will grow.
The big unknowns are the VRAM requirement and actual generation quality. I would like to try it once ComfyUI nodes appear.