PersonaPlex: NVIDIA’s Real-Time Full‑Duplex Voice Conversational Model
I looked into PersonaPlex‑7B‑v1, NVIDIA's real‑time voice conversational model released on January 15, 2026.
What Is PersonaPlex?
A real‑time full‑duplex voice conversational AI—it can speak while it listens.
Conventional voice assistants follow a turn‑based loop (listen → process → speak). PersonaPlex instead supports interruptions and backchannels (short "uh‑huh"‑style acknowledgments), enabling human‑like conversation dynamics.
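The difference is easiest to see at the frame level. The sketch below is a conceptual toy, not the PersonaPlex API: it contrasts a half‑duplex turn‑based loop with a full‑duplex loop in which input and output streams advance in lockstep.

```python
# Conceptual sketch only (not PersonaPlex code): half-duplex vs full-duplex.

def half_duplex(user_frames):
    """Listen to the entire utterance, then speak: one direction at a time."""
    log = [("in", f) for f in user_frames]   # listen phase
    log.append(("out", "reply"))             # speak only after the user stops
    return log

def full_duplex(user_frames):
    """Consume one input frame and emit one output frame on every step."""
    log = []
    for f in user_frames:
        # Both streams advance together, so the model can backchannel
        # ("uh-huh") or start its reply before the user finishes.
        out = "uh-huh" if f == "pause" else "silence"
        log.append((("in", f), ("out", out)))
    return log

print(half_duplex(["hi", "how are you"]))
print(full_duplex(["hi", "pause", "so"]))
```

In the full‑duplex loop there is no separate "processing" phase between listening and speaking, which is what makes interruptions and backchannels possible.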
Key Features
- Dual streams: Listen to the user while simultaneously generating speech
- Interruptions and backchannels: Reproduces natural conversational timing
- Persona control: Specify role and personality via text prompts; control timbre via voice prompts
- Low latency: Turn‑taking 0.170 s; interruption response 0.240 s
Architecture
| Item | Details |
|---|---|
| Parameters | 7B |
| Base model | Kyutai Moshi (Moshiko weights) |
| Codec | Mimi Speech Encoder/Decoder |
| Sample rate | 24 kHz |
| Processing | Temporal Transformer + Depth Transformer |
Following Kyutai's Moshi design, Mimi (a convolutional encoder/decoder combined with a Transformer) tokenizes the 24 kHz audio into discrete codes, which are then modeled by the Temporal Transformer (across time steps) and the Depth Transformer (across the codebooks within each step).
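The Temporal/Depth split can be pictured at the shape level: the temporal model advances once per audio frame, and a depth model then fills in that frame's residual codebooks one at a time. This is a toy sketch of the Moshi‑style decoding loop; the codebook count (8) is taken from the Mimi codec as described in the Moshi paper, not a PersonaPlex‑specific number, and the models here are placeholders.

```python
# Shape-level sketch of Moshi-style hierarchical decoding (toy models,
# assumed 8 Mimi codebooks -- illustrative, not PersonaPlex internals).

N_CODEBOOKS = 8

def decode_frame(history, temporal_model, depth_model):
    """Produce the stack of codec tokens for one audio frame."""
    context = temporal_model(history)     # one temporal pass per time step
    tokens = []
    for k in range(N_CODEBOOKS):          # coarse-to-fine residual codes
        tokens.append(depth_model(context, tokens, k))
    return tokens

# Toy stand-ins just to show the data flow:
toy_temporal = lambda hist: len(hist)              # "context" = frames seen so far
toy_depth = lambda ctx, prev, k: ctx * 100 + k     # deterministic dummy token

print(decode_frame([[0] * N_CODEBOOKS] * 2, toy_temporal, toy_depth))
```

The point of the hierarchy is that the expensive temporal model runs only once per frame, while the small depth model handles the per‑codebook predictions.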
Training Data and Performance
Training Data
Fisher English (Parts 1 & 2): about 7,300 telephone conversations of roughly 10 minutes each, i.e. on the order of 1,200 hours of dialog.
Benchmarks (FullDuplexBench)
| Metric | Score |
|---|---|
| Smooth turn‑taking | Success 0.908; latency 0.170 s |
| User interruption | Success 0.950; latency 0.240 s |
| Voice similarity (SSIM) | 0.650 |
| Task adherence (GPT‑4o eval) | 4.29/5.0 |
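To put the latency numbers in context, it helps to convert them into codec frames. The arithmetic below assumes Mimi's 12.5 Hz frame rate (80 ms of audio per frame), a figure from the Moshi/Mimi papers rather than a PersonaPlex spec.

```python
# Back-of-envelope: how many codec frames the reported latencies span,
# assuming Mimi's 12.5 Hz frame rate (80 ms per frame).

FRAME_SEC = 1 / 12.5  # 80 ms of audio per token frame (assumed)

def latency_in_frames(latency_sec):
    return latency_sec / FRAME_SEC

print(f"turn-taking:  {latency_in_frames(0.170):.1f} frames")
print(f"interruption: {latency_in_frames(0.240):.1f} frames")
```

Under that assumption, both reported latencies correspond to only two or three frames of audio, i.e. the model reacts almost as soon as the codec allows.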
System Requirements
- GPU: NVIDIA A100 / H100 (validated on A100 80GB)
- OS: Linux
- Runtime: PyTorch + CUDA
It does not run on Apple Silicon: CUDA is required, so M1/M2/M3/M4 Macs are unsupported. A future MLX port or a GGUF‑quantized variant might change that, but the Moshi‑style audio token processing is difficult to fit into standard LLM toolchains.
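As a quick sanity check before downloading the weights, you can verify that an NVIDIA driver is present at all. This is a generic environment check, not part of any PersonaPlex tooling:

```python
# Minimal pre-flight check for a CUDA-only model: is the NVIDIA driver
# CLI on PATH? On macOS / Apple Silicon this returns False.
import shutil

def has_nvidia_driver():
    return shutil.which("nvidia-smi") is not None

print(has_nvidia_driver())
```

A `True` here only confirms a driver is installed; GPU memory (the model was validated on an A100 80GB) still needs to be checked separately.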
License
NVIDIA Open Model License + CC‑BY‑4.0. Commercial use permitted.
Resources
Related Articles
I’ve written several other posts on voice AI on this blog.
- Pocket TTS — Lightweight CPU‑only Text‑to‑Speech: A TTS model from Kyutai Labs, the creators of Moshi (the base of PersonaPlex)
- Building an Environment to Talk with AI (1): Voice API Survey: A comparison of real‑time voice APIs such as Gemini Live API and OpenAI Realtime API
- Building an Environment to Talk with AI (3): Finally Talking: Notes on implementing voice dialog with Web Speech API + Gemini + VOICEVOX