
PersonaPlex: NVIDIA’s Real-Time Full‑Duplex Voice Conversational Model

I looked into PersonaPlex‑7B‑v1, NVIDIA’s voice conversation model released on January 15, 2026.

What Is PersonaPlex?

A real‑time full‑duplex voice conversational AI—it can speak while it listens.

Conventional voice AIs followed turn‑based interactions (listen → process → speak). PersonaPlex supports interruptions and backchannels, enabling human‑like conversation dynamics.

Key Features

  • Dual streams: Listen to the user while simultaneously generating speech
  • Interruptions and backchannels: Reproduces natural conversational timing
  • Persona control: Specify role and personality via text prompts; control timbre via voice prompts
  • Low latency: Turn‑taking 0.170 s; interruption response 0.240 s
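Full duplex means the model consumes and produces audio in the same step, rather than waiting for the user to finish. The sketch below is purely schematic, not NVIDIA's implementation; every name and rule in it is illustrative, but it shows the basic dynamic: one incoming frame and one outgoing frame per step, with the model yielding the floor when the user talks over it.

```python
# Schematic full-duplex loop: one user frame in, one model frame out, every
# step. All names and the yield/take-turn rules are illustrative; none come
# from the PersonaPlex codebase.

def full_duplex_step(user_frame, state):
    """Consume one incoming frame, emit one outgoing frame."""
    state["heard"].append(user_frame)        # always listening, even mid-utterance
    if user_frame == "speech" and state["speaking"]:
        state["speaking"] = False            # user interrupted: yield the floor
        return "silence"
    if user_frame == "silence" and not state["speaking"]:
        state["speaking"] = True             # user went quiet: take the turn
    return "speech" if state["speaking"] else "silence"

state = {"heard": [], "speaking": False}
user = ["speech", "speech", "silence", "silence", "speech", "silence"]
out = [full_duplex_step(f, state) for f in user]
# The model stays silent while the user speaks, talks when the user stops,
# and immediately yields when interrupted at step 5.
```

A turn-based system would instead block on the whole user utterance before generating anything; the per-frame loop is what makes backchannels and sub-second interruption handling possible.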

Architecture

  • Parameters: 7B
  • Base model: Kyutai Moshi (Moshiko weights)
  • Codec: Mimi speech encoder/decoder
  • Sample rate: 24 kHz
  • Processing: Temporal Transformer + Depth Transformer

Mimi (a ConvNet + Transformer codec) tokenizes the audio; those tokens are then processed by the Moshi-derived stack of Temporal and Depth Transformers.
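The two-transformer split can be pictured at the shape level: the Temporal Transformer runs across time steps, and the Depth Transformer predicts the several codec tokens within each step. The sketch below is a stand-in with illustrative stand-in functions, not real model code, and the specific numbers (5 frames, 8 codebooks) are assumptions for the example.

```python
# Shape-level sketch of a Moshi-style two-level decoder: a Temporal
# Transformer across frames, a Depth Transformer across codebooks within a
# frame. Functions are stand-ins that only demonstrate the data flow.

FRAMES, CODEBOOKS = 5, 8   # illustrative sizes, not the model's real config

def temporal_transformer(frame_embeddings):
    # Stand-in: one context vector per time step
    # (the real model attends over all past frames).
    return [f"ctx[t={t}]" for t in range(len(frame_embeddings))]

def depth_transformer(context, n_codebooks):
    # Stand-in: autoregressively predicts the codebook tokens of one frame,
    # conditioned on that frame's temporal context.
    return [f"{context}:cb{k}" for k in range(n_codebooks)]

frames = [[0] * CODEBOOKS for _ in range(FRAMES)]     # previous-step tokens
contexts = temporal_transformer(frames)
tokens = [depth_transformer(c, CODEBOOKS) for c in contexts]
# tokens is FRAMES x CODEBOOKS: one codec token per (time step, codebook),
# which Mimi's decoder would turn back into 24 kHz audio.
```

The point of the split is cost: the expensive temporal model runs once per frame, while a much smaller depth model fills in the per-frame codebook tokens.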

Training Data and Performance

Training Data

Fisher English (Parts 1 & 2): About 7,300 conversations (≈10 minutes each), totaling under 10,000 hours.

Benchmarks (FullDuplexBench)

  • Smooth turn‑taking: success 0.908; latency 0.170 s
  • User interruption: success 0.950; latency 0.240 s
  • Voice similarity (SSIM): 0.650
  • Task adherence (GPT‑4o eval): 4.29/5.0

System Requirements

  • GPU: NVIDIA A100 / H100 (validated on A100 80GB)
  • OS: Linux
  • Runtime: PyTorch + CUDA

It does not run on Apple Silicon: CUDA is required, so M1/M2/M3/M4 Macs are unsupported. A future MLX port or a GGUF‑quantized variant might change that, but the Moshi architecture's audio‑token processing is difficult to support with standard LLM toolchains.
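Before downloading the weights, it is worth verifying the environment actually meets the Linux + CUDA requirement. This is a minimal pre-flight check of my own, not an official installer step; the `cuda_capable` helper name is made up for the example.

```python
# Minimal pre-flight check for a CUDA-only model such as PersonaPlex.
# Hypothetical helper; uses only the standard library plus an optional
# torch import.
import platform

def cuda_capable():
    """Return True only on Linux with a CUDA-enabled PyTorch installed."""
    if platform.system() != "Linux":      # macOS / Apple Silicon: no CUDA
        return False
    try:
        import torch
        return torch.cuda.is_available()  # False if torch was built CPU-only
    except ImportError:
        return False                      # PyTorch not installed at all
```

On an M-series Mac this returns False at the first check, which matches the support matrix above.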

License

NVIDIA Open Model License + CC‑BY‑4.0. Commercial use permitted.

Resources

I’ve written several other posts on voice AI on this blog.