
PersonaPlex: NVIDIA’s Real-Time Full‑Duplex Voice Conversational Model

I looked into PersonaPlex‑7B‑v1, NVIDIA’s voice conversation model released on January 15, 2026.

What Is PersonaPlex?

A real‑time full‑duplex voice conversational AI—it can speak while it listens.

Conventional voice AIs followed turn‑based interactions (listen → process → speak). PersonaPlex supports interruptions and backchannels, enabling human‑like conversation dynamics.

Key Features

  • Dual streams: Listen to the user while simultaneously generating speech
  • Interruptions and backchannels: Reproduces natural conversational timing
  • Persona control: Specify role and personality via text prompts; control timbre via voice prompts
  • Low latency: Turn‑taking 0.170 s; interruption response 0.240 s
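Full duplex means the model consumes and produces audio in the same step, rather than waiting for the user to finish. The sketch below is purely schematic, not NVIDIA's implementation; every name and rule in it is illustrative, but it shows the basic dynamic: one incoming frame and one outgoing frame per step, with the model yielding the floor when the user talks over it.

```python
# Schematic full-duplex loop: one user frame in, one model frame out, every
# step. All names and the yield/take-turn rules are illustrative; none come
# from the PersonaPlex codebase.

def full_duplex_step(user_frame, state):
    """Consume one incoming frame, emit one outgoing frame."""
    state["heard"].append(user_frame)        # always listening, even mid-utterance
    if user_frame == "speech" and state["speaking"]:
        state["speaking"] = False            # user interrupted: yield the floor
        return "silence"
    if user_frame == "silence" and not state["speaking"]:
        state["speaking"] = True             # user went quiet: take the turn
    return "speech" if state["speaking"] else "silence"

state = {"heard": [], "speaking": False}
user = ["speech", "speech", "silence", "silence", "speech", "silence"]
out = [full_duplex_step(f, state) for f in user]
# The model stays silent while the user speaks, talks when the user stops,
# and immediately yields when interrupted at step 5.
```

A turn-based system would instead block on the whole user utterance before generating anything; the per-frame loop is what makes backchannels and sub-second interruption handling possible.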

Architecture

  • Parameters: 7B
  • Base model: Kyutai Moshi (Moshiko weights)
  • Codec: Mimi speech encoder/decoder
  • Sample rate: 24 kHz
  • Processing: Temporal Transformer + Depth Transformer

Mimi (a ConvNet + Transformer codec) tokenizes the audio; those tokens are then processed by the Moshi-derived stack of Temporal and Depth Transformers.
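The two-transformer split can be pictured at the shape level: the Temporal Transformer runs across time steps, and the Depth Transformer predicts the several codec tokens within each step. The sketch below is a stand-in with illustrative stand-in functions, not real model code, and the specific numbers (5 frames, 8 codebooks) are assumptions for the example.

```python
# Shape-level sketch of a Moshi-style two-level decoder: a Temporal
# Transformer across frames, a Depth Transformer across codebooks within a
# frame. Functions are stand-ins that only demonstrate the data flow.

FRAMES, CODEBOOKS = 5, 8   # illustrative sizes, not the model's real config

def temporal_transformer(frame_embeddings):
    # Stand-in: one context vector per time step
    # (the real model attends over all past frames).
    return [f"ctx[t={t}]" for t in range(len(frame_embeddings))]

def depth_transformer(context, n_codebooks):
    # Stand-in: autoregressively predicts the codebook tokens of one frame,
    # conditioned on that frame's temporal context.
    return [f"{context}:cb{k}" for k in range(n_codebooks)]

frames = [[0] * CODEBOOKS for _ in range(FRAMES)]     # previous-step tokens
contexts = temporal_transformer(frames)
tokens = [depth_transformer(c, CODEBOOKS) for c in contexts]
# tokens is FRAMES x CODEBOOKS: one codec token per (time step, codebook),
# which Mimi's decoder would turn back into 24 kHz audio.
```

The point of the split is cost: the expensive temporal model runs once per frame, while a much smaller depth model fills in the per-frame codebook tokens.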

Training Data and Performance

Training Data

Fisher English (Parts 1 & 2): About 7,300 conversations (≈10 minutes each), totaling under 10,000 hours.

Benchmarks (FullDuplexBench)

  • Smooth turn‑taking: success 0.908; latency 0.170 s
  • User interruption: success 0.950; latency 0.240 s
  • Voice similarity (SSIM): 0.650
  • Task adherence (GPT‑4o eval): 4.29/5.0

System Requirements

  • GPU: NVIDIA A100 / H100 (validated on A100 80GB)
  • OS: Linux
  • Runtime: PyTorch + CUDA

It does not run on Apple Silicon: CUDA is required, so M1/M2/M3/M4 Macs are unsupported. A future MLX port or a GGUF‑quantized variant might change that, but the Moshi architecture's audio‑token processing is difficult to support with standard LLM toolchains.
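Before downloading the weights, it is worth verifying the environment actually meets the Linux + CUDA requirement. This is a minimal pre-flight check of my own, not an official installer step; the `cuda_capable` helper name is made up for the example.

```python
# Minimal pre-flight check for a CUDA-only model such as PersonaPlex.
# Hypothetical helper; uses only the standard library plus an optional
# torch import.
import platform

def cuda_capable():
    """Return True only on Linux with a CUDA-enabled PyTorch installed."""
    if platform.system() != "Linux":      # macOS / Apple Silicon: no CUDA
        return False
    try:
        import torch
        return torch.cuda.is_available()  # False if torch was built CPU-only
    except ImportError:
        return False                      # PyTorch not installed at all
```

On an M-series Mac this returns False at the first check, which matches the support matrix above.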

License

NVIDIA Open Model License + CC‑BY‑4.0. Commercial use permitted.

Resources

I’ve written several other posts on voice AI on this blog.