Building a Voice-Chat AI (1): Voice API Survey

Asking Alexa “What’s the weather today?” is convenient, but it feels a bit sterile. I want to have natural conversations with an AI that has personality. Ultimately, I’d like to build an AI that talks while driving a 3D/2D avatar.

As a first step toward that goal, I looked into the available voice APIs.

Final Goal

  • An AI with a distinct character (persona defined via prompts)
  • Integration with a 3D/2D avatar (Live2D? VRM?)
  • Natural voice conversation

In this post, I compare the APIs that will serve as the foundation for the “voice” part.

Scope

Broadly, there are three categories:

  1. Real-time voice chat APIs — Gemini Live, OpenAI Realtime, etc.
  2. App-level voice modes — ChatGPT, Claude, Copilot, etc.
  3. Standalone STT/TTS services — Google Cloud TTS, VOICEVOX, Superwhisper, etc.

Real-Time Voice Chat APIs

These APIs let you talk to an AI directly, with STT, LLM, and TTS integrated into one service.

Pricing Comparison

Service             | Per minute     | Cost at 1 hour/month | Free tier
Gemini Live API     | ~$0.02 (≈¥3)   | ~$1.2                | Yes
OpenAI Realtime API | ~$0.36 (≈¥54)  | ~$22                 | No

At these rates, OpenAI is roughly 16× pricier than Gemini (about $0.36 vs. about $0.0225 per minute). That’s tough for hobby use.

Gemini Live API

  • Audio input: about $0.0045/min
  • Audio output: about $0.018/min
  • Total: about $0.0225/min
  • Free tier available (reduced in December 2025)
  • RPD (requests per day) limited to 20–250

Excellent value for the price, though the free tier limits are tight.

OpenAI Realtime API

  • Audio input: $40/1M tokens (about $0.12/min)
  • Audio output: $80/1M tokens (about $0.24/min)
  • Total: about $0.36/min
  • WebSocket-based, 250–300 ms latency
  • GPT-4o-realtime-preview model

Quality is high, but the price is steep.
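
To make the comparison concrete, here is a minimal sketch that recomputes the totals and the price ratio from the per-minute figures quoted above. The function names and the assumption that every minute is billed at the full input plus output rate are mine; real bills depend on how much of each minute is actually speech.

```python
# Per-minute audio prices in USD, copied from the figures quoted in this post.
RATES_USD_PER_MIN = {
    "Gemini Live API": {"input": 0.0045, "output": 0.018},
    "OpenAI Realtime API": {"input": 0.12, "output": 0.24},
}


def total_per_minute(service: str) -> float:
    """Input + output price for one minute of conversation."""
    rate = RATES_USD_PER_MIN[service]
    return rate["input"] + rate["output"]


def monthly_cost(service: str, minutes_per_month: float) -> float:
    """Simplified estimate: every minute is billed at the full combined rate."""
    return total_per_minute(service) * minutes_per_month


def price_ratio(expensive: str, cheap: str) -> float:
    """How many times pricier one service is than another, per minute."""
    return total_per_minute(expensive) / total_per_minute(cheap)


if __name__ == "__main__":
    for name in RATES_USD_PER_MIN:
        print(f"{name}: ${total_per_minute(name):.4f}/min, "
              f"${monthly_cost(name, 60):.2f} for 60 min/month")
    ratio = price_ratio("OpenAI Realtime API", "Gemini Live API")
    print(f"OpenAI / Gemini ratio: {ratio:.1f}x")
```

Changing the `60` to your own expected minutes per month gives a quick feel for where hobby use stops being affordable.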

App Voice Modes

Each vendor’s app has a voice chat feature. It runs inside the app rather than via your own API calls.

ChatGPT Voice Mode

Plan | Monthly | Voice mode
Free | $0      | Standard (GPT-4o mini), daily limits
Plus | $20     | Advanced Voice Mode
Pro  | $200    | Unlimited

  • Available on desktop browser, mobile apps, and desktop app
  • Supports 50 languages including Japanese
  • Response time 2–3 seconds
  • Interruptible (more natural conversation)

If you want a flat-rate, hassle-free option, ChatGPT Plus is the practical choice.

Claude Voice Mode

  • Available on iOS/Android apps
  • Available even on the free plan (since June 2025)
  • Currently English-only (no Japanese yet)
  • Five voices to choose from
  • Latency 300–360 ms

The lack of Japanese is unfortunate; hoping that changes.

Microsoft Copilot Voice

  • Free to use
  • Available on Windows/Mac/iOS/Android
  • Hands-free wake word: “Hey Copilot”
  • Japanese support is rolling out (accuracy still has room to improve)
  • Strong integration with Microsoft 365

Nice that it’s free, but Japanese support isn’t fully there yet.

TTS (Text-to-Speech) Services

Services that convert text into speech—useful for speaking an LLM’s output.

Google Cloud Text-to-Speech

Voice type | Free tier      | Paid
Standard   | 4M chars/month | Low cost
WaveNet    | 1M chars/month | $16/1M chars
Neural2    | 1M chars/month | Same as above

  • Very generous free tier (4M chars/month is a lot)
  • Full Japanese support; multiple male and female voices
  • WaveNet/Neural2 produce natural speech

The large free tier is compelling. Quality is good enough for business use.
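
As a rough way to check whether a chatbot would stay inside the free tier, the sketch below estimates monthly character volume and the resulting cost. The free-tier size and the $16 per 1M characters come from the table above; the usage numbers (replies per day, characters per reply) are hypothetical placeholders, and the function names are my own.

```python
def chars_per_month(replies_per_day: int, avg_chars_per_reply: int, days: int = 30) -> int:
    """Estimate how many characters are sent to TTS in a month."""
    return replies_per_day * avg_chars_per_reply * days


def monthly_tts_cost(chars: int, free_chars: int, usd_per_million: float) -> float:
    """Only characters beyond the free tier are billed."""
    billable = max(0, chars - free_chars)
    return billable / 1_000_000 * usd_per_million


if __name__ == "__main__":
    # Hypothetical usage: 200 spoken replies a day, about 80 characters each.
    usage = chars_per_month(replies_per_day=200, avg_chars_per_reply=80)
    # WaveNet figures from the table: 1M free chars/month, then $16 per 1M chars.
    cost = monthly_tts_cost(usage, free_chars=1_000_000, usd_per_million=16.0)
    print(f"{usage:,} chars/month -> ${cost:.2f}")
```

With numbers in that range, a hobby project would likely never leave the free tier, even with the smaller WaveNet allowance.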

Local TTS

Options that run locally without relying on the cloud.

VOICEVOX

  • Open-source software from Japan
  • Completely free; commercial use allowed (credit required)
  • No GPU required; easy to set up
  • 30+ character voices
  • Full Japanese support

Top pick if you want character voices.
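
For a sense of how VOICEVOX could slot into a pipeline, here is a minimal sketch that talks to the VOICEVOX engine over its local HTTP interface using a two-step flow (`/audio_query`, then `/synthesis`). This post did not test it: the engine address below is the usual default and is an assumption, as is the speaker ID `1`, and the engine must already be running for `synthesize` to work.

```python
import urllib.parse
import urllib.request

# Assumption: a VOICEVOX engine is running locally on its usual default port.
ENGINE_URL = "http://127.0.0.1:50021"


def build_url(path: str, **params) -> str:
    """Build an engine URL with URL-encoded query parameters."""
    return f"{ENGINE_URL}{path}?{urllib.parse.urlencode(params)}"


def synthesize(text: str, speaker: int = 1) -> bytes:
    """Return WAV bytes for `text`. The speaker ID is a placeholder."""
    # Step 1: ask the engine for a synthesis query (accent, pitch, timing).
    query_req = urllib.request.Request(
        build_url("/audio_query", text=text, speaker=speaker),
        data=b"",
        method="POST",
    )
    with urllib.request.urlopen(query_req) as resp:
        query_json = resp.read()

    # Step 2: send that query back as JSON to get the audio.
    synth_req = urllib.request.Request(
        build_url("/synthesis", speaker=speaker),
        data=query_json,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(synth_req) as resp:
        return resp.read()


if __name__ == "__main__":
    wav = synthesize("こんにちは")
    with open("hello.wav", "wb") as f:
        f.write(wav)
```

Because the intermediate query is plain JSON, it could in principle be adjusted (for example, speaking speed) before step 2, which is one reason a text → audio pipeline like this is easy to hook other components into.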

Style-Bert-VITS2

  • Excellent emotional expression
  • Often rated more natural than VOICEVOX
  • No GPU needed for inference (a GPU is recommended only for training)
  • Check the license (AGPLv3)

Choose this if you want richer emotion.

Coqui TTS (XTTS-v2)

  • Supports 17 languages
  • Can clone a voice from a 6-second sample
  • Apache 2.0 license
  • GPU recommended

Good when you want to reproduce a specific voice.

Piper TTS

  • Ultra-lightweight (runs even on Raspberry Pi)
  • No GPU required
  • Limited stock voices

Suited for embedded use.

VOICEPEAK

  • Paid TTS from AHS (one-time purchase around ¥20,000+)
  • CLI version available; automatable
  • Good audio quality
  • Slow processing (about 9 seconds even for short text)
  • Single concurrent instance only

A viable option if you pre-generate for quality, but not suitable for real-time conversation.

STT (Speech-to-Text) Services

Services that convert audio into text.

Cloud STT

Service               | Feature                     | Cost
OpenAI Whisper API    | High accuracy, multilingual | $0.006/min
Google Speech-to-Text | High accuracy               | From $0.006/min
Web Speech API        | Built into browsers         | Free

Local STT + AI Refinement Tools

Apps specialized for voice input. They don’t just transcribe; they also use AI to clean up the text into natural sentences.

Superwhisper

  • macOS/Windows/iOS
  • One-time purchase available (not subscription-only)
  • Works offline (better for privacy)
  • Supports Japanese (100+ languages)
  • Filler removal, auto punctuation, grammar cleanup

One-time license + offline support are attractive.

Wispr Flow

  • macOS/Windows/iOS (Android coming soon)
  • $15/month (free tier: 2,000 words/week)
  • Great with mixed languages (handles Japanese/English mixed speech)
  • Supports Japanese (100+ languages)
  • Filler removal, smart formatting

A good choice if you often mix Japanese and English when speaking.

Architecture Patterns

Based on the findings, here are several possible setups.

Pattern 1: Easy Flat-Rate

ChatGPT Plus ($20/month)

  • Voice chat lives entirely inside the app
  • Supports Japanese; good quality
  • Hard to integrate with an avatar (no API access)

If you just want an easy “talk with AI” experience, this is sufficient.

Pattern 2: Cost-Efficient Cloud

  • STT: Web Speech API (free) or Google STT
  • LLM: Claude / GPT-4o / Gemini (take your pick)
  • TTS: Google Cloud TTS (4M chars/month free)

You can go pretty far by combining free tiers.
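
The appeal of this pattern is that each stage is swappable. A minimal sketch of that idea is shown below: one conversational turn is just three functions chained together. The type aliases, `make_voice_turn`, and the fake stages are all hypothetical names for illustration; in a real setup each stand-in would be replaced by a call to the chosen STT, LLM, or TTS service.

```python
from typing import Callable

Transcriber = Callable[[bytes], str]   # audio in  -> user text
Responder = Callable[[str], str]       # user text -> reply text
Synthesizer = Callable[[str], bytes]   # reply text -> audio out


def make_voice_turn(
    stt: Transcriber, llm: Responder, tts: Synthesizer
) -> Callable[[bytes], bytes]:
    """Compose three independent stages into one conversational turn."""

    def voice_turn(audio_in: bytes) -> bytes:
        user_text = stt(audio_in)
        reply_text = llm(user_text)
        return tts(reply_text)

    return voice_turn


# Stand-in stages so the sketch runs without any external service.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_llm(text: str) -> str:
    return f"You said: {text}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


voice_turn = make_voice_turn(fake_stt, fake_llm, fake_tts)

if __name__ == "__main__":
    print(voice_turn(b"hello"))
```

Keeping the stages behind plain function signatures means a free-tier service can be swapped for another one later without touching the rest of the loop.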

Pattern 3: Fully Local

  • STT: Superwhisper (buy-once, offline)
  • LLM: a local LLM to keep everything on-device, or a cloud API if only the speech processing needs to stay local
  • TTS: VOICEVOX (free, many character voices)

Choose this for privacy plus character voices.

Pattern 4: Audio In → AI Cleanup → Reply

  • STT + cleanup: Superwhisper / Wispr Flow
  • LLM: Claude / GPT-4o (for deeper understanding and reply generation)
  • TTS: VOICEVOX / Style-Bert-VITS2

Because AI refinement runs at input time, the LLM receives cleaner prompts.
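
To illustrate what "cleanup before the LLM" means, here is a deliberately simple rule-based stand-in. It is not how Superwhisper or Wispr Flow work (they use AI models for this); the filler list and the function name are hypothetical, and the example only handles English.

```python
import re

# Hypothetical filler list; real tools use AI models rather than fixed rules.
FILLERS = ("um", "uh", "er", "ah")

# Match a whole-word filler plus any trailing comma/period and whitespace.
_FILLER_RE = re.compile(
    r"\b(?:" + "|".join(FILLERS) + r")\b[,.]?\s*", re.IGNORECASE
)


def clean_transcript(raw: str) -> str:
    """Strip fillers, normalize spaces, capitalize, and add end punctuation."""
    text = _FILLER_RE.sub("", raw)
    text = re.sub(r"\s+", " ", text).strip()
    if text and text[-1] not in ".?!":
        text += "."
    return text[:1].upper() + text[1:]


if __name__ == "__main__":
    print(clean_transcript("um, so uh what is the weather today"))
```

Even this crude version shows the benefit: the LLM sees a short, well-formed sentence instead of a raw transcript full of hesitations.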

Summary

Perspective          | Recommendation
Cost focus           | Gemini Live API or a local setup
Ease of use          | ChatGPT Plus ($20/month flat rate)
Character voices     | VOICEVOX (free, many voices)
Emotional expression | Style-Bert-VITS2
Big TTS free tier    | Google Cloud TTS (4M chars/month)
Offline              | Superwhisper + VOICEVOX
Mixed languages      | Wispr Flow

Conclusion: What’s Next

Assuming a characterful AI + avatar, here’s the plan:

TTS: VOICEVOX or Style-Bert-VITS2

  • Free, with voices that carry character
  • Easy to sync with an avatar (clear text → audio pipeline)

STT: Superwhisper or Web Speech API

  • Superwhisper is buy-once + offline + AI cleanup
  • Web Speech API is free and easy

LLM: Your preference

  • Claude, GPT-4o, Gemini, etc.

With this stack, the only recurring cost is the LLM API (plus a one-time Superwhisper license if you go that route).

Next up, I’ll research avatar integration (Live2D? VRM?).