Building a Voice-Chat AI (1): Voice API Survey

Asking Alexa “What’s the weather today?” is convenient, but it feels a bit sterile. I want to have natural conversations with an AI that has personality. Ultimately, I’d like to build an AI that talks while driving a 3D/2D avatar.

As a first step toward that goal, I looked into the available voice APIs.

Final Goal

Characterized AI (define persona via prompts)
Integration with a 3D/2D avatar (Live2D? VRM?)
Natural voice conversation

In this post, I compare the APIs that will serve as the foundation for the “voice” part.

Scope

Broadly, there are three categories:

Real-time voice chat APIs — Gemini Live, OpenAI Realtime, etc.
App-level voice modes — ChatGPT, Claude, Copilot, etc.
Standalone STT/TTS services — Google Cloud TTS, VOICEVOX, Superwhisper, etc.

Real-Time Voice Chat APIs

APIs for speaking directly with an AI. STT + LLM + TTS are integrated.

Pricing Comparison

Service	Per minute	1 hour/month	Free tier
Gemini Live API	~$0.02 (≈¥3)	~$1.2	Yes
OpenAI Realtime API	~$0.36 (≈¥54)	~$22	No

OpenAI is roughly 16× pricier than Gemini. That’s tough for hobby use.

Gemini Live API

Audio input: about $0.0045/min
Audio output: about $0.018/min
Total: about $0.0225/min
Free tier available (reduced in December 2025)
RPD (requests per day) limited to 20–250

Great cost performance, though the free tier limits are tight.

OpenAI Realtime API

Audio input: $40/1M tokens (about$ 0.12/min)
Audio output: $80/1M tokens (about$ 0.24/min)
Total: about $0.36/min
WebSocket-based, 250–300 ms latency
GPT-4o-realtime-preview model

Quality is high, but the price is steep.

App Voice Modes

Each vendor’s app has a voice chat feature. It runs inside the app rather than via your own API calls.

ChatGPT Voice Mode

Plan	Monthly	Voice mode
Free	$0	Standard (GPT-4o mini), daily limits
Plus	$20	Advanced Voice Mode
Pro	$200	Unlimited

Available on desktop browser, mobile apps, and desktop app
Supports 50 languages including Japanese
Response time 2–3 seconds
Interruptible (more natural conversation)

If you want a flat-rate, hassle-free option, ChatGPT Plus is the practical choice.

Claude Voice Mode

Available on iOS/Android apps
Available even on the free plan (since June 2025)
Currently English-only (no Japanese yet)
Five voices to choose from
Latency 300–360 ms

The lack of Japanese is unfortunate; hoping that changes.

Microsoft Copilot Voice

Free to use
Available on Windows/Mac/iOS/Android
Hands-free wake word: “Hey Copilot”
Japanese support is rolling out (accuracy still has room to improve)
Strong integration with Microsoft 365

Nice that it’s free, but Japanese support isn’t fully there yet.

TTS (Text-to-Speech) Services

Services that convert text into speech—useful for speaking an LLM’s output.

Google Cloud Text-to-Speech

Voice type	Free tier	Paid
Standard	4M chars/month	Low cost
WaveNet	1M chars/month	$16/1M chars
Neural2	1M chars/month	Same as above

Very generous free tier (4M chars/month is a lot)
Full Japanese support; multiple male and female voices
WaveNet/Neural2 produce natural speech

The large free tier is compelling. Quality is good enough for business use.

Local TTS

Options that run locally without relying on the cloud.

VOICEVOX

Japanese open source
Completely free; commercial use allowed (credit required)
No GPU required; easy to set up
30+ character voices
Full Japanese support

Top pick if you want character voices.

Style-Bert-VITS2

Excellent emotional expression
Often rated more natural than VOICEVOX
No GPU needed (only recommended for training)
Check the license (AGPLv3)

Choose this if you want richer emotion.

Coqui TTS (XTTS-v2)

Supports 17 languages
Can clone a voice from a 6-second sample
Apache 2.0 license
GPU recommended

Good when you want to reproduce a specific voice.

Piper TTS

Ultra-lightweight (runs even on Raspberry Pi)
No GPU required
Limited stock voices

Suited for embedded use.

VOICEPEAK

Paid TTS from AHS (one-time purchase around ¥20,000+)
CLI version available; automatable
Good audio quality
Slow processing (about 9 seconds even for short text)
Single concurrent instance only

A viable option if you pre-generate for quality, but not suitable for real-time conversation.

STT (Speech-to-Text) Services

Services that convert audio into text.

Cloud STT

Service	Feature	Cost
OpenAI Whisper API	High accuracy, multilingual	$0.006/min
Google Speech-to-Text	High accuracy	From $0.006/min
Web Speech API	Built into browsers	Free

Apps specialized for audio input. They don’t just transcribe; they also clean up text into natural sentences with AI.

Superwhisper

macOS/Windows/iOS
One-time purchase available (not subscription-only)
Works offline (better for privacy)
Supports Japanese (100+ languages)
Filler removal, auto punctuation, grammar cleanup

One-time license + offline support are attractive.

Wispr Flow

macOS/Windows/iOS (Android coming soon)
$15/month (free tier: 2,000 words/week)
Great with mixed languages (handles Japanese/English mixed speech)
Supports Japanese (100+ languages)
Filler removal, smart formatting

A good choice if you often mix Japanese and English when speaking.

Architecture Patterns

Based on the findings, here are several possible setups.

Pattern 1: Easy Flat-Rate

ChatGPT Plus ($20/month)

Voice chat lives entirely inside the app
Supports Japanese; good quality
Hard to integrate with an avatar (no API access)

If you just want an easy “talk with AI” experience, this is sufficient.

Pattern 2: Cost-Efficient Cloud

STT: Web Speech API (free) or Google STT
LLM: Claude / GPT-4o / Gemini (take your pick)
TTS: Google Cloud TTS (4M chars/month free)

You can go pretty far by combining free tiers.

Pattern 3: Fully Local

STT: Superwhisper (buy-once, offline)
LLM: Local LLM or a cloud API
TTS: VOICEVOX (free, many character voices)

Choose this for privacy plus character voices.

Pattern 4: Audio In → AI Cleanup → Reply

STT + cleanup: Superwhisper / Wispr Flow
LLM: Claude / GPT-4o (for deeper understanding and reply generation)
TTS: VOICEVOX / Style-Bert-VITS2

Because AI refinement runs at input time, the LLM receives cleaner prompts.

Summary

Perspective	Recommendation
Cost focus	Gemini Live API or a local setup
Ease of use	ChatGPT Plus ($20/month flat rate)
Character voices	VOICEVOX (free, many voices)
Emotional expression	Style-Bert-VITS2
Big TTS free tier	Google Cloud TTS (4M chars/month)
Offline	Superwhisper + VOICEVOX
Mixed languages	Wispr Flow

Conclusion: What’s Next

Assuming a characterful AI + avatar, here’s the plan:

TTS: VOICEVOX or Style-Bert-VITS2

Free, with voices that carry character
Easy to sync with an avatar (clear text → audio pipeline)

STT: Superwhisper or Web Speech API

Superwhisper is buy-once + offline + AI cleanup
Web Speech API is free and easy

LLM: Your preference

Claude, GPT-4o, Gemini, etc.

With this stack, you can build a voice chat setup while paying only for the LLM API.

Next up, I’ll research avatar integration (Live2D? VRM?).