Building a Voice-Chat AI (1): Voice API Survey
Asking Alexa “What’s the weather today?” is convenient, but it feels a bit sterile. I want to have natural conversations with an AI that has personality. Ultimately, I’d like to build an AI that speaks through an animated 3D/2D avatar.
As a first step toward that goal, I looked into the available voice APIs.
Final Goal
- An AI with a defined character (persona set via prompts)
- Integration with a 3D/2D avatar (Live2D? VRM?)
- Natural voice conversation
In this post, I compare the APIs that will serve as the foundation for the “voice” part.
Scope
Broadly, there are three categories:
- Real-time voice chat APIs — Gemini Live, OpenAI Realtime, etc.
- App-level voice modes — ChatGPT, Claude, Copilot, etc.
- Standalone STT/TTS services — Google Cloud TTS, VOICEVOX, Superwhisper, etc.
Real-Time Voice Chat APIs
These APIs let you talk to an AI directly: STT, the LLM, and TTS are integrated into one service.
Pricing Comparison
| Service | Per minute | Cost at 1 hour/month | Free tier |
|---|---|---|---|
| Gemini Live API | ~$0.02 (≈¥3) | ~$1.2 | Yes |
| OpenAI Realtime API | ~$0.36 (≈¥54) | ~$22 | No |
OpenAI is roughly 16× pricier than Gemini. That’s tough for hobby use.
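The monthly figures follow directly from the per-minute rates. Here is a minimal sketch of that arithmetic, using the approximate rates quoted in this post; the function name and the 60 minutes/month usage are only illustrative, and it assumes input and output are both billed for every minute, as the per-service totals do. With the unrounded Gemini rate of $0.0225/min, one hour comes to about $1.35, slightly above the table's ~$1.2, which is based on the rounded $0.02/min.

```python
# Approximate per-minute audio rates (USD) quoted in this post.
RATES_PER_MIN = {
    "Gemini Live API": {"input": 0.0045, "output": 0.018},
    "OpenAI Realtime API": {"input": 0.12, "output": 0.24},
}


def monthly_cost(service: str, minutes_per_month: float) -> float:
    """Return the estimated monthly cost in USD for a given talk time."""
    rate = RATES_PER_MIN[service]
    return (rate["input"] + rate["output"]) * minutes_per_month


if __name__ == "__main__":
    for name in RATES_PER_MIN:
        print(f"{name}: ${monthly_cost(name, 60):.2f} for 1 hour/month")
    ratio = monthly_cost("OpenAI Realtime API", 60) / monthly_cost("Gemini Live API", 60)
    print(f"OpenAI / Gemini ratio: {ratio:.0f}x")
```

Changing the `60` to your own expected minutes per month gives a quick feel for when a flat-rate plan becomes cheaper than pay-per-minute.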
Gemini Live API
- Audio input: about $0.0045/min
- Audio output: about $0.018/min
- Total: about $0.0225/min
- Free tier available (reduced in December 2025)
- Free tier capped at 20–250 RPD (requests per day)
Excellent value for the price, though the free-tier limits are tight.
OpenAI Realtime API
- Audio input: about $0.12/min
- Audio output: about $0.24/min
- Total: about $0.36/min
- WebSocket-based, 250–300 ms latency
- Model: gpt-4o-realtime-preview
Quality is high, but the price is steep.
App Voice Modes
Each vendor’s app has a voice chat feature. It runs inside the app rather than via your own API calls.
ChatGPT Voice Mode
| Plan | Monthly | Voice mode |
|---|---|---|
| Free | $0 | Standard (GPT-4o mini), daily limits |
| Plus | $20 | Advanced Voice Mode |
| Pro | $200 | Unlimited |
- Available in desktop browsers, the mobile apps, and the desktop app
- Supports 50 languages including Japanese
- Response time 2–3 seconds
- Interruptible (more natural conversation)
If you want a flat-rate, hassle-free option, ChatGPT Plus is the practical choice.
Claude Voice Mode
- Available on iOS/Android apps
- Available even on the free plan (since June 2025)
- Currently English-only (no Japanese yet)
- Five voices to choose from
- Latency 300–360 ms
The lack of Japanese is unfortunate; I hope that changes.
Microsoft Copilot Voice
- Free to use
- Available on Windows/Mac/iOS/Android
- Hands-free wake word: “Hey Copilot”
- Japanese support is rolling out (accuracy still has room to improve)
- Strong integration with Microsoft 365
Nice that it’s free, but Japanese support isn’t fully there yet.
TTS (Text-to-Speech) Services
Services that convert text into speech, which is what you need to read an LLM’s output aloud.
Google Cloud Text-to-Speech
| Voice type | Free tier | Paid |
|---|---|---|
| Standard | 4M chars/month | Low cost |
| WaveNet | 1M chars/month | $16/1M chars |
| Neural2 | 1M chars/month | Same as above |
- Very generous free tier for Standard voices (4M chars/month is a lot); WaveNet and Neural2 get 1M chars/month
- Full Japanese support; multiple male and female voices
- WaveNet/Neural2 produce natural speech
The large free tier is compelling. Quality is good enough for business use.
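To get a feel for how far the free tier goes, here is a rough sketch of the arithmetic. The free allowances and the $16/1M price come from the table above; the speaking rate of 400 characters per minute is purely my assumption for illustration, so substitute your own measurement.

```python
def tts_monthly_cost(chars_used: int, free_chars: int, usd_per_million: float) -> float:
    """Cost in USD after subtracting the monthly free allowance."""
    billable = max(0, chars_used - free_chars)
    return billable / 1_000_000 * usd_per_million


def speech_hours(chars: int, chars_per_minute: float) -> float:
    """Rough hours of speech for a character budget (the rate is an assumption)."""
    return chars / chars_per_minute / 60


if __name__ == "__main__":
    # WaveNet: 1.5M chars in a month, 1M free, $16 per 1M beyond that.
    print(tts_monthly_cost(1_500_000, 1_000_000, 16.0))
    # Standard free tier expressed as hours at an ASSUMED 400 chars/min.
    print(round(speech_hours(4_000_000, 400), 1))
```

Under that assumed rate, the 4M-character Standard allowance works out to well over a hundred hours of speech per month, far more than a hobby project is likely to use.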
Local TTS
Options that run locally without relying on the cloud.
VOICEVOX
- Japanese open source
- Completely free; commercial use allowed (credit required)
- No GPU required; easy to set up
- 30+ character voices
- Full Japanese support
Top pick if you want character voices.
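VOICEVOX is easy to drive from code because the engine exposes a local HTTP API with a two-step flow: `audio_query` builds the synthesis parameters, then `synthesis` returns WAV audio. The sketch below assumes the engine is running on its default local address (`127.0.0.1:50021`); the helper names `endpoint` and `synthesize` are mine, and the speaker ID `1` is only a placeholder for whichever voice style you pick.

```python
import json
import urllib.parse
import urllib.request

ENGINE_URL = "http://127.0.0.1:50021"  # assumed: default local VOICEVOX engine address


def endpoint(path: str, **params) -> str:
    """Build an engine URL with URL-encoded query parameters."""
    return f"{ENGINE_URL}/{path}?{urllib.parse.urlencode(params)}"


def synthesize(text: str, speaker: int = 1) -> bytes:
    """Return WAV bytes for `text` using the two-step query/synthesis flow."""
    # Step 1: ask the engine for synthesis parameters (accent, pitch, speed).
    req = urllib.request.Request(
        endpoint("audio_query", text=text, speaker=speaker), method="POST"
    )
    with urllib.request.urlopen(req) as res:
        query = json.load(res)

    # Step 2: send those parameters back to get the audio itself.
    req = urllib.request.Request(
        endpoint("synthesis", speaker=speaker),
        data=json.dumps(query).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as res:
        return res.read()
```

With the engine running, `synthesize("こんにちは")` should return WAV bytes that you can write to a file or hand to an audio player. The intermediate query is also where you could adjust speed or pitch before synthesis.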
Style-Bert-VITS2
- Excellent emotional expression
- Often rated more natural than VOICEVOX
- No GPU needed for inference (a GPU is recommended only for training)
- Check the license (AGPLv3)
Choose this if you want richer emotion.
Coqui TTS (XTTS-v2)
- Supports 17 languages
- Can clone a voice from a 6-second sample
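- Check the license: the Coqui TTS code is MPL 2.0, while the XTTS-v2 model weights use the Coqui Public Model License (non-commercial)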
- GPU recommended
Good when you want to reproduce a specific voice.
Piper TTS
- Ultra-lightweight (runs even on Raspberry Pi)
- No GPU required
- Limited stock voices
Suited for embedded use.
VOICEPEAK
- Paid TTS from AHS (one-time purchase around ¥20,000+)
- CLI version available; automatable
- Good audio quality
- Slow processing (about 9 seconds even for short text)
- Only one instance can run at a time
A viable option if you pre-generate for quality, but not suitable for real-time conversation.
STT (Speech-to-Text) Services
Services that convert audio into text.
Cloud STT
| Service | Feature | Cost |
|---|---|---|
| OpenAI Whisper API | High accuracy, multilingual | $0.006/min |
| Google Speech-to-Text | High accuracy | From $0.006/min |
| Web Speech API | Built into browsers | Free |
Local STT + AI Refinement Tools
Apps specialized for voice input (dictation). They don’t just transcribe; they also use AI to clean up the text into natural sentences.
Superwhisper
- macOS/Windows/iOS
- One-time purchase available (not subscription-only)
- Works offline (better for privacy)
- Supports Japanese (100+ languages)
- Filler removal, auto punctuation, grammar cleanup
The one-time license and offline support are attractive.
Wispr Flow
- macOS/Windows/iOS (Android coming soon)
- $15/month (free tier: 2,000 words/week)
- Great with mixed languages (handles Japanese/English mixed speech)
- Supports Japanese (100+ languages)
- Filler removal, smart formatting
A good choice if you often mix Japanese and English when speaking.
Architecture Patterns
Based on the findings, here are several possible setups.
Pattern 1: Easy Flat-Rate
ChatGPT Plus ($20/month)
- Voice chat lives entirely inside the app
- Supports Japanese; good quality
- Hard to integrate with an avatar (no API access)
If you just want an easy “talk with AI” experience, this is sufficient.
Pattern 2: Cost-Efficient Cloud
- STT: Web Speech API (free) or Google STT
- LLM: Claude / GPT-4o / Gemini (take your pick)
- TTS: Google Cloud TTS (4M chars/month free for Standard voices)
You can go pretty far by combining free tiers.
Pattern 3: Fully Local
- STT: Superwhisper (buy-once, offline)
- LLM: Local LLM or a cloud API
- TTS: VOICEVOX (free, many character voices)
Choose this for privacy plus character voices. Note that it is fully local only if the LLM also runs on your machine.
Pattern 4: Audio In → AI Cleanup → Reply
- STT + cleanup: Superwhisper / Wispr Flow
- LLM: Claude / GPT-4o (for deeper understanding and reply generation)
- TTS: VOICEVOX / Style-Bert-VITS2
Because AI refinement runs at input time, the LLM receives cleaner prompts.
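Patterns 2 through 4 share the same shape: audio goes through STT, the transcript goes to an LLM, and the reply goes through TTS. A minimal sketch of that shape follows, with each stage as a plain function so engines can be swapped freely. All names here are hypothetical, and the `fake_*` stages are stand-ins so the example runs without any service installed.

```python
from typing import Callable

# Each stage is just a function, so any engine from the patterns above can be
# swapped in without touching the turn logic. All names are hypothetical.
SpeechToText = Callable[[bytes], str]   # audio in -> transcript
ChatModel = Callable[[str], str]        # transcript -> reply text
TextToSpeech = Callable[[str], bytes]   # reply text -> audio out


def voice_turn(audio: bytes, stt: SpeechToText, llm: ChatModel, tts: TextToSpeech) -> bytes:
    """Run one conversational turn: listen, think, speak."""
    transcript = stt(audio)
    reply = llm(transcript)
    return tts(reply)


# Stand-in stages: they treat the "audio" as UTF-8 text for demonstration only.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_llm(prompt: str) -> str:
    return f"You said: {prompt}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


if __name__ == "__main__":
    print(voice_turn(b"hello", fake_stt, fake_llm, fake_tts))
```

Keeping the stages separate like this is also what makes avatar integration plausible later: the reply text is available before audio is generated, so it can drive lip sync or expressions.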
Summary
| Perspective | Recommendation |
|---|---|
| Cost focus | Gemini Live API or a local setup |
| Ease of use | ChatGPT Plus ($20/month flat rate) |
| Character voices | VOICEVOX (free, many voices) |
| Emotional expression | Style-Bert-VITS2 |
| Big TTS free tier | Google Cloud TTS (4M chars/month, Standard voices) |
| Offline | Superwhisper + VOICEVOX |
| Mixed languages | Wispr Flow |
Conclusion: What’s Next
Assuming the goal is an AI with character plus an avatar, here’s the plan:
TTS: VOICEVOX or Style-Bert-VITS2
- Free, with voices that carry character
- Easy to sync with an avatar, because the text → audio pipeline is explicit
STT: Superwhisper or Web Speech API
- Superwhisper is buy-once + offline + AI cleanup
- Web Speech API is free and easy
LLM: Your preference
- Claude, GPT-4o, Gemini, etc.
With this stack, the LLM API is the only recurring cost (Superwhisper, if you choose it, is a one-time purchase).
Next up, I’ll research avatar integration (Live2D? VRM?).