Trying to Have a Voice Conversation with AI
What I Want to Build
I want to build a system I can speak to and have the AI answer in voice.
Specifically:
- Speech recognition to transcribe audio to text
- LLM to generate a response
- Text-to-speech to read it aloud
Since it’s a real-time conversation, response latency matters a lot.
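The three pieces above chain into one conversational turn. As a rough sketch of that flow (the callback names `transcribe`, `generateReply`, and `speak` are placeholders I'm using for illustration, not real APIs):

```typescript
// One conversation turn: STT → LLM → TTS. The three callbacks are
// hypothetical stand-ins for the real components, injected as parameters.
type Turn = { user: string; assistant: string };

async function handleTurn(
  audio: ArrayBuffer,
  history: Turn[],
  transcribe: (audio: ArrayBuffer) => Promise<string>,               // STT
  generateReply: (history: Turn[], user: string) => Promise<string>, // LLM
  speak: (text: string) => Promise<void>,                            // TTS
): Promise<Turn> {
  const user = await transcribe(audio);                 // audio → text
  const assistant = await generateReply(history, user); // text → reply
  await speak(assistant);                               // reply → voice
  return { user, assistant };
}
```

Every `await` in that chain adds to the gap between the user finishing a sentence and hearing a reply, which is why each stage's latency matters.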
A Base Project I Found
I came across irelate-ai/voice-chat on GitHub.
Its default setup:
- Speech recognition: Whisper (in-browser)
- VAD: Silero (voice activity detection — automatically detects when you stop speaking)
- LLM: WebLLM (in-browser Qwen 1.5B)
- TTS: Supertonic (English-focused)
What’s interesting is that the whole thing is designed to run entirely in the browser, no external APIs required.
A demo is available at https://huggingface.co/spaces/RickRossTN/ai-voice-chat.
The Problems
- In-browser LLM is slow: Requires downloading a ~900MB model, and inference is sluggish.
- TTS is English-only: Supertonic can’t handle Japanese properly.
Since I want Japanese voice conversation, both components need to be swapped out.
LLM Choice: Gemini 2.0 Flash
For the user experience, sending requests to a fast API beats running inference locally in the browser.
I looked at a few options:
- Claude API
- OpenAI API
- Gemini API
- Ollama (local)
Why Gemini?
The free tier is generous:
- Gemini 2.0 Flash: 15 requests/minute, up to 1,500 requests/day free
- Gemini 1.5 Pro: 2 requests/minute, up to 50 requests/day free
And it’s cheap when you do pay ($0.10 per million input tokens, $0.40 per million output tokens).
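At those rates, a back-of-envelope check (`costUsd` is just an illustrative helper) shows why the price barely matters for chat-sized traffic:

```typescript
// Paid-tier rates quoted above: $0.10 per 1M input tokens,
// $0.40 per 1M output tokens.
function costUsd(inputTokens: number, outputTokens: number): number {
  return (inputTokens / 1e6) * 0.1 + (outputTokens / 1e6) * 0.4;
}

// e.g. 500 conversational turns at roughly 200 input / 100 output tokens
// each is 100k input + 50k output tokens — about $0.03 for the day.
const dailyCost = costUsd(500 * 200, 500 * 100);
```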
Why Flash?
Pro is overkill. Flash is actually the better choice here.
- Response speed matters enormously for voice conversation
- Casual conversation doesn’t require Pro-level intelligence
- Pro is for complex reasoning and long-form analysis — unnecessary here
Speed wins.
An API key is issued instantly from Google AI Studio with a Google account login.
TTS Choice: VOICEVOX
The next question was how to handle Japanese text-to-speech.
Options I Considered
| Method | Speed | Japanese Quality | Effort |
|---|---|---|---|
| VOICEVOX | Fast | Good | Easy (API built-in) |
| VOICEPEAK CLI | Slow | Good | Awkward |
| Gemini Live API | Fast | Unknown | Needs testing |
| Google Cloud TTS | Fast | Good | API billing |
| Style-Bert-VITS2 | Fast | Good | Requires training |
VOICEPEAK Doesn’t Work Here
I actually own VOICEPEAK, so it was my first thought.
But VOICEPEAK has no API. You either use the GUI manually or output files via CLI.
A CLI call like `voicepeak -s "text" -o output.wav` works, but it goes through a file, which makes it unsuited to real-time conversation: generation takes a few seconds, plus the file write → read → playback overhead.
I looked into whether you could use just the library component externally, but it’s designed to work only with its own engine in a proprietary format. Reverse engineering is prohibited by the terms of service.
VOICEVOX Is the Clear Choice
VOICEVOX installs, launches, and immediately runs an API server at `localhost:50021`.
```bash
# Create synthesis query (save it for the next step)
curl -X POST "localhost:50021/audio_query?text=Hello&speaker=8" > query.json

# Generate audio from the saved query
curl -X POST "localhost:50021/synthesis?speaker=8" \
  -H "Content-Type: application/json" \
  -d @query.json > output.wav
```
It’s free, zero configuration, and there’s no other real option.
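From application code, the same two-step flow looks roughly like this, assuming the VOICEVOX engine is listening on `localhost:50021` (`audioQueryUrl` and `synthesize` are illustrative names, not part of any existing client library):

```typescript
// Two-step VOICEVOX synthesis: /audio_query builds the query,
// /synthesis turns it into WAV bytes.
const VOICEVOX = "http://localhost:50021";
const SPEAKER = 8; // Kasukabe Tsumugi

// Build the /audio_query URL (the text must be URL-encoded).
function audioQueryUrl(base: string, text: string, speaker: number): string {
  return `${base}/audio_query?text=${encodeURIComponent(text)}&speaker=${speaker}`;
}

async function synthesize(text: string): Promise<ArrayBuffer> {
  // Step 1: POST /audio_query returns a synthesis query (pitch, speed, accents).
  const queryRes = await fetch(audioQueryUrl(VOICEVOX, text, SPEAKER), {
    method: "POST",
  });
  const query = await queryRes.json();

  // Step 2: POST the query back to /synthesis to get the WAV.
  const audioRes = await fetch(`${VOICEVOX}/synthesis?speaker=${SPEAKER}`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(query),
  });
  return audioRes.arrayBuffer();
}
```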
Speaker Selection
VOICEVOX has a bunch of characters. I listened through several looking for a voice close to how I imagine my own character, and landed on Kasukabe Tsumugi (speaker=8). Bright and energetic — that feels right.
https://voicevox.hiroshiba.jp/product/kasukabe_tsumugi/
Modification Plan
Change 1: Swap LLM to Gemini 2.0 Flash
- Remove WebLLM-related code (`use-webllm.ts`, model download handling)
- Create a new `/api/chat` endpoint
- Call the Gemini 2.0 Flash API
- Read the API key from the `GEMINI_API_KEY` environment variable
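As a sketch of what the new endpoint's core would do, here is a minimal Gemini 2.0 Flash call over the public REST `generateContent` route (`buildRequest` and `chat` are hypothetical helper names):

```typescript
// Minimal Gemini 2.0 Flash call over REST.
const GEMINI_URL =
  "https://generativelanguage.googleapis.com/v1beta/models/gemini-2.0-flash:generateContent";

// Shape the single-turn request body generateContent expects.
function buildRequest(prompt: string) {
  return { contents: [{ role: "user", parts: [{ text: prompt }] }] };
}

// Send the prompt and pull the reply text out of the first candidate.
async function chat(prompt: string, apiKey: string): Promise<string> {
  const res = await fetch(`${GEMINI_URL}?key=${apiKey}`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(buildRequest(prompt)),
  });
  const data = await res.json();
  return data.candidates[0].content.parts[0].text;
}
```

In the actual app the key would come from the `GEMINI_API_KEY` environment variable, per the plan above.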
Change 2: Swap TTS to VOICEVOX
- Remove Supertonic TTS
- Call the VOICEVOX API (`localhost:50021`)
- Speaker: Kasukabe Tsumugi (speaker=8)
- Two-step process: `/audio_query` → `/synthesis`
What Gets Removed
- WebLLM-related code (`use-webllm.ts`, model download handling)
- Supertonic TTS
- English voice data (`public/voices/`)
What Stays
- Whisper STT (speech recognition)
- Silero VAD (voice activity detection)
- Basic UI
Before and After
| Function | Before | After |
|---|---|---|
| Speech recognition | Whisper | Whisper (unchanged) |
| VAD | Silero | Silero (unchanged) |
| LLM | WebLLM (Qwen 1.5B) | Gemini 2.0 Flash API |
| TTS | Supertonic (English) | VOICEVOX (Kasukabe Tsumugi) |
Hardware
My machine is an RTX 4060 Laptop (8GB VRAM).
Project requirements:
- WebGPU-capable browser (Chrome/Edge)
- ~4GB RAM
No problem for the 4060 Laptop. WebGPU just means the browser uses the GPU — nothing special needed.
With the Gemini API-based config, the LLM part is offloaded to the cloud, so local load is even lighter:
- Whisper (STT): 150MB, fast on local GPU
- VAD: 2MB, minimal overhead
- TTS: VOICEVOX (separate process)
- LLM: Gemini API, zero local load
The main bottleneck will just be network latency.
Next
Actually modify the code and get it running.