Trying to Have a Voice Conversation with AI
What I Want to Build
I want to build a system I can speak to and have the AI answer in voice.
Specifically:
- Speech recognition to transcribe audio to text
- LLM to generate a response
- Text-to-speech to read it aloud
Since it’s a real-time conversation, response latency matters a lot.
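The three pieces above chain into one conversational turn. As a rough sketch of that flow (the callback names `transcribe`, `generateReply`, and `speak` are placeholders I'm using for illustration, not real APIs):

```typescript
// One conversation turn: STT → LLM → TTS. The three callbacks are
// hypothetical stand-ins for the real components, injected as parameters.
type Turn = { user: string; assistant: string };

async function handleTurn(
  audio: ArrayBuffer,
  history: Turn[],
  transcribe: (audio: ArrayBuffer) => Promise<string>,               // STT
  generateReply: (history: Turn[], user: string) => Promise<string>, // LLM
  speak: (text: string) => Promise<void>,                            // TTS
): Promise<Turn> {
  const user = await transcribe(audio);                 // audio → text
  const assistant = await generateReply(history, user); // text → reply
  await speak(assistant);                               // reply → voice
  return { user, assistant };
}
```

Every `await` in that chain adds to the gap between the user finishing a sentence and hearing a reply, which is why each stage's latency matters.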
A Base Project I Found
I came across irelate-ai/voice-chat on GitHub.
Its default setup:
- Speech recognition: Whisper (in-browser)
- VAD: Silero (voice activity detection — automatically detects when you stop speaking)
- LLM: WebLLM (in-browser Qwen 1.5B)
- TTS: Supertonic (English-focused)
What’s interesting is that the whole thing is designed to run entirely in the browser, no external APIs required.
A demo is available at https://huggingface.co/spaces/RickRossTN/ai-voice-chat.
The Problems
- In-browser LLM is slow: Requires downloading a ~900MB model, and inference is sluggish.
- TTS is English-only: Supertonic can’t handle Japanese properly.
Since I want Japanese voice conversation, both components need to be swapped out.
LLM Choice: Gemini 2.0 Flash
For the user experience, sending requests to a fast API beats running inference locally in the browser.
I looked at a few options:
- Claude API
- OpenAI API
- Gemini API
- Ollama (local)
Why Gemini?
The free tier is generous:
- Gemini 2.0 Flash: 15 requests/minute, up to 1,500 requests/day free
- Gemini 1.5 Pro: 2 requests/minute, up to 50 requests/day free
And it’s cheap when you do pay ($0.10 per million input tokens, $0.40 per million output tokens).
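At those rates, a back-of-envelope check (`costUsd` is just an illustrative helper) shows why the price barely matters for chat-sized traffic:

```typescript
// Paid-tier rates quoted above: $0.10 per 1M input tokens,
// $0.40 per 1M output tokens.
function costUsd(inputTokens: number, outputTokens: number): number {
  return (inputTokens / 1e6) * 0.1 + (outputTokens / 1e6) * 0.4;
}

// e.g. 500 conversational turns at roughly 200 input / 100 output tokens
// each is 100k input + 50k output tokens — about $0.03 for the day.
const dailyCost = costUsd(500 * 200, 500 * 100);
```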
Why Flash?
Pro is overkill. Flash is actually the better choice here.
- Response speed matters enormously for voice conversation
- Casual conversation doesn’t require Pro-level intelligence
- Pro is for complex reasoning and long-form analysis — unnecessary here
Speed wins.
An API key is issued instantly from Google AI Studio with a Google account login.
TTS Choice: VOICEVOX
The next question was how to handle Japanese text-to-speech.
Options I Considered
| Method | Speed | Japanese Quality | Effort |
|---|---|---|---|
| VOICEVOX | Fast | Good | Easy (API built-in) |
| VOICEPEAK CLI | Slow | Good | Awkward |
| Gemini Live API | Fast | Unknown | Needs testing |
| Google Cloud TTS | Fast | Good | API billing |
| Style-Bert-VITS2 | Fast | Good | Requires training |
VOICEPEAK Doesn’t Work Here
I actually own VOICEPEAK, so it was my first thought.
But VOICEPEAK has no API. You either use the GUI manually or output files via CLI.
A CLI call like `voicepeak -s "text" -o output.wav` works, but it goes through a file, which makes it unsuited to real-time conversation: generation takes a few seconds, plus the file write → read → playback overhead.
I looked into whether you could use just the library component externally, but it’s designed to work only with its own engine in a proprietary format. Reverse engineering is prohibited by the terms of service.
VOICEVOX Is the Clear Choice
VOICEVOX installs, launches, and immediately runs an API server at `localhost:50021`.
```bash
# Create synthesis query (save it for the next step)
curl -X POST "localhost:50021/audio_query?text=Hello&speaker=8" > query.json

# Generate audio from the saved query
curl -X POST "localhost:50021/synthesis?speaker=8" \
  -H "Content-Type: application/json" \
  -d @query.json > output.wav
```
It’s free, zero configuration, and there’s no other real option.
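From application code, the same two-step flow looks roughly like this, assuming the VOICEVOX engine is listening on `localhost:50021` (`audioQueryUrl` and `synthesize` are illustrative names, not part of any existing client library):

```typescript
// Two-step VOICEVOX synthesis: /audio_query builds the query,
// /synthesis turns it into WAV bytes.
const VOICEVOX = "http://localhost:50021";
const SPEAKER = 8; // Kasukabe Tsumugi

// Build the /audio_query URL (the text must be URL-encoded).
function audioQueryUrl(base: string, text: string, speaker: number): string {
  return `${base}/audio_query?text=${encodeURIComponent(text)}&speaker=${speaker}`;
}

async function synthesize(text: string): Promise<ArrayBuffer> {
  // Step 1: POST /audio_query returns a synthesis query (pitch, speed, accents).
  const queryRes = await fetch(audioQueryUrl(VOICEVOX, text, SPEAKER), {
    method: "POST",
  });
  const query = await queryRes.json();

  // Step 2: POST the query back to /synthesis to get the WAV.
  const audioRes = await fetch(`${VOICEVOX}/synthesis?speaker=${SPEAKER}`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(query),
  });
  return audioRes.arrayBuffer();
}
```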
Speaker Selection
VOICEVOX has a bunch of characters. I listened through several looking for a voice close to how I imagine my own character, and landed on Kasukabe Tsumugi (speaker=8). Bright and energetic — that feels right.
https://voicevox.hiroshiba.jp/product/kasukabe_tsumugi/
Modification Plan
Change 1: Swap LLM to Gemini 2.0 Flash
- Remove WebLLM-related code (`use-webllm.ts`, model download handling)
- Create a new `/api/chat` endpoint
- Call the Gemini 2.0 Flash API
- Read the API key from the `GEMINI_API_KEY` environment variable
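As a sketch of what the new endpoint's core would do, here is a minimal Gemini 2.0 Flash call over the public REST `generateContent` route (`buildRequest` and `chat` are hypothetical helper names):

```typescript
// Minimal Gemini 2.0 Flash call over REST.
const GEMINI_URL =
  "https://generativelanguage.googleapis.com/v1beta/models/gemini-2.0-flash:generateContent";

// Shape the single-turn request body generateContent expects.
function buildRequest(prompt: string) {
  return { contents: [{ role: "user", parts: [{ text: prompt }] }] };
}

// Send the prompt and pull the reply text out of the first candidate.
async function chat(prompt: string, apiKey: string): Promise<string> {
  const res = await fetch(`${GEMINI_URL}?key=${apiKey}`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(buildRequest(prompt)),
  });
  const data = await res.json();
  return data.candidates[0].content.parts[0].text;
}
```

In the actual app the key would come from the `GEMINI_API_KEY` environment variable, per the plan above.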
Change 2: Swap TTS to VOICEVOX
- Remove Supertonic TTS
- Call the VOICEVOX API (`localhost:50021`)
- Speaker: Kasukabe Tsumugi (speaker=8)
- Two-step process: `/audio_query` → `/synthesis`
What Gets Removed
- WebLLM-related code (`use-webllm.ts`, model download handling)
- Supertonic TTS
- English voice data (`public/voices/`)
What Stays
- Whisper STT (speech recognition)
- Silero VAD (voice activity detection)
- Basic UI
Before and After
| Function | Before | After |
|---|---|---|
| Speech recognition | Whisper | Whisper (unchanged) |
| VAD | Silero | Silero (unchanged) |
| LLM | WebLLM (Qwen 1.5B) | Gemini 2.0 Flash API |
| TTS | Supertonic (English) | VOICEVOX (Kasukabe Tsumugi) |
Hardware
My machine is an RTX 4060 Laptop (8GB VRAM).
Project requirements:
- WebGPU-capable browser (Chrome/Edge)
- ~4GB RAM
No problem for the 4060 Laptop. WebGPU just means the browser uses the GPU — nothing special needed.
With the Gemini API-based config, the LLM part is offloaded to the cloud, so local load is even lighter:
- Whisper (STT): 150MB, fast on local GPU
- VAD: 2MB, minimal overhead
- TTS: VOICEVOX (separate process)
- LLM: Gemini API, zero local load
The main bottleneck will just be network latency.
Next
Actually modify the code and get it running.