
Trying to Have a Voice Conversation with AI

What I Want to Build

I want to build a system I can talk to, with the AI answering in voice.

Specifically:

  • Speech recognition to transcribe audio to text
  • LLM to generate a response
  • Text-to-speech to read it aloud

Since it’s a real-time conversation, response latency matters a lot.
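The three steps above form one conversation turn. A minimal sketch in TypeScript, with the actual engines injected as functions (the names here are placeholders of mine, not the project's API):

```typescript
// One conversation turn: STT → LLM → TTS.
// The concrete implementations (Whisper, the LLM, the TTS engine)
// are injected, so each stage can be swapped independently.
type Turn = {
  transcribe: (audio: ArrayBuffer) => Promise<string>; // speech recognition
  generate: (text: string) => Promise<string>;         // LLM response
  speak: (text: string) => Promise<ArrayBuffer>;       // text-to-speech
};

// Runs one turn and returns the reply audio plus both transcripts,
// so the UI can show what was heard and what was answered.
async function runTurn(deps: Turn, audio: ArrayBuffer) {
  const userText = await deps.transcribe(audio);
  const replyText = await deps.generate(userText);
  const replyAudio = await deps.speak(replyText);
  return { userText, replyText, replyAudio };
}
```

Total latency is the sum of the three awaits, which is why each stage has to be fast on its own.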

A Base Project I Found

I came across irelate-ai/voice-chat on GitHub.

Its default setup:

  • Speech recognition: Whisper (in-browser)
  • VAD: Silero (voice activity detection — automatically detects when you stop speaking)
  • LLM: WebLLM (in-browser Qwen 1.5B)
  • TTS: Supertonic (English-focused)

What’s interesting is that the whole thing is designed to run entirely in the browser, no external APIs required.

A demo is available at https://huggingface.co/spaces/RickRossTN/ai-voice-chat.

The Problems

  1. In-browser LLM is slow: Requires downloading a ~900MB model, and inference is sluggish.
  2. TTS is English-only: Supertonic can’t handle Japanese properly.

Since I want Japanese voice conversation, both components need to be swapped out.

LLM Choice: Gemini 2.0 Flash

For the user experience, sending requests to a fast API beats running inference locally in the browser.

I looked at a few options:

  • Claude API
  • OpenAI API
  • Gemini API
  • Ollama (local)

Why Gemini?

The free tier is generous:

  • Gemini 2.0 Flash: 15 requests/minute, up to 1,500 requests/day free
  • Gemini 1.5 Pro: 2 requests/minute, up to 50 requests/day free

And it’s cheap when you do pay ($0.10 per million input tokens, $0.40 per million output tokens).
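A quick back-of-the-envelope check of what those rates mean in practice (the per-turn token counts below are my own rough assumptions, not measurements):

```typescript
// Quoted Gemini 2.0 Flash paid rates:
// $0.10 per 1M input tokens, $0.40 per 1M output tokens.
const INPUT_PER_M = 0.1;
const OUTPUT_PER_M = 0.4;

function estimateCostUSD(inputTokens: number, outputTokens: number): number {
  return (inputTokens / 1e6) * INPUT_PER_M + (outputTokens / 1e6) * OUTPUT_PER_M;
}

// Say 100 turns a day at roughly 500 input / 200 output tokens per turn:
const daily = estimateCostUSD(100 * 500, 100 * 200);
console.log(daily.toFixed(4)); // → 0.0130, i.e. about 1.3 cents a day
```

Even if the free tier ran out, a daily chat habit costs pennies.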

Why Flash?

Pro is overkill. Flash is actually the better choice here.

  • Response speed matters enormously for voice conversation
  • Casual conversation doesn’t require Pro-level intelligence
  • Pro is for complex reasoning and long-form analysis — unnecessary here

Speed wins.

An API key is issued instantly from Google AI Studio with a Google account login.
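A sketch of what the call itself looks like, assuming the public generateContent REST endpoint (helper names are mine; error handling omitted for brevity):

```typescript
// Gemini 2.0 Flash via the generateContent REST endpoint.
const GEMINI_URL =
  "https://generativelanguage.googleapis.com/v1beta/models/gemini-2.0-flash:generateContent";

// Builds the request body in the shape generateContent expects.
function buildGeminiBody(prompt: string) {
  return { contents: [{ role: "user", parts: [{ text: prompt }] }] };
}

async function askGemini(prompt: string, apiKey: string): Promise<string> {
  const res = await fetch(`${GEMINI_URL}?key=${apiKey}`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(buildGeminiBody(prompt)),
  });
  const data: any = await res.json();
  // The reply text sits in the first candidate's first part.
  return data.candidates[0].content.parts[0].text;
}
```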

TTS Choice: VOICEVOX

The question now is how to handle Japanese text-to-speech.

Options I Considered

| Method | Speed | Japanese Quality | Effort |
| --- | --- | --- | --- |
| VOICEVOX | Fast | Good | Easy (API built-in) |
| VOICEPEAK CLI | Slow | Good | Awkward |
| Gemini Live API | Fast | Unknown | Needs testing |
| Google Cloud TTS | Fast | Good | API billing |
| Style-Bert-VITS2 | Fast | Good | Requires training |

VOICEPEAK Doesn’t Work Here

I actually own VOICEPEAK, so it was my first thought.

But VOICEPEAK has no API. You either use the GUI manually or output files via CLI.

CLI like voicepeak -s "text" -o output.wav works, but it goes through a file — not suited for real-time conversation. It takes a few seconds to generate, plus the file write → read → playback overhead.

I looked into whether you could use just the library component externally, but it’s designed to work only with its own engine in a proprietary format. Reverse engineering is prohibited by the terms of service.

VOICEVOX Is the Clear Choice

VOICEVOX installs, launches, and immediately runs an API server at localhost:50021.

# Create synthesis query
curl -X POST "localhost:50021/audio_query?text=Hello&speaker=8" > query.json

# Generate audio
curl -X POST "localhost:50021/synthesis?speaker=8" \
  -H "Content-Type: application/json" \
  -d @query.json > output.wav

It’s free, zero configuration, and there’s no other real option.
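The same two-step flow, sketched as a fetch-based TypeScript client (the helper names are mine, not the project's):

```typescript
// VOICEVOX runs an HTTP API at localhost:50021 once the app is launched.
const VOICEVOX = "http://localhost:50021";

// Builds the query string shared by both endpoints.
function voicevoxParams(text: string, speaker: number): string {
  return new URLSearchParams({ text, speaker: String(speaker) }).toString();
}

async function synthesize(text: string, speaker = 8): Promise<ArrayBuffer> {
  // Step 1: /audio_query builds a synthesis query from the text.
  const queryRes = await fetch(
    `${VOICEVOX}/audio_query?${voicevoxParams(text, speaker)}`,
    { method: "POST" },
  );
  const query = await queryRes.json();

  // Step 2: /synthesis turns that query into WAV audio.
  const audioRes = await fetch(`${VOICEVOX}/synthesis?speaker=${speaker}`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(query),
  });
  return audioRes.arrayBuffer();
}
```

Because everything stays in memory, there is no file write → read → playback detour like with the VOICEPEAK CLI.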

Speaker Selection

VOICEVOX has a bunch of characters. I listened through several looking for a voice close to how I imagine my own character, and landed on Kasukabe Tsumugi (speaker=8). Bright and energetic — that feels right.

https://voicevox.hiroshiba.jp/product/kasukabe_tsumugi/

Modification Plan

Change 1: Swap LLM to Gemini 2.0 Flash

  • Remove WebLLM-related code (use-webllm.ts, model download handling)
  • Create a new /api/chat endpoint
  • Call the Gemini 2.0 Flash API
  • Read the API key from environment variable GEMINI_API_KEY

Change 2: Swap TTS to VOICEVOX

  • Remove Supertonic TTS
  • Call VOICEVOX API (localhost:50021)
  • Speaker: Kasukabe Tsumugi (speaker=8)
  • Two-step process: /audio_query → /synthesis

What Gets Removed

  • WebLLM-related code (use-webllm.ts, model download handling)
  • Supertonic TTS
  • English voice data (public/voices/)

What Stays

  • Whisper STT (speech recognition)
  • Silero VAD (voice activity detection)
  • Basic UI

Before and After

| Function | Before | After |
| --- | --- | --- |
| Speech recognition | Whisper | Whisper (unchanged) |
| VAD | Silero | Silero (unchanged) |
| LLM | WebLLM (Qwen 1.5B) | Gemini 2.0 Flash API |
| TTS | Supertonic (English) | VOICEVOX (Kasukabe Tsumugi) |

Hardware

My machine is an RTX 4060 Laptop (8GB VRAM).

Project requirements:

  • WebGPU-capable browser (Chrome/Edge)
  • ~4GB RAM

No problem for the 4060 Laptop. WebGPU just means the browser uses the GPU — nothing special needed.

With the Gemini API-based config, the LLM part is offloaded to the cloud, so local load is even lighter:

  • Whisper (STT): 150MB, fast on local GPU
  • VAD: 2MB, minimal overhead
  • TTS: VOICEVOX (separate process)
  • LLM: Gemini API, zero local load

The main bottleneck will just be network latency.

Next

Actually modify the code and get it running.