
Building a Talkable AI Environment (2): Voice Input Implementation

In the previous research post, I compared the available voice API options. This time I’ll dig into the implementation side: how to actually capture audio in the browser.

Two Approaches to Voice Input

There are two main ways to implement voice input in a browser:

  1. Web Speech API — a built-in browser API that handles recognition automatically and converts speech to text
  2. MediaRecorder — records audio and returns it as a Blob, which you then send to an API yourself

Which one to use depends on your requirements.

Web Speech API

The browser’s native speech recognition API. It converts spoken input to text in real time.

Basic usage

// Create a SpeechRecognition instance (Chrome-compatible)
const recognition = new (window.SpeechRecognition || window.webkitSpeechRecognition)();

// Configure
recognition.lang = 'en-US';
recognition.continuous = true;      // keep recognizing
recognition.interimResults = true;  // return interim results

// Get recognition results
recognition.onresult = (event) => {
  const result = event.results[event.results.length - 1];
  const transcript = result[0].transcript;
  const isFinal = result.isFinal;

  if (isFinal) {
    console.log('Final:', transcript);
    // Send to AI API here
  } else {
    console.log('Interim:', transcript);
  }
};

// Error handling
recognition.onerror = (event) => {
  console.error('Recognition error:', event.error);
};

// Start
recognition.start();

Key properties

| Property | Description | Default |
| --- | --- | --- |
| `lang` | Recognition language (`en-US`, `ja-JP`, etc.) | Browser setting |
| `continuous` | Keep recognizing after a result | `false` |
| `interimResults` | Return in-progress results | `false` |
| `maxAlternatives` | Max number of candidate results | `1` |
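When `maxAlternatives` is raised above 1, each recognition result carries several candidate transcripts with confidence scores. As a sketch, a small helper (the name `bestTranscript` is mine, not part of the API) can pick the most confident one; it operates on the shape of `event.results`:

```javascript
// Pick the highest-confidence alternative from the latest result.
// `results` has the shape of event.results: a list of results, each an
// array-like of { transcript, confidence } alternatives.
function bestTranscript(results) {
  const latest = results[results.length - 1];
  let best = latest[0];
  for (let i = 1; i < latest.length; i++) {
    if (latest[i].confidence > best.confidence) {
      best = latest[i];
    }
  }
  return best.transcript;
}

// In an onresult handler, with recognition.maxAlternatives = 3:
//   const transcript = bestTranscript(event.results);
```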

Chrome’s implementation quirk

Web Speech API implementation varies by browser:

| Browser | Processing location | Offline |
| --- | --- | --- |
| Chrome/Edge | Google's servers | No |
| Safari | On-device | Yes |
| Firefox | Not implemented | — |

Chrome sends audio data to Google’s servers for processing. That means:

  • Network connection required
  • Worth noting from a privacy perspective
  • Latency includes a server round-trip

That said, if you’re sending the result to an AI API anyway, you already need a network connection, and the privacy difference is arguably minimal.

Browser support check

const isSupported = 'SpeechRecognition' in window || 'webkitSpeechRecognition' in window;

if (!isSupported) {
  alert('Your browser does not support speech recognition');
}
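Feature detection can also drive an automatic fallback instead of just an alert. Here's a sketch with a hypothetical helper, written as a pure function (the feature flags would come from the `in window` checks above) so the decision logic is easy to test:

```javascript
// Decide the voice-input strategy from feature flags.
// hasSpeechRecognition: 'SpeechRecognition' in window || 'webkitSpeechRecognition' in window
// hasMediaRecorder:     'MediaRecorder' in window
function chooseInputStrategy(hasSpeechRecognition, hasMediaRecorder) {
  if (hasSpeechRecognition) return 'web-speech-api';  // free, zero-setup path
  if (hasMediaRecorder) return 'media-recorder';      // record + send to an API
  return 'unsupported';                               // e.g. show a text input instead
}
```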

Recording with MediaRecorder → API submission

Instead of using Web Speech API, you can record audio yourself and send it to the Whisper API or similar. This gives you finer control. One caveat: calling the API directly from the browser, as in the snippet below, exposes your API key — in production, proxy the request through your own server.

Basic flow

// 1. Get microphone access
const stream = await navigator.mediaDevices.getUserMedia({ audio: true });

// 2. Create MediaRecorder
const mediaRecorder = new MediaRecorder(stream, {
  mimeType: 'audio/webm;codecs=opus'  // browser-compatible format
});

const chunks = [];

// 3. Collect data
mediaRecorder.ondataavailable = (event) => {
  if (event.data.size > 0) {
    chunks.push(event.data);
  }
};

// 4. On stop, create a Blob and send it
mediaRecorder.onstop = async () => {
  const blob = new Blob(chunks, { type: 'audio/webm' });

  // Send to Whisper API
  const formData = new FormData();
  formData.append('file', blob, 'audio.webm');
  formData.append('model', 'whisper-1');

  const response = await fetch('https://api.openai.com/v1/audio/transcriptions', {
    method: 'POST',
    headers: {
      'Authorization': `Bearer ${OPENAI_API_KEY}`
    },
    body: formData
  });

  const result = await response.json();
  console.log('Transcript:', result.text);
};

// Start recording
mediaRecorder.start();

// Stop after 5 seconds
setTimeout(() => mediaRecorder.stop(), 5000);

Real-time streaming (chunked)

Send audio in small slices as it's recorded. One caveat: with MediaRecorder, only the first chunk carries the container header, so later chunks are not standalone files — this pattern suits streaming endpoints that accept a continuous byte stream, not per-chunk transcription calls:

const mediaRecorder = new MediaRecorder(stream);

// Emit data every 1 second
mediaRecorder.start(1000);

mediaRecorder.ondataavailable = async (event) => {
  if (event.data.size > 0) {
    // Send 1-second chunk to API
    await sendToAPI(event.data);
  }
};

Real-time vs. Post-recording submission

WebSocket (real-time)

const socket = new WebSocket('wss://api.example.com/realtime');

// Send audio data in real time
mediaRecorder.ondataavailable = (event) => {
  if (socket.readyState === WebSocket.OPEN) {
    socket.send(event.data);
  }
};

// Receive recognition results
socket.onmessage = (event) => {
  const result = JSON.parse(event.data);
  console.log('Transcript:', result.text);
};

Pros:

  • Low latency
  • Results come back while you’re still talking

Cons:

  • More complex to implement
  • Limited API support (OpenAI Realtime API, etc.)

REST (submit after recording)

// Send all at once after recording is complete
const blob = new Blob(chunks, { type: 'audio/webm' });
const result = await sendToWhisperAPI(blob);

Pros:

  • Simpler to implement
  • Wide API support

Cons:

  • Results don’t come back until recording finishes
  • Longer waits for longer audio

When to use what

| Use case | Recommendation |
| --- | --- |
| Voice chatbot | WebSocket (response speed matters) |
| Voice memos | REST (batch processing is fine) |
| Transcription | REST (accuracy matters) |
| Quick prototype | Web Speech API (easiest) |

Audio Formats

Browser output and API input formats don’t always match.

Browser side (MediaRecorder)

// Check supported formats
MediaRecorder.isTypeSupported('audio/webm;codecs=opus');  // true (Chrome)
MediaRecorder.isTypeSupported('audio/mp4');               // true (Safari)
MediaRecorder.isTypeSupported('audio/wav');               // false (usually)

| Browser | Default format |
| --- | --- |
| Chrome | audio/webm;codecs=opus |
| Firefox | audio/webm;codecs=opus |
| Safari | audio/mp4 |
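Rather than hard-coding a format, you can probe for the first supported one at runtime. In this sketch the support check is injected as a predicate so the logic is testable; in the browser you'd pass `MediaRecorder.isTypeSupported`. The helper name and candidate order are my own choices:

```javascript
// Return the first MIME type the recorder supports, or null if none match.
// `isTypeSupported` is a predicate like MediaRecorder.isTypeSupported.
function pickMimeType(isTypeSupported, candidates = [
  'audio/webm;codecs=opus',  // Chrome / Firefox
  'audio/webm',
  'audio/mp4',               // Safari
]) {
  return candidates.find(isTypeSupported) ?? null;
}

// In the browser:
//   const mimeType = pickMimeType((t) => MediaRecorder.isTypeSupported(t));
//   const recorder = new MediaRecorder(stream, mimeType ? { mimeType } : {});
```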

API format support

| API | Supported formats |
| --- | --- |
| OpenAI Whisper | mp3, mp4, mpeg, mpga, m4a, wav, webm |
| Google Speech-to-Text | FLAC, LINEAR16, MULAW, OGG_OPUS, WEBM_OPUS, etc. |

WebM is supported by most APIs, so sending the browser’s default format usually works fine.

When you need WAV

Some APIs or local processing pipelines require WAV. In that case, conversion is necessary:

// Capture PCM data with AudioContext and convert to WAV
async function recordAsWav(stream, duration) {
  const audioContext = new AudioContext();
  const source = audioContext.createMediaStreamSource(stream);
  // Note: createScriptProcessor is deprecated (AudioWorklet is the modern
  // replacement), but it remains the simplest way to grab raw PCM
  const processor = audioContext.createScriptProcessor(4096, 1, 1);

  const pcmData = [];

  processor.onaudioprocess = (event) => {
    const inputData = event.inputBuffer.getChannelData(0);
    pcmData.push(new Float32Array(inputData));
  };

  source.connect(processor);
  processor.connect(audioContext.destination);

  // Record for the requested duration
  await new Promise(resolve => setTimeout(resolve, duration));

  // Tear down the audio graph
  processor.disconnect();
  source.disconnect();
  await audioContext.close();

  // Convert to WAV
  return createWavBlob(pcmData, audioContext.sampleRate);
}

WAV conversion is tedious — use a WebM-compatible API instead if you can.
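If you do need it, the `createWavBlob` helper referenced above isn't shown — here's a minimal sketch of what it could look like, assuming 16-bit mono little-endian PCM (the standard WAV layout):

```javascript
// Build a 16-bit mono WAV file from Float32Array PCM chunks.
function buildWavBuffer(pcmChunks, sampleRate) {
  const totalSamples = pcmChunks.reduce((n, c) => n + c.length, 0);
  const buffer = new ArrayBuffer(44 + totalSamples * 2);
  const view = new DataView(buffer);

  const writeString = (offset, s) => {
    for (let i = 0; i < s.length; i++) view.setUint8(offset + i, s.charCodeAt(i));
  };

  // RIFF header
  writeString(0, 'RIFF');
  view.setUint32(4, 36 + totalSamples * 2, true);  // file size - 8
  writeString(8, 'WAVE');
  // fmt chunk: PCM, mono, 16-bit
  writeString(12, 'fmt ');
  view.setUint32(16, 16, true);                    // fmt chunk size
  view.setUint16(20, 1, true);                     // audio format: PCM
  view.setUint16(22, 1, true);                     // channels: 1
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, sampleRate * 2, true);        // byte rate
  view.setUint16(32, 2, true);                     // block align
  view.setUint16(34, 16, true);                    // bits per sample
  // data chunk
  writeString(36, 'data');
  view.setUint32(40, totalSamples * 2, true);

  // Clamp floats to [-1, 1] and write little-endian int16 samples
  let offset = 44;
  for (const chunk of pcmChunks) {
    for (const sample of chunk) {
      const s = Math.max(-1, Math.min(1, sample));
      view.setInt16(offset, s < 0 ? s * 0x8000 : s * 0x7FFF, true);
      offset += 2;
    }
  }
  return buffer;
}

function createWavBlob(pcmChunks, sampleRate) {
  return new Blob([buildWavBuffer(pcmChunks, sampleRate)], { type: 'audio/wav' });
}
```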

Pattern 1: Keep it simple

Web Speech API, no question

const recognition = new webkitSpeechRecognition();
recognition.lang = 'en-US';
recognition.onresult = (e) => sendToAI(e.results[0][0].transcript);
recognition.start();

  • Works in 10 lines
  • Free
  • Fine as long as Chrome’s server-side processing doesn’t bother you

Pattern 2: Accuracy first

MediaRecorder + Whisper API

  • Record → send as WebM → high-accuracy result
  • Whisper beats Web Speech API on accuracy
  • Paid but cheap ($0.006/min)

Pattern 3: Real-time first

OpenAI Realtime API or Gemini Live API

  • WebSocket-based bidirectional real-time streaming
  • Higher cost (especially OpenAI)
  • For serious voice chatbot applications

In short:

  • Quick start: Web Speech API
  • Need accuracy: MediaRecorder + Whisper API
  • Real-time conversation: Realtime-class API

Next up: looking into avatar integration (Live2D? VRM?). Once the audio input → AI → audio output pipeline is solid, the next goal is adding a visible avatar.