Building a Talkable AI Environment (2): Voice Input Implementation
In the previous research post, I compared the available voice API options. This time I’ll dig into the implementation side: how to actually capture audio in the browser.
Two Approaches to Voice Input
There are two main ways to implement voice input in a browser:
- Web Speech API — a built-in browser API that handles recognition automatically and converts speech to text
- MediaRecorder — records audio and returns it as a Blob, which you then send to an API yourself
Which one to use depends on your requirements.
Web Speech API
The browser’s native speech recognition API. It converts spoken input to text in real time.
Basic usage
```javascript
// Create a SpeechRecognition instance (Chrome-compatible)
const recognition = new (window.SpeechRecognition || window.webkitSpeechRecognition)();

// Configure
recognition.lang = 'en-US';
recognition.continuous = true;     // keep recognizing
recognition.interimResults = true; // return interim results

// Get recognition results
recognition.onresult = (event) => {
  const result = event.results[event.results.length - 1];
  const transcript = result[0].transcript;
  const isFinal = result.isFinal;

  if (isFinal) {
    console.log('Final:', transcript);
    // Send to AI API here
  } else {
    console.log('Interim:', transcript);
  }
};

// Error handling
recognition.onerror = (event) => {
  console.error('Recognition error:', event.error);
};

// Start
recognition.start();
```
Key properties
| Property | Description | Default |
|---|---|---|
| lang | Recognition language (en-US, ja-JP, etc.) | Browser setting |
| continuous | Keep recognizing after a result | false |
| interimResults | Return in-progress results | false |
| maxAlternatives | Max number of candidate results | 1 |
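When maxAlternatives is raised above 1, each result carries several candidate transcripts. A small helper like the following can pick the most confident one. This is a sketch of my own (pickBestAlternative is not part of the API): it treats the result as a plain array-like of { transcript, confidence } objects, which is how SpeechRecognitionResult behaves, though note that not every browser populates confidence.

```javascript
// Hypothetical helper: given one SpeechRecognitionResult (an array-like of
// alternatives, each with { transcript, confidence }), return the alternative
// with the highest confidence. Only useful when maxAlternatives > 1.
// Note: some browsers leave confidence undefined, so check before relying on it.
function pickBestAlternative(result) {
  let best = result[0];
  for (let i = 1; i < result.length; i++) {
    if (result[i].confidence > best.confidence) {
      best = result[i];
    }
  }
  return best;
}
```

Inside onresult you would call it as `pickBestAlternative(event.results[event.results.length - 1])`.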
Chrome’s implementation quirk
Web Speech API implementation varies by browser:
| Browser | Processing location | Offline |
|---|---|---|
| Chrome/Edge | Google’s servers | No |
| Safari | On-device | Yes |
| Firefox | Not implemented | — |
Chrome sends audio data to Google’s servers for processing. That means:
- Network connection required
- Worth noting from a privacy perspective
- Latency includes a server round-trip
That said, if you’re sending the result to an AI API anyway, you already need a network connection, and the privacy difference is arguably minimal.
Browser support check
```javascript
const isSupported = 'SpeechRecognition' in window || 'webkitSpeechRecognition' in window;

if (!isSupported) {
  alert('Your browser does not support speech recognition');
}
```
Recording with MediaRecorder → API submission
Instead of using Web Speech API, you can record audio yourself and send it to Whisper API or similar. This gives you finer control.
Basic flow
// 1. Get microphone access
const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
// 2. Create MediaRecorder
const mediaRecorder = new MediaRecorder(stream, {
mimeType: 'audio/webm;codecs=opus' // browser-compatible format
});
const chunks = [];
// 3. Collect data
mediaRecorder.ondataavailable = (event) => {
if (event.data.size > 0) {
chunks.push(event.data);
}
};
// 4. On stop, create a Blob and send it
mediaRecorder.onstop = async () => {
const blob = new Blob(chunks, { type: 'audio/webm' });
// Send to Whisper API
const formData = new FormData();
formData.append('file', blob, 'audio.webm');
formData.append('model', 'whisper-1');
const response = await fetch('https://api.openai.com/v1/audio/transcriptions', {
method: 'POST',
headers: {
'Authorization': `Bearer ${OPENAI_API_KEY}`
},
body: formData
});
const result = await response.json();
console.log('Transcript:', result.text);
};
// Start recording
mediaRecorder.start();
// Stop after 5 seconds
setTimeout(() => mediaRecorder.stop(), 5000);
Real-time streaming (chunked)
Send audio in small slices as it’s recorded:
```javascript
const mediaRecorder = new MediaRecorder(stream);

// Emit data every 1 second
mediaRecorder.start(1000);

mediaRecorder.ondataavailable = async (event) => {
  if (event.data.size > 0) {
    // Send 1-second chunk to API
    // Caveat: only the first chunk contains the WebM container header, so
    // later chunks are not standalone playable files. APIs that expect a
    // complete file need the accumulated chunks or a true streaming endpoint.
    await sendToAPI(event.data);
  }
};
```
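One caveat worth sketching a workaround for: with MediaRecorder, only the first emitted chunk carries the container header, so later slices are not standalone files. If your API expects complete files, one option is to accumulate every chunk and resend the whole recording so far. A minimal sketch (makeChunkAccumulator is a hypothetical helper of my own; send stands in for whatever upload function you use):

```javascript
// Accumulate every chunk and rebuild a complete Blob before each send,
// since MediaRecorder slices after the first are not standalone files.
// `send` is your upload function (e.g. a Whisper API wrapper).
function makeChunkAccumulator(send) {
  const chunks = [];
  return async (event) => {
    if (event.data.size > 0) {
      chunks.push(event.data);
      // Rebuild a full WebM file from everything recorded so far
      await send(new Blob(chunks, { type: 'audio/webm' }));
    }
  };
}

// Usage: mediaRecorder.ondataavailable = makeChunkAccumulator(sendToAPI);
```

The trade-off is that each upload grows with the recording length, so this suits short utterances rather than long sessions.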
Real-time vs. Post-recording submission
WebSocket (real-time)
```javascript
const socket = new WebSocket('wss://api.example.com/realtime');

// Send audio data in real time
mediaRecorder.ondataavailable = (event) => {
  if (socket.readyState === WebSocket.OPEN) {
    socket.send(event.data);
  }
};

// Receive recognition results
socket.onmessage = (event) => {
  const result = JSON.parse(event.data);
  console.log('Transcript:', result.text);
};
```
Pros:
- Low latency
- Results come back while you’re still talking
Cons:
- More complex to implement
- Limited API support (OpenAI Realtime API, etc.)
REST (submit after recording)
```javascript
// Send all at once after recording is complete
const blob = new Blob(chunks, { type: 'audio/webm' });
const result = await sendToWhisperAPI(blob);
```
Pros:
- Simpler to implement
- Wide API support
Cons:
- Results don’t come back until recording finishes
- Longer waits for longer audio
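The sendToWhisperAPI helper used above was never defined. Here is a minimal sketch, assuming fetch and FormData are available (any modern browser, Node 18+) and that OPENAI_API_KEY is supplied the same way as in the earlier snippet:

```javascript
// Minimal sketch of the sendToWhisperAPI helper used above.
// Assumes global fetch and FormData (modern browsers, Node 18+).
// OPENAI_API_KEY is a placeholder; in production, proxy this call through
// your own server so the key never reaches the client.
async function sendToWhisperAPI(blob, apiKey = OPENAI_API_KEY) {
  const formData = new FormData();
  formData.append('file', blob, 'audio.webm');
  formData.append('model', 'whisper-1');

  const response = await fetch('https://api.openai.com/v1/audio/transcriptions', {
    method: 'POST',
    headers: { 'Authorization': `Bearer ${apiKey}` },
    body: formData
  });

  if (!response.ok) {
    throw new Error(`Whisper API error: ${response.status}`);
  }
  return (await response.json()).text; // the transcript string
}
```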
When to use what
| Use case | Recommendation |
|---|---|
| Voice chatbot | WebSocket (response speed matters) |
| Voice memos | REST (batch processing is fine) |
| Transcription | REST (accuracy matters) |
| Quick prototype | Web Speech API (easiest) |
Audio Formats
Browser output and API input formats don’t always match.
Browser side (MediaRecorder)
// Check supported formats
MediaRecorder.isTypeSupported('audio/webm;codecs=opus'); // true (Chrome)
MediaRecorder.isTypeSupported('audio/mp4'); // true (Safari)
MediaRecorder.isTypeSupported('audio/wav'); // false (usually)
| Browser | Default format |
|---|---|
| Chrome | audio/webm;codecs=opus |
| Firefox | audio/webm;codecs=opus |
| Safari | audio/mp4 |
API format support
| API | Supported formats |
|---|---|
| OpenAI Whisper | mp3, mp4, mpeg, mpga, m4a, wav, webm |
| Google Speech-to-Text | FLAC, LINEAR16, MULAW, OGG_OPUS, WEBM_OPUS, etc. |
WebM is supported by most APIs, so sending the browser’s default format usually works fine.
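To put the format tables into practice, you can probe for the first mimeType the current browser can record. This helper is a sketch of my own (the name and candidate order are assumptions matching the table above); the isSupported parameter exists only so the function can be exercised outside a browser:

```javascript
// Pick the first recording format the current browser supports.
// Candidate order (WebM/Opus first, then MP4 for Safari) matches the
// browser table above; adjust it for the APIs you target.
function pickRecordingMimeType(
  candidates = ['audio/webm;codecs=opus', 'audio/webm', 'audio/mp4'],
  isSupported = (type) => MediaRecorder.isTypeSupported(type)
) {
  for (const type of candidates) {
    if (isSupported(type)) return type;
  }
  return ''; // empty string lets MediaRecorder choose its own default
}
```

Usage: `new MediaRecorder(stream, { mimeType: pickRecordingMimeType() })`.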
When you need WAV
Some APIs or local processing pipelines require WAV. In that case, conversion is necessary:
```javascript
// Capture PCM data with AudioContext and convert to WAV
// (createScriptProcessor is deprecated in favor of AudioWorklet,
// but it remains the simplest way to sketch this)
async function recordAsWav(stream, duration) {
  const audioContext = new AudioContext();
  const source = audioContext.createMediaStreamSource(stream);
  const processor = audioContext.createScriptProcessor(4096, 1, 1);

  const pcmData = [];
  processor.onaudioprocess = (event) => {
    const inputData = event.inputBuffer.getChannelData(0);
    pcmData.push(new Float32Array(inputData));
  };

  source.connect(processor);
  processor.connect(audioContext.destination);

  // Record
  await new Promise(resolve => setTimeout(resolve, duration));

  // Convert to WAV (createWavBlob wraps the PCM data in a WAV header)
  return createWavBlob(pcmData, audioContext.sampleRate);
}
```
WAV conversion is tedious — use a WebM-compatible API instead if you can.
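For completeness, here is one way the createWavBlob helper referenced above could look: it flattens the Float32 chunks, writes a standard 44-byte mono WAV header, and converts the samples to 16-bit PCM. A sketch, not a hardened implementation:

```javascript
// Sketch of createWavBlob: concatenate the captured Float32 PCM chunks,
// convert to 16-bit little-endian samples, and prepend a 44-byte WAV
// header (mono, 16-bit).
function createWavBlob(pcmChunks, sampleRate) {
  // Flatten chunks into one Float32Array
  const total = pcmChunks.reduce((n, c) => n + c.length, 0);
  const samples = new Float32Array(total);
  let offset = 0;
  for (const chunk of pcmChunks) {
    samples.set(chunk, offset);
    offset += chunk.length;
  }

  const buffer = new ArrayBuffer(44 + samples.length * 2);
  const view = new DataView(buffer);
  const writeString = (pos, str) => {
    for (let i = 0; i < str.length; i++) view.setUint8(pos + i, str.charCodeAt(i));
  };

  writeString(0, 'RIFF');
  view.setUint32(4, 36 + samples.length * 2, true); // file size minus 8
  writeString(8, 'WAVE');
  writeString(12, 'fmt ');
  view.setUint32(16, 16, true);  // fmt chunk size
  view.setUint16(20, 1, true);   // PCM format
  view.setUint16(22, 1, true);   // mono
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, sampleRate * 2, true); // byte rate
  view.setUint16(32, 2, true);   // block align
  view.setUint16(34, 16, true);  // bits per sample
  writeString(36, 'data');
  view.setUint32(40, samples.length * 2, true);

  // Clamp floats to [-1, 1] and scale to 16-bit integers
  for (let i = 0; i < samples.length; i++) {
    const s = Math.max(-1, Math.min(1, samples[i]));
    view.setInt16(44 + i * 2, s < 0 ? s * 0x8000 : s * 0x7FFF, true);
  }

  return new Blob([buffer], { type: 'audio/wav' });
}
```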
Recommended Setup by Pattern
Pattern 1: Keep it simple
Web Speech API, no question
```javascript
const recognition = new webkitSpeechRecognition();
recognition.lang = 'en-US';
recognition.onresult = (e) => sendToAI(e.results[0][0].transcript);
recognition.start();
```
- Works in 10 lines
- Free
- Fine as long as Chrome’s server-side processing doesn’t bother you
Pattern 2: Accuracy first
MediaRecorder + Whisper API
- Record → send as WebM → high-accuracy result
- Whisper beats Web Speech API on accuracy
- Paid but cheap ($0.006/min)
Pattern 3: Real-time first
OpenAI Realtime API or Gemini Live API
- WebSocket-based bidirectional real-time streaming
- Higher cost (especially OpenAI)
- For serious voice chatbot applications
Summary

- Quick start: Web Speech API
- Need accuracy: MediaRecorder + Whisper API
- Real-time conversation: Realtime-class API
Next up: looking into avatar integration (Live2D? VRM?). Once the audio input → AI → audio output pipeline is solid, the next goal is adding a visible avatar.