Building a Talkable AI Environment (2): Voice Input Implementation
In the previous research post, I compared the available voice API options. This time I’ll dig into the implementation side: how to actually capture audio in the browser.
Two Approaches to Voice Input
There are two main ways to implement voice input in a browser:
- Web Speech API — a built-in browser API that handles recognition automatically and converts speech to text
- MediaRecorder — records audio and returns it as a Blob, which you then send to an API yourself
Which one to use depends on your requirements.
Web Speech API
The browser’s native speech recognition API. It converts spoken input to text in real time.
Basic usage
```javascript
// Create a SpeechRecognition instance (Chrome-compatible)
const recognition = new (window.SpeechRecognition || window.webkitSpeechRecognition)();

// Configure
recognition.lang = 'en-US';
recognition.continuous = true;     // keep recognizing
recognition.interimResults = true; // return interim results

// Get recognition results
recognition.onresult = (event) => {
  const result = event.results[event.results.length - 1];
  const transcript = result[0].transcript;
  const isFinal = result.isFinal;

  if (isFinal) {
    console.log('Final:', transcript);
    // Send to AI API here
  } else {
    console.log('Interim:', transcript);
  }
};

// Error handling
recognition.onerror = (event) => {
  console.error('Recognition error:', event.error);
};

// Start
recognition.start();
```
Key properties
| Property | Description | Default |
|---|---|---|
| lang | Recognition language (en-US, ja-JP, etc.) | Browser setting |
| continuous | Keep recognizing after a result | false |
| interimResults | Return in-progress results | false |
| maxAlternatives | Max number of candidate results | 1 |
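When maxAlternatives is raised above 1, each result carries several candidate transcripts. A small helper like the following can pick the most confident one. This is a sketch of my own (pickBestAlternative is not part of the API): it treats the result as a plain array-like of { transcript, confidence } objects, which is how SpeechRecognitionResult behaves, though note that not every browser populates confidence.

```javascript
// Hypothetical helper: given one SpeechRecognitionResult (an array-like of
// alternatives, each with { transcript, confidence }), return the alternative
// with the highest confidence. Only useful when maxAlternatives > 1.
// Note: some browsers leave confidence undefined, so check before relying on it.
function pickBestAlternative(result) {
  let best = result[0];
  for (let i = 1; i < result.length; i++) {
    if (result[i].confidence > best.confidence) {
      best = result[i];
    }
  }
  return best;
}
```

Inside onresult you would call it as `pickBestAlternative(event.results[event.results.length - 1])`.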
Chrome’s implementation quirk
Web Speech API implementation varies by browser:
| Browser | Processing location | Offline |
|---|---|---|
| Chrome/Edge | Google’s servers | No |
| Safari | On-device | Yes |
| Firefox | Not implemented | — |
Chrome sends audio data to Google’s servers for processing. That means:
- Network connection required
- Worth noting from a privacy perspective
- Latency includes a server round-trip
That said, if you’re sending the result to an AI API anyway, you already need a network connection, and the privacy difference is arguably minimal.
Browser support check
```javascript
const isSupported = 'SpeechRecognition' in window || 'webkitSpeechRecognition' in window;

if (!isSupported) {
  alert('Your browser does not support speech recognition');
}
```
Recording with MediaRecorder → API submission
Instead of using Web Speech API, you can record audio yourself and send it to Whisper API or similar. This gives you finer control.
Basic flow
// 1. Get microphone access
const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
// 2. Create MediaRecorder
const mediaRecorder = new MediaRecorder(stream, {
mimeType: 'audio/webm;codecs=opus' // browser-compatible format
});
const chunks = [];
// 3. Collect data
mediaRecorder.ondataavailable = (event) => {
if (event.data.size > 0) {
chunks.push(event.data);
}
};
// 4. On stop, create a Blob and send it
mediaRecorder.onstop = async () => {
const blob = new Blob(chunks, { type: 'audio/webm' });
// Send to Whisper API
const formData = new FormData();
formData.append('file', blob, 'audio.webm');
formData.append('model', 'whisper-1');
const response = await fetch('https://api.openai.com/v1/audio/transcriptions', {
method: 'POST',
headers: {
'Authorization': `Bearer ${OPENAI_API_KEY}`
},
body: formData
});
const result = await response.json();
console.log('Transcript:', result.text);
};
// Start recording
mediaRecorder.start();
// Stop after 5 seconds
setTimeout(() => mediaRecorder.stop(), 5000);
Real-time streaming (chunked)
Send audio in small slices as it’s recorded:
```javascript
const mediaRecorder = new MediaRecorder(stream);

// Emit data every 1 second
mediaRecorder.start(1000);

mediaRecorder.ondataavailable = async (event) => {
  if (event.data.size > 0) {
    // Send 1-second chunk to API
    // Caveat: only the first chunk contains the WebM container header, so
    // later chunks are not standalone playable files. APIs that expect a
    // complete file need the accumulated chunks or a true streaming endpoint.
    await sendToAPI(event.data);
  }
};
```
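One caveat worth sketching a workaround for: with MediaRecorder, only the first emitted chunk carries the container header, so later slices are not standalone files. If your API expects complete files, one option is to accumulate every chunk and resend the whole recording so far. A minimal sketch (makeChunkAccumulator is a hypothetical helper of my own; send stands in for whatever upload function you use):

```javascript
// Accumulate every chunk and rebuild a complete Blob before each send,
// since MediaRecorder slices after the first are not standalone files.
// `send` is your upload function (e.g. a Whisper API wrapper).
function makeChunkAccumulator(send) {
  const chunks = [];
  return async (event) => {
    if (event.data.size > 0) {
      chunks.push(event.data);
      // Rebuild a full WebM file from everything recorded so far
      await send(new Blob(chunks, { type: 'audio/webm' }));
    }
  };
}

// Usage: mediaRecorder.ondataavailable = makeChunkAccumulator(sendToAPI);
```

The trade-off is that each upload grows with the recording length, so this suits short utterances rather than long sessions.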
Real-time vs. Post-recording submission
WebSocket (real-time)
```javascript
const socket = new WebSocket('wss://api.example.com/realtime');

// Send audio data in real time
mediaRecorder.ondataavailable = (event) => {
  if (socket.readyState === WebSocket.OPEN) {
    socket.send(event.data);
  }
};

// Receive recognition results
socket.onmessage = (event) => {
  const result = JSON.parse(event.data);
  console.log('Transcript:', result.text);
};
```
Pros:
- Low latency
- Results come back while you’re still talking
Cons:
- More complex to implement
- Limited API support (OpenAI Realtime API, etc.)
REST (submit after recording)
```javascript
// Send all at once after recording is complete
const blob = new Blob(chunks, { type: 'audio/webm' });
const result = await sendToWhisperAPI(blob);
```
Pros:
- Simpler to implement
- Wide API support
Cons:
- Results don’t come back until recording finishes
- Longer waits for longer audio
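The sendToWhisperAPI helper used above was never defined. Here is a minimal sketch, assuming fetch and FormData are available (any modern browser, Node 18+) and that OPENAI_API_KEY is supplied the same way as in the earlier snippet:

```javascript
// Minimal sketch of the sendToWhisperAPI helper used above.
// Assumes global fetch and FormData (modern browsers, Node 18+).
// OPENAI_API_KEY is a placeholder; in production, proxy this call through
// your own server so the key never reaches the client.
async function sendToWhisperAPI(blob, apiKey = OPENAI_API_KEY) {
  const formData = new FormData();
  formData.append('file', blob, 'audio.webm');
  formData.append('model', 'whisper-1');

  const response = await fetch('https://api.openai.com/v1/audio/transcriptions', {
    method: 'POST',
    headers: { 'Authorization': `Bearer ${apiKey}` },
    body: formData
  });

  if (!response.ok) {
    throw new Error(`Whisper API error: ${response.status}`);
  }
  return (await response.json()).text; // the transcript string
}
```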
When to use what
| Use case | Recommendation |
|---|---|
| Voice chatbot | WebSocket (response speed matters) |
| Voice memos | REST (batch processing is fine) |
| Transcription | REST (accuracy matters) |
| Quick prototype | Web Speech API (easiest) |
Audio Formats
Browser output and API input formats don’t always match.
Browser side (MediaRecorder)
// Check supported formats
MediaRecorder.isTypeSupported('audio/webm;codecs=opus'); // true (Chrome)
MediaRecorder.isTypeSupported('audio/mp4'); // true (Safari)
MediaRecorder.isTypeSupported('audio/wav'); // false (usually)
| Browser | Default format |
|---|---|
| Chrome | audio/webm;codecs=opus |
| Firefox | audio/webm;codecs=opus |
| Safari | audio/mp4 |
API format support
| API | Supported formats |
|---|---|
| OpenAI Whisper | mp3, mp4, mpeg, mpga, m4a, wav, webm |
| Google Speech-to-Text | FLAC, LINEAR16, MULAW, OGG_OPUS, WEBM_OPUS, etc. |
WebM is supported by most APIs, so sending the browser’s default format usually works fine.
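To put the format tables into practice, you can probe for the first mimeType the current browser can record. This helper is a sketch of my own (the name and candidate order are assumptions matching the table above); the isSupported parameter exists only so the function can be exercised outside a browser:

```javascript
// Pick the first recording format the current browser supports.
// Candidate order (WebM/Opus first, then MP4 for Safari) matches the
// browser table above; adjust it for the APIs you target.
function pickRecordingMimeType(
  candidates = ['audio/webm;codecs=opus', 'audio/webm', 'audio/mp4'],
  isSupported = (type) => MediaRecorder.isTypeSupported(type)
) {
  for (const type of candidates) {
    if (isSupported(type)) return type;
  }
  return ''; // empty string lets MediaRecorder choose its own default
}
```

Usage: `new MediaRecorder(stream, { mimeType: pickRecordingMimeType() })`.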
When you need WAV
Some APIs or local processing pipelines require WAV. In that case, conversion is necessary:
```javascript
// Capture PCM data with AudioContext and convert to WAV
// (createScriptProcessor is deprecated in favor of AudioWorklet,
// but it remains the simplest way to sketch this)
async function recordAsWav(stream, duration) {
  const audioContext = new AudioContext();
  const source = audioContext.createMediaStreamSource(stream);
  const processor = audioContext.createScriptProcessor(4096, 1, 1);

  const pcmData = [];
  processor.onaudioprocess = (event) => {
    const inputData = event.inputBuffer.getChannelData(0);
    pcmData.push(new Float32Array(inputData));
  };

  source.connect(processor);
  processor.connect(audioContext.destination);

  // Record
  await new Promise(resolve => setTimeout(resolve, duration));

  // Convert to WAV (createWavBlob wraps the PCM data in a WAV header)
  return createWavBlob(pcmData, audioContext.sampleRate);
}
```
WAV conversion is tedious — use a WebM-compatible API instead if you can.
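For completeness, here is one way the createWavBlob helper referenced above could look: it flattens the Float32 chunks, writes a standard 44-byte mono WAV header, and converts the samples to 16-bit PCM. A sketch, not a hardened implementation:

```javascript
// Sketch of createWavBlob: concatenate the captured Float32 PCM chunks,
// convert to 16-bit little-endian samples, and prepend a 44-byte WAV
// header (mono, 16-bit).
function createWavBlob(pcmChunks, sampleRate) {
  // Flatten chunks into one Float32Array
  const total = pcmChunks.reduce((n, c) => n + c.length, 0);
  const samples = new Float32Array(total);
  let offset = 0;
  for (const chunk of pcmChunks) {
    samples.set(chunk, offset);
    offset += chunk.length;
  }

  const buffer = new ArrayBuffer(44 + samples.length * 2);
  const view = new DataView(buffer);
  const writeString = (pos, str) => {
    for (let i = 0; i < str.length; i++) view.setUint8(pos + i, str.charCodeAt(i));
  };

  writeString(0, 'RIFF');
  view.setUint32(4, 36 + samples.length * 2, true); // file size minus 8
  writeString(8, 'WAVE');
  writeString(12, 'fmt ');
  view.setUint32(16, 16, true);  // fmt chunk size
  view.setUint16(20, 1, true);   // PCM format
  view.setUint16(22, 1, true);   // mono
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, sampleRate * 2, true); // byte rate
  view.setUint16(32, 2, true);   // block align
  view.setUint16(34, 16, true);  // bits per sample
  writeString(36, 'data');
  view.setUint32(40, samples.length * 2, true);

  // Clamp floats to [-1, 1] and scale to 16-bit integers
  for (let i = 0; i < samples.length; i++) {
    const s = Math.max(-1, Math.min(1, samples[i]));
    view.setInt16(44 + i * 2, s < 0 ? s * 0x8000 : s * 0x7FFF, true);
  }

  return new Blob([buffer], { type: 'audio/wav' });
}
```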
Recommended Setup by Pattern
Pattern 1: Keep it simple
Web Speech API, no question
```javascript
const recognition = new webkitSpeechRecognition();
recognition.lang = 'en-US';
recognition.onresult = (e) => sendToAI(e.results[0][0].transcript);
recognition.start();
```
- Works in 10 lines
- Free
- Fine as long as Chrome’s server-side processing doesn’t bother you
Pattern 2: Accuracy first
MediaRecorder + Whisper API
- Record → send as WebM → high-accuracy result
- Whisper beats Web Speech API on accuracy
- Paid but cheap ($0.006/min)
Pattern 3: Real-time first
OpenAI Realtime API or Gemini Live API
- WebSocket-based bidirectional real-time streaming
- Higher cost (especially OpenAI)
- For serious voice chatbot applications
Summary

- Quick start: Web Speech API
- Need accuracy: MediaRecorder + Whisper API
- Real-time conversation: Realtime-class API
Next up: looking into avatar integration (Live2D? VRM?). Once the audio input → AI → audio output pipeline is solid, the next goal is adding a visible avatar.