
Why WebRTC Audio Won't Work with the SpeechRecognition API — and What to Do About It

I was building a WebRTC voice call for a side project and thought, “wouldn’t it be cool to add auto-translation?” The plan seemed simple: receive audio over WebRTC, convert it to text with the SpeechRecognition API, then send it to a translation API.

Turns out you can’t pass a WebRTC MediaStream to the SpeechRecognition API. After researching and experimenting, I found three workable approaches, so let me lay them out here.

This blog also has articles on WebRTC signaling via QR code (used in a P2P voice chat tool built on DataChannel) and on stabilizing the Web Speech API on iOS; both are worth reading alongside this one.

The Core Problem: The MediaStream Wall

The SpeechRecognition API only recognizes audio from the local microphone.

// This won't work
peerConnection.ontrack = (event) => {
  const remoteStream = event.streams[0];
  const recognition = new webkitSpeechRecognition();

  // There's no way to pass remoteStream to SpeechRecognition
  recognition.start(); // Only picks up the local microphone
};

A local MediaStream from getUserMedia() works fine, but a remote MediaStream received via RTCPeerConnection.ontrack is out of scope.

Why? By spec, the SpeechRecognition API targets only the user’s audio input device (the microphone). There are security reasons for this, and the browser implementation assumes direct access to the local mic.

In other words, “recognize the remote party’s audio on this side” doesn’t work. You need to rethink the approach.

Approach A: Remote-Side Recognition (Simplest)

Perform speech recognition on the speaking side. Send the transcribed text via DataChannel.

// Sender (the speaker)
const recognition = new webkitSpeechRecognition();
recognition.continuous = true;
recognition.interimResults = false; // final results only
recognition.lang = 'ja-JP';

recognition.onresult = (event) => {
  for (let i = event.resultIndex; i < event.results.length; i++) {
    if (event.results[i].isFinal) {
      const text = event.results[i][0].transcript;

      // Send text via DataChannel
      dataChannel.send(JSON.stringify({
        type: 'transcript',
        lang: 'ja',
        text: text
      }));
    }
  }
};

// Receiver
dataChannel.onmessage = (event) => {
  const data = JSON.parse(event.data);
  if (data.type === 'transcript') {
    // Pass to translation API and display the result
    translateAndDisplay(data.text, data.lang);
  }
};
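The receiver above calls translateAndDisplay, which the snippets leave undefined. A minimal sketch using the free MyMemory API mentioned later in this post (endpoint and response shape per its public docs; the `transcript` element ID is a placeholder):

```javascript
// Build the MyMemory request URL (pure, so it can be tested offline)
function buildMyMemoryUrl(text, sourceLang, targetLang) {
  const pair = `${sourceLang}|${targetLang}`;
  return 'https://api.mymemory.translated.net/get'
    + `?q=${encodeURIComponent(text)}&langpair=${encodeURIComponent(pair)}`;
}

// Hypothetical translateAndDisplay: translate, then render into a
// placeholder #transcript element
async function translateAndDisplay(text, sourceLang, targetLang = 'en') {
  const res = await fetch(buildMyMemoryUrl(text, sourceLang, targetLang));
  const json = await res.json();
  const translated = json.responseData.translatedText; // MyMemory response shape
  document.getElementById('transcript').textContent = translated;
}
```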

Architecture:

[User A] Mic → SpeechRecognition → Text
                                     │
                                     │ DataChannel (text only)
                                     ▼
[User B] ← Translation result ← Translation API ← Received text

Pros:

  • Simple to implement
  • No extra cost (Web Speech API is free)
  • Low latency (only text is transmitted)
  • No server needed (fully P2P)

Cons:

  • Depends on the other party’s browser
  • Recognition quality is up to the browser
  • May be unstable on iOS Safari (more on this below)

For prototypes or casual use, this is more than enough.

Approach B: Server-Side Recognition (Higher Accuracy)

Send audio to a server and recognize it with Whisper API or similar.

// Record audio with MediaRecorder
const mediaRecorder = new MediaRecorder(localStream);
const socket = new WebSocket('wss://your-server.com/transcribe');

mediaRecorder.ondataavailable = (event) => {
  if (event.data.size > 0) {
    // Send audio chunk to server
    socket.send(event.data);
  }
};

mediaRecorder.start(1000); // Send chunks every 1 second

// Receive transcription results (the server calls the Whisper API)
socket.onmessage = (event) => {
  const { text } = JSON.parse(event.data);
  translateAndDisplay(text);
};
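The server half is only hinted at above. A rough Node 18+ sketch, assuming the `ws` package, an `OPENAI_API_KEY` environment variable, and OpenAI's Whisper transcription endpoint; a sketch, not production-ready:

```javascript
// Decide when enough chunks have accumulated to transcribe (pure, testable)
function shouldFlush(chunks, maxChunks = 5) {
  return chunks.length >= maxChunks;
}

function startServer(port = 8080) {
  const { WebSocketServer } = require('ws'); // npm install ws
  const wss = new WebSocketServer({ port });

  wss.on('connection', (socket) => {
    let chunks = [];

    socket.on('message', async (data) => {
      chunks.push(data);
      if (!shouldFlush(chunks)) return;

      // Caveat: MediaRecorder chunks are only decodable from the start of
      // the stream; in practice, restart the recorder per batch client-side.
      const audio = Buffer.concat(chunks);
      chunks = [];

      // OpenAI Whisper transcription endpoint (multipart form, "whisper-1")
      const form = new FormData();
      form.append('file', new Blob([audio], { type: 'audio/webm' }), 'chunk.webm');
      form.append('model', 'whisper-1');

      const res = await fetch('https://api.openai.com/v1/audio/transcriptions', {
        method: 'POST',
        headers: { Authorization: `Bearer ${process.env.OPENAI_API_KEY}` },
        body: form,
      });
      const { text } = await res.json();
      socket.send(JSON.stringify({ text }));
    });
  });
}
```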

Pros:

  • High accuracy (can use commercial services like Whisper API)
  • Choice of language model
  • Logging and analysis possible
  • No browser dependency

Cons:

  • Costs money (pay-per-use)
  • Higher latency (audio upload + API processing)
  • Server management required
  • WebSocket connection management required

Speech recognition API comparison:

| API | Price | Accuracy | Latency |
| --- | --- | --- | --- |
| Web Speech API | Free | Medium | Low |
| Whisper API | $0.006/min | High | Medium |
| Google Speech-to-Text | $0.016/min | High | Low |

For business use or meeting transcription where accuracy matters, this is the better choice.

Approach C: Capturing Audio with AudioContext (Not Recommended)

Capturing a MediaStream with AudioContext and sending the audio to an external API is theoretically possible.

// Experimental: getting audio data via AudioContext
const audioContext = new AudioContext();
const source = audioContext.createMediaStreamSource(remoteStream);
const processor = audioContext.createScriptProcessor(4096, 1, 1);

source.connect(processor);
processor.connect(audioContext.destination);

processor.onaudioprocess = (event) => {
  const audioData = event.inputBuffer.getChannelData(0);

  // This PCM data needs to go somewhere, but...
  // Can't pass it to SpeechRecognition API
  // → You still need an external API
};
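If you do go down this road, the Float32 samples from onaudioprocess usually need converting to 16-bit PCM before uploading to a transcription API; the conversion itself is standard:

```javascript
// Convert Web Audio Float32 samples (-1..1) to 16-bit PCM, the raw format
// most transcription APIs accept for audio uploads
function floatTo16BitPCM(float32) {
  const out = new Int16Array(float32.length);
  for (let i = 0; i < float32.length; i++) {
    const s = Math.max(-1, Math.min(1, float32[i])); // clamp out-of-range samples
    out[i] = s < 0 ? s * 0x8000 : s * 0x7fff;
  }
  return out;
}
```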

Problems:

  • ScriptProcessorNode is deprecated (should use AudioWorklet, which is even more complex)
  • You still need an external API to send PCM data
  • High implementation cost for the same result as Approach B

Documenting this for reference, but I don’t recommend it. Just go with Approach B.

Push-to-Talk Implementation

Continuous recognition can produce spurious results and drains the battery. Using a Push-to-Talk (PTT) button to toggle recognition is more practical.

const recognition = new webkitSpeechRecognition();
recognition.continuous = true;
recognition.interimResults = false;
recognition.lang = 'ja-JP';

const pttButton = document.getElementById('ptt-btn');
let currentState = 'idle'; // idle / starting / listening / processing

pttButton.addEventListener('pointerdown', () => {
  currentState = 'starting';
  updateUI(); // Show "Getting ready..."
  recognition.start();
});

recognition.onstart = () => {
  currentState = 'listening';
  updateUI(); // Show "Go ahead"
};

pttButton.addEventListener('pointerup', () => {
  if (currentState === 'listening') {
    currentState = 'processing';
    updateUI(); // Show "Recognizing..."
    recognition.stop();
  }
});

recognition.onresult = (event) => {
  // Text sending logic
};

recognition.onend = () => {
  currentState = 'idle';
  updateUI(); // Show "Hold button to speak"
};

function updateUI() {
  const messages = {
    idle: '🎤 Hold button to speak',
    starting: '⏳ Getting ready...',
    listening: '🔴 Go ahead',
    processing: '💭 Recognizing...'
  };
  document.getElementById('status').textContent = messages[currentState];
  pttButton.dataset.state = currentState;
}

Key points for state management:

  • Having a starting state lets you tell the user “not yet” until onstart fires
  • Using pointerdown / pointerup handles both touch and mouse
  • Showing “Go ahead” only after onstart fires accounts for iOS’s startup lag

See Stabilizing the Web Speech API on iOS for details.

iOS-Specific Problems and Solutions

Problem 1: Beeping Sound When Creating New Instances

On iOS, calling new webkitSpeechRecognition() each time triggers the startup sound every time, and may even bring up the permission dialog repeatedly.

Solution: Singleton pattern

// Bad: creates a new instance every time → dialog + beep every time
pttButton.addEventListener('pointerdown', () => {
  const recognition = new webkitSpeechRecognition(); // Problem here
  recognition.start();
});

// Good: reuse the same instance
const recognition = new webkitSpeechRecognition();
recognition.continuous = true;
recognition.interimResults = false;
recognition.lang = 'ja-JP';

pttButton.addEventListener('pointerdown', () => {
  recognition.start(); // Same instance
});

Reusing the instance means the permission dialog only appears the first time, and subsequent uses start silently and smoothly.

Problem 2: Microphone Conflict with WebRTC

On iOS, getUserMedia() and SpeechRecognition are treated as separate microphone sessions. One may block the other, or the permission dialog may appear every time.

Solution: Pause with track.enabled

pttButton.addEventListener('pointerdown', () => {
  // Disable the local stream's audio track
  localStream.getAudioTracks().forEach(track => {
    track.enabled = false;
  });

  recognition.start();
});

pttButton.addEventListener('pointerup', () => {
  recognition.stop();
});

recognition.onend = () => {
  // Re-enable the track
  localStream.getAudioTracks().forEach(track => {
    track.enabled = true;
  });
};

While you’re in PTT mode the WebRTC audio goes silent, but since you’re the one speaking at that moment it doesn’t matter — it actually feels natural.

iOS-specific issues summary:

| Problem | Cause | Solution |
| --- | --- | --- |
| Beeping sound | Creating new instances | Singleton |
| Microphone conflict | Simultaneous use with WebRTC | track.enabled = false |
| First recognition fails | Insufficient warm-up | Run getUserMedia first |
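The warm-up fix in the last row can be sketched as follows. warmUpMicrophone is a hypothetical helper, and the user-agent check is only a rough heuristic (iPadOS Safari may report a desktop Macintosh UA):

```javascript
// Rough iOS detection from a user-agent string (heuristic only)
function isIOSSafari(ua) {
  return /iP(hone|ad|od)/.test(ua) && /Safari/.test(ua) && !/CriOS|FxiOS/.test(ua);
}

// Hypothetical warm-up helper: grab the mic once and release it immediately,
// so the permission prompt and audio pipeline are primed before the first
// recognition.start()
async function warmUpMicrophone() {
  try {
    const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
    stream.getTracks().forEach((track) => track.stop());
    return true;
  } catch (err) {
    console.warn('Mic warm-up failed:', err.name);
    return false;
  }
}

// Usage: if (isIOSSafari(navigator.userAgent)) await warmUpMicrophone();
```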

See Stabilizing the Web Speech API on iOS for details.

Translation API Options

Once you have the text, the next step is translation. A few options:

| API | Free tier | Accuracy | Latency | Limits |
| --- | --- | --- | --- | --- |
| DeepL | 500K chars/month | High | Low | Registration required |
| Google Translate | Unofficial only | Medium | Low | Unstable |
| MyMemory | 5,000 chars/day | Low | Low | Per-IP limit |
| OpenAI GPT | $0.002/1K tokens | High | Medium | API key required |

Recommendations:

  • Prototype: MyMemory (free, lenient limits)
  • Production: DeepL (excellent Japanese-English quality, fast responses)
  • Multi-language: Google Translate (widest language coverage)
  • Context-aware: OpenAI GPT (understands conversational context)

This blog’s text translation tool uses the MyMemory API — the most practical option for free testing.

For real-time conversation, DeepL feels best in practice. It handles Japanese well and responds quickly.
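For illustration, a hedged sketch of a DeepL call. The endpoint and body follow DeepL's v2 REST API (the api-free host is for the free tier); key management is up to you:

```javascript
// Build the request (pure, testable)
function buildDeepLRequest(text, targetLang, apiKey) {
  return {
    url: 'https://api-free.deepl.com/v2/translate',
    options: {
      method: 'POST',
      headers: {
        'Authorization': `DeepL-Auth-Key ${apiKey}`,
        'Content-Type': 'application/json',
      },
      body: JSON.stringify({ text: [text], target_lang: targetLang }),
    },
  };
}

async function translateWithDeepL(text, targetLang, apiKey) {
  const { url, options } = buildDeepLRequest(text, targetLang, apiKey);
  const res = await fetch(url, options);
  const json = await res.json();
  return json.translations[0].text; // DeepL response shape
}
```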

Things to watch out for:

  • Don’t send interim results to the translation API — use only isFinal to avoid hammering it
  • Detecting natural speech boundaries is trickier than it sounds
  • If you’re sending text via DataChannel, settle on a protocol design upfront (e.g., {type, lang, text, isFinal})
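The protocol suggestion in the last bullet can be pinned down with a pair of small helpers (the field set {type, lang, text, isFinal} is just the example above, not a standard):

```javascript
// Serialize a transcript message for DataChannel transport
function makeTranscriptMessage(text, lang, isFinal = true) {
  return JSON.stringify({ type: 'transcript', lang, text, isFinal });
}

// Parse and validate an incoming message; returns null on anything malformed
function parseTranscriptMessage(raw) {
  let data;
  try {
    data = JSON.parse(raw);
  } catch {
    return null;
  }
  if (data.type !== 'transcript' || typeof data.text !== 'string') return null;
  return data;
}
```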

Comparing the Three Approaches

How to choose between them.

Decision flow

START

[Q1] Want to avoid costs?
  ├─ YES → Approach A (remote-side recognition)
  └─ NO → Q2

[Q2] Do you need high accuracy?
  ├─ YES → Approach B (server-side recognition)
  └─ NO → Approach A is sufficient

Use case recommendations

| Use case | Recommended approach | Why |
| --- | --- | --- |
| Casual 1-on-1 calls | A (remote-side) | No cost, sufficient accuracy |
| Business meetings | B (server-side) | High accuracy, logging |
| Prototype | A (remote-side) | Fastest to implement |
| Multi-language required | B (server-side) | Flexible model choice |

Trade-offs

| Approach | Cost | Accuracy | Latency | Implementation |
| --- | --- | --- | --- | --- |
| A: Remote-side | Free | Medium | Low | Low |
| B: Server-side | Paid | High | Medium | Medium |
| C: AudioContext | Paid | High | Medium | High |

A practical approach is to start with Approach A and migrate to B if accuracy becomes insufficient.

Implementation Checklist

Things to verify when implementing:

Environment & permissions:

  • Running over HTTPS (except localhost)
  • Microphone permission acquired via user gesture
  • SpeechRecognition created as a singleton on iOS
  • WebRTC tracks temporarily disabled during PTT

Error handling:

  • Handle no-speech error (silence detected)
  • Handle audio-capture error (mic acquisition failure)
  • Handle not-allowed error (permission denied)
  • State recovery on DataChannel disconnect
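The three recognition error codes above can be handled in one place; a sketch (the codes are standard SpeechRecognitionErrorEvent values, showStatus is an assumed UI helper):

```javascript
// Map SpeechRecognition error codes (per the Web Speech API spec) to
// user-facing messages
function describeRecognitionError(code) {
  const messages = {
    'no-speech': 'No speech detected. Hold the button and try again.',
    'audio-capture': 'Could not access the microphone.',
    'not-allowed': 'Microphone permission was denied.',
  };
  return messages[code] || `Recognition error: ${code}`;
}

// Wiring (recognition and showStatus are assumed to exist elsewhere)
function attachErrorHandler(recognition, showStatus) {
  recognition.onerror = (event) => {
    showStatus(describeRecognitionError(event.error));
  };
}
```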

Compatibility:

  • UI for browsers without support
  • iOS Safari-specific workarounds
  • Tested on Chrome/Edge/Safari
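A minimal feature-detection helper for the unsupported-browser UI, written over a window-like object so the lookup itself stays testable:

```javascript
// Resolve the SpeechRecognition constructor across vendor prefixes
function getSpeechRecognition(w) {
  return w.SpeechRecognition || w.webkitSpeechRecognition || null;
}

// Usage in the browser:
//   const SR = getSpeechRecognition(window);
//   if (!SR) { /* show the unsupported-browser UI */ }
//   else { const recognition = new SR(); }
```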

UX:

  • Visual feedback for recognition state
  • Long-press support for PTT button
  • Timing of translation result display
  • Notification on network errors

The constraint that “WebRTC audio can’t be recognized directly” looks inconvenient at first. But think about it — having a choice of who does the recognition and where isn’t actually bad.

In-browser (Approach A) or server-side (Approach B) each have their merits. Constraints create design freedom.

This blog has a working P2P voice chat tool and a speech recognition test. The DataChannel text send/receive portion should be directly applicable. Translation isn’t implemented yet, but combining them would make it work.

A real-time translation system is technically achievable with “WebRTC + SpeechRecognition + translation API”. The rest is about how much you polish the user experience.