
Why WebRTC Audio Won't Work with the SpeechRecognition API — and What to Do About It

I was building a WebRTC voice call for a side project and thought, “wouldn’t it be cool to add auto-translation?” The plan seemed simple: receive audio over WebRTC, convert it to text with the SpeechRecognition API, then send it to a translation API.

Turns out you can’t pass a WebRTC MediaStream to the SpeechRecognition API. After researching and experimenting, I found three workable approaches, so let me lay them out here.

This blog also has articles on WebRTC signaling via QR code (used in a P2P voice chat tool built on DataChannel) and on stabilizing the Web Speech API on iOS; both are worth reading alongside this one.

The Core Problem: The MediaStream Wall

The SpeechRecognition API only recognizes audio from the local microphone.

// This won't work
peerConnection.ontrack = (event) => {
  const remoteStream = event.streams[0];
  const recognition = new webkitSpeechRecognition();

  // There's no way to pass remoteStream to SpeechRecognition
  recognition.start(); // Only picks up the local microphone
};

A local MediaStream from getUserMedia() works fine, but a remote MediaStream received via RTCPeerConnection.ontrack is out of scope.

Why? By spec, the SpeechRecognition API targets only the user’s audio input device (the microphone). There are security reasons for this, and the browser implementation assumes direct access to the local mic.

In other words, “recognize the remote party’s audio on this side” doesn’t work. You need to rethink the approach.

Approach A: Remote-Side Recognition (Simplest)

Perform speech recognition on the speaking side. Send the transcribed text via DataChannel.

// Sender (the speaker)
const recognition = new webkitSpeechRecognition();
recognition.continuous = true;
recognition.interimResults = false; // final results only
recognition.lang = 'ja-JP';

recognition.onresult = (event) => {
  for (let i = event.resultIndex; i < event.results.length; i++) {
    if (event.results[i].isFinal) {
      const text = event.results[i][0].transcript;

      // Send text via DataChannel
      dataChannel.send(JSON.stringify({
        type: 'transcript',
        lang: 'ja',
        text: text
      }));
    }
  }
};

// Receiver
dataChannel.onmessage = (event) => {
  const data = JSON.parse(event.data);
  if (data.type === 'transcript') {
    // Pass to translation API and display the result
    translateAndDisplay(data.text, data.lang);
  }
};
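The receiver above calls translateAndDisplay, which the snippets leave undefined. A minimal sketch using the free MyMemory API mentioned later in this post (endpoint and response shape per its public docs; the `transcript` element ID is a placeholder):

```javascript
// Build the MyMemory request URL (pure, so it can be tested offline)
function buildMyMemoryUrl(text, sourceLang, targetLang) {
  const pair = `${sourceLang}|${targetLang}`;
  return 'https://api.mymemory.translated.net/get'
    + `?q=${encodeURIComponent(text)}&langpair=${encodeURIComponent(pair)}`;
}

// Hypothetical translateAndDisplay: translate, then render into a
// placeholder #transcript element
async function translateAndDisplay(text, sourceLang, targetLang = 'en') {
  const res = await fetch(buildMyMemoryUrl(text, sourceLang, targetLang));
  const json = await res.json();
  const translated = json.responseData.translatedText; // MyMemory response shape
  document.getElementById('transcript').textContent = translated;
}
```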

Architecture:

[User A] Mic → SpeechRecognition → Text
                                     │
                                     │ DataChannel (text only)
                                     ▼
[User B] ← Translation result ← Translation API ← Received text

Pros:

  • Simple to implement
  • No extra cost (Web Speech API is free)
  • Low latency (only text is transmitted)
  • No server needed (fully P2P)

Cons:

  • Depends on the other party’s browser
  • Recognition quality is up to the browser
  • May be unstable on iOS Safari (more on this below)

For prototypes or casual use, this is more than enough.

Approach B: Server-Side Recognition (Higher Accuracy)

Send audio to a server and recognize it with Whisper API or similar.

// Record audio with MediaRecorder
const mediaRecorder = new MediaRecorder(localStream);
const socket = new WebSocket('wss://your-server.com/transcribe');

mediaRecorder.ondataavailable = (event) => {
  if (event.data.size > 0) {
    // Send audio chunk to server
    socket.send(event.data);
  }
};

mediaRecorder.start(1000); // Send chunks every 1 second

// Receive transcription results (the server calls the Whisper API)
socket.onmessage = (event) => {
  const { text } = JSON.parse(event.data);
  translateAndDisplay(text);
};
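The server half is only hinted at above. A rough Node 18+ sketch, assuming the `ws` package, an `OPENAI_API_KEY` environment variable, and OpenAI's Whisper transcription endpoint; a sketch, not production-ready:

```javascript
// Decide when enough chunks have accumulated to transcribe (pure, testable)
function shouldFlush(chunks, maxChunks = 5) {
  return chunks.length >= maxChunks;
}

function startServer(port = 8080) {
  const { WebSocketServer } = require('ws'); // npm install ws
  const wss = new WebSocketServer({ port });

  wss.on('connection', (socket) => {
    let chunks = [];

    socket.on('message', async (data) => {
      chunks.push(data);
      if (!shouldFlush(chunks)) return;

      // Caveat: MediaRecorder chunks are only decodable from the start of
      // the stream; in practice, restart the recorder per batch client-side.
      const audio = Buffer.concat(chunks);
      chunks = [];

      // OpenAI Whisper transcription endpoint (multipart form, "whisper-1")
      const form = new FormData();
      form.append('file', new Blob([audio], { type: 'audio/webm' }), 'chunk.webm');
      form.append('model', 'whisper-1');

      const res = await fetch('https://api.openai.com/v1/audio/transcriptions', {
        method: 'POST',
        headers: { Authorization: `Bearer ${process.env.OPENAI_API_KEY}` },
        body: form,
      });
      const { text } = await res.json();
      socket.send(JSON.stringify({ text }));
    });
  });
}
```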

Pros:

  • High accuracy (can use commercial services like Whisper API)
  • Choice of language model
  • Logging and analysis possible
  • No browser dependency

Cons:

  • Costs money (pay-per-use)
  • Higher latency (audio upload + API processing)
  • Server management required
  • WebSocket connection management required

Speech recognition API comparison:

| API | Price | Accuracy | Latency |
| --- | --- | --- | --- |
| Web Speech API | Free | Medium | Low |
| Whisper API | $0.006/min | High | Medium |
| Google Speech-to-Text | $0.016/min | High | Low |

For business use or meeting transcription where accuracy matters, this is the better choice.

Approach C: Capturing Audio with AudioContext (Not Recommended)

Capturing a MediaStream with AudioContext and sending the audio to an external API is theoretically possible.

// Experimental: getting audio data via AudioContext
const audioContext = new AudioContext();
const source = audioContext.createMediaStreamSource(remoteStream);
const processor = audioContext.createScriptProcessor(4096, 1, 1);

source.connect(processor);
processor.connect(audioContext.destination);

processor.onaudioprocess = (event) => {
  const audioData = event.inputBuffer.getChannelData(0);

  // This PCM data needs to go somewhere, but...
  // Can't pass it to SpeechRecognition API
  // → You still need an external API
};
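If you do go down this road, the Float32 samples from onaudioprocess usually need converting to 16-bit PCM before uploading to a transcription API; the conversion itself is standard:

```javascript
// Convert Web Audio Float32 samples (-1..1) to 16-bit PCM, the raw format
// most transcription APIs accept for audio uploads
function floatTo16BitPCM(float32) {
  const out = new Int16Array(float32.length);
  for (let i = 0; i < float32.length; i++) {
    const s = Math.max(-1, Math.min(1, float32[i])); // clamp out-of-range samples
    out[i] = s < 0 ? s * 0x8000 : s * 0x7fff;
  }
  return out;
}
```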

Problems:

  • ScriptProcessorNode is deprecated (should use AudioWorklet, which is even more complex)
  • You still need an external API to send PCM data
  • High implementation cost for the same result as Approach B

Documenting this for reference, but I don’t recommend it. Just go with Approach B.

Push-to-Talk Implementation

Continuous recognition can produce spurious results and drains the battery. Using a Push-to-Talk (PTT) button to toggle recognition is more practical.

const recognition = new webkitSpeechRecognition();
recognition.continuous = true;
recognition.interimResults = false;
recognition.lang = 'ja-JP';

const pttButton = document.getElementById('ptt-btn');
let currentState = 'idle'; // idle / starting / listening / processing

pttButton.addEventListener('pointerdown', () => {
  currentState = 'starting';
  updateUI(); // Show "Getting ready..."
  recognition.start();
});

recognition.onstart = () => {
  currentState = 'listening';
  updateUI(); // Show "Go ahead"
};

pttButton.addEventListener('pointerup', () => {
  if (currentState === 'listening') {
    currentState = 'processing';
    updateUI(); // Show "Recognizing..."
    recognition.stop();
  }
});

recognition.onresult = (event) => {
  // Text sending logic
};

recognition.onend = () => {
  currentState = 'idle';
  updateUI(); // Show "Hold button to speak"
};

function updateUI() {
  const messages = {
    idle: '🎤 Hold button to speak',
    starting: '⏳ Getting ready...',
    listening: '🔴 Go ahead',
    processing: '💭 Recognizing...'
  };
  document.getElementById('status').textContent = messages[currentState];
  pttButton.dataset.state = currentState;
}

Key points for state management:

  • Having a starting state lets you tell the user “not yet” until onstart fires
  • Using pointerdown / pointerup handles both touch and mouse
  • Showing “Go ahead” only after onstart fires accounts for iOS’s startup lag

See Stabilizing the Web Speech API on iOS for details.

iOS-Specific Problems and Solutions

Problem 1: Beeping Sound When Creating New Instances

On iOS, calling new webkitSpeechRecognition() each time triggers the startup sound every time, and may even bring up the permission dialog repeatedly.

Solution: Singleton pattern

// Bad: creates a new instance every time → dialog + beep every time
pttButton.addEventListener('pointerdown', () => {
  const recognition = new webkitSpeechRecognition(); // Problem here
  recognition.start();
});

// Good: reuse the same instance
const recognition = new webkitSpeechRecognition();
recognition.continuous = true;
recognition.interimResults = false;
recognition.lang = 'ja-JP';

pttButton.addEventListener('pointerdown', () => {
  recognition.start(); // Same instance
});

Reusing the instance means the permission dialog only appears the first time, and subsequent uses start silently and smoothly.

Problem 2: Microphone Conflict with WebRTC

On iOS, getUserMedia() and SpeechRecognition are treated as separate microphone sessions. One may block the other, or the permission dialog may appear every time.

Solution: Pause with track.enabled

pttButton.addEventListener('pointerdown', () => {
  // Disable the local stream's audio track
  localStream.getAudioTracks().forEach(track => {
    track.enabled = false;
  });

  recognition.start();
});

pttButton.addEventListener('pointerup', () => {
  recognition.stop();
});

recognition.onend = () => {
  // Re-enable the track
  localStream.getAudioTracks().forEach(track => {
    track.enabled = true;
  });
};

While you’re in PTT mode the WebRTC audio goes silent, but since you’re the one speaking at that moment it doesn’t matter — it actually feels natural.

iOS-specific issues summary:

| Problem | Cause | Solution |
| --- | --- | --- |
| Beeping sound | Creating new instances | Singleton |
| Microphone conflict | Simultaneous use with WebRTC | track.enabled = false |
| First recognition fails | Insufficient warm-up | Run getUserMedia first |
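The warm-up fix in the last row can be sketched as follows. warmUpMicrophone is a hypothetical helper, and the user-agent check is only a rough heuristic (iPadOS Safari may report a desktop Macintosh UA):

```javascript
// Rough iOS detection from a user-agent string (heuristic only)
function isIOSSafari(ua) {
  return /iP(hone|ad|od)/.test(ua) && /Safari/.test(ua) && !/CriOS|FxiOS/.test(ua);
}

// Hypothetical warm-up helper: grab the mic once and release it immediately,
// so the permission prompt and audio pipeline are primed before the first
// recognition.start()
async function warmUpMicrophone() {
  try {
    const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
    stream.getTracks().forEach((track) => track.stop());
    return true;
  } catch (err) {
    console.warn('Mic warm-up failed:', err.name);
    return false;
  }
}

// Usage: if (isIOSSafari(navigator.userAgent)) await warmUpMicrophone();
```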

See Stabilizing the Web Speech API on iOS for details.

Translation API Options

Once you have the text, the next step is translation. A few options:

| API | Free tier | Accuracy | Latency | Limits |
| --- | --- | --- | --- | --- |
| DeepL | 500K chars/month | High | Low | Registration required |
| Google Translate | Unofficial only | Medium | Low | Unstable |
| MyMemory | 5,000 chars/day | Low | Low | Per-IP limit |
| OpenAI GPT | $0.002/1K tokens | High | Medium | API key required |

Recommendations:

  • Prototype: MyMemory (free, lenient limits)
  • Production: DeepL (excellent Japanese-English quality, fast responses)
  • Multi-language: Google Translate (widest language coverage)
  • Context-aware: OpenAI GPT (understands conversational context)

This blog’s text translation tool uses the MyMemory API — the most practical option for free testing.

For real-time conversation, DeepL feels best in practice. It handles Japanese well and responds quickly.
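For illustration, a hedged sketch of a DeepL call. The endpoint and body follow DeepL's v2 REST API (the api-free host is for the free tier); key management is up to you:

```javascript
// Build the request (pure, testable)
function buildDeepLRequest(text, targetLang, apiKey) {
  return {
    url: 'https://api-free.deepl.com/v2/translate',
    options: {
      method: 'POST',
      headers: {
        'Authorization': `DeepL-Auth-Key ${apiKey}`,
        'Content-Type': 'application/json',
      },
      body: JSON.stringify({ text: [text], target_lang: targetLang }),
    },
  };
}

async function translateWithDeepL(text, targetLang, apiKey) {
  const { url, options } = buildDeepLRequest(text, targetLang, apiKey);
  const res = await fetch(url, options);
  const json = await res.json();
  return json.translations[0].text; // DeepL response shape
}
```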

Things to watch out for:

  • Don’t send interim results to the translation API — use only isFinal to avoid hammering it
  • Detecting natural speech boundaries is trickier than it sounds
  • If you’re sending text via DataChannel, settle on a protocol design upfront (e.g., {type, lang, text, isFinal})
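The protocol suggestion in the last bullet can be pinned down with a pair of small helpers (the field set {type, lang, text, isFinal} is just the example above, not a standard):

```javascript
// Serialize a transcript message for DataChannel transport
function makeTranscriptMessage(text, lang, isFinal = true) {
  return JSON.stringify({ type: 'transcript', lang, text, isFinal });
}

// Parse and validate an incoming message; returns null on anything malformed
function parseTranscriptMessage(raw) {
  let data;
  try {
    data = JSON.parse(raw);
  } catch {
    return null;
  }
  if (data.type !== 'transcript' || typeof data.text !== 'string') return null;
  return data;
}
```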

Comparing the Three Approaches

How to choose between them.

Decision flow

START

[Q1] Want to avoid costs?
  ├─ YES → Approach A (remote-side recognition)
  └─ NO → Q2

[Q2] Do you need high accuracy?
  ├─ YES → Approach B (server-side recognition)
  └─ NO → Approach A is sufficient

Use case recommendations

| Use case | Recommended approach | Why |
| --- | --- | --- |
| Casual 1-on-1 calls | A (remote-side) | No cost, sufficient accuracy |
| Business meetings | B (server-side) | High accuracy, logging |
| Prototype | A (remote-side) | Fastest to implement |
| Multi-language required | B (server-side) | Flexible model choice |

Trade-offs

| Approach | Cost | Accuracy | Latency | Implementation |
| --- | --- | --- | --- | --- |
| A: Remote-side | Free | Medium | Low | Low |
| B: Server-side | Paid | High | Medium | Medium |
| C: AudioContext | Paid | High | Medium | High |

A practical approach is to start with Approach A and migrate to B if accuracy becomes insufficient.

Implementation Checklist

Things to verify when implementing:

Environment & permissions:

  • Running over HTTPS (except localhost)
  • Microphone permission acquired via user gesture
  • SpeechRecognition created as a singleton on iOS
  • WebRTC tracks temporarily disabled during PTT

Error handling:

  • Handle no-speech error (silence detected)
  • Handle audio-capture error (mic acquisition failure)
  • Handle not-allowed error (permission denied)
  • State recovery on DataChannel disconnect
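The three recognition error codes above can be handled in one place; a sketch (the codes are standard SpeechRecognitionErrorEvent values, showStatus is an assumed UI helper):

```javascript
// Map SpeechRecognition error codes (per the Web Speech API spec) to
// user-facing messages
function describeRecognitionError(code) {
  const messages = {
    'no-speech': 'No speech detected. Hold the button and try again.',
    'audio-capture': 'Could not access the microphone.',
    'not-allowed': 'Microphone permission was denied.',
  };
  return messages[code] || `Recognition error: ${code}`;
}

// Wiring (recognition and showStatus are assumed to exist elsewhere)
function attachErrorHandler(recognition, showStatus) {
  recognition.onerror = (event) => {
    showStatus(describeRecognitionError(event.error));
  };
}
```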

Compatibility:

  • UI for browsers without support
  • iOS Safari-specific workarounds
  • Tested on Chrome/Edge/Safari
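A minimal feature-detection helper for the unsupported-browser UI, written over a window-like object so the lookup itself stays testable:

```javascript
// Resolve the SpeechRecognition constructor across vendor prefixes
function getSpeechRecognition(w) {
  return w.SpeechRecognition || w.webkitSpeechRecognition || null;
}

// Usage in the browser:
//   const SR = getSpeechRecognition(window);
//   if (!SR) { /* show the unsupported-browser UI */ }
//   else { const recognition = new SR(); }
```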

UX:

  • Visual feedback for recognition state
  • Long-press support for PTT button
  • Timing of translation result display
  • Notification on network errors

The constraint that “WebRTC audio can’t be recognized directly” looks inconvenient at first. But think about it — having a choice of who does the recognition and where isn’t actually bad.

In-browser (Approach A) or server-side (Approach B) each have their merits. Constraints create design freedom.

This blog has a working P2P voice chat tool and a speech recognition test. The DataChannel text send/receive portion should be directly applicable. Translation isn’t implemented yet, but combining them would make it work.

A real-time translation system is technically achievable with “WebRTC + SpeechRecognition + translation API”. The rest is about how much you polish the user experience.