How to Stabilize the Web Speech API on iOS

I tried porting the voice chat I built in AI Voice Chat (3) — Finally got it talking to the web. It ran smoothly on a PC, but when I spoke from an iPhone, the input arrived in choppy fragments or recognition stopped responding altogether. Safari on macOS was fine even though it belongs to the same Apple ecosystem; only iOS misbehaved. The Web Speech API on iOS has several known issues, including “stopping on its own,” “buffer clogging,” and “no recognition on the first attempt.”

You could solve this by using paid services like the Whisper API, but here are practical, no-cost countermeasures.

Basic Approach

Singleton instance: don’t create a new recognizer every time (this avoids the repeated system chime).
Push-to-talk: more stable than auto-restarting continuous recognition.
Warm up the mic beforehand: mitigates first-recognition failure.
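
The singleton point can be sketched as a lazy getter that also checks whether recognition is available at all. This is a minimal sketch, not a definitive implementation: `getRecognition` and `sharedRecognition` are hypothetical names, and the `globalObj` parameter exists only so the function can be exercised outside a browser.

```javascript
// Hypothetical helper: create the recognizer on first use and reuse it afterwards.
let sharedRecognition = null;

function getRecognition(globalObj = globalThis) {
  if (sharedRecognition) return sharedRecognition;
  // Safari exposes the constructor with a webkit prefix.
  const Ctor = globalObj.SpeechRecognition || globalObj.webkitSpeechRecognition;
  if (!Ctor) return null; // speech recognition is not available in this browser
  sharedRecognition = new Ctor();
  return sharedRecognition;
}
```

Callers would check for `null` and hide or disable the mic button in that case instead of letting the page throw at load time.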

Implementation Example

// Create a single instance (once, at page load)
const recognition = new (window.SpeechRecognition || window.webkitSpeechRecognition)();
recognition.lang = 'ja-JP';
recognition.interimResults = true;
recognition.continuous = true;

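const btn = document.getElementById('micBtn');

// start() throws InvalidStateError if recognition is already running,
// so guard it. Calling stop() on an idle recognizer is ignored.
function startListening() {
  try {
    recognition.start();
  } catch (err) {
    console.warn('Recognition already running:', err.name);
  }
}

// Push-to-talk (touch). preventDefault also suppresses the emulated mouse events.
btn.addEventListener('touchstart', (e) => {
  e.preventDefault();
  startListening();
});

btn.addEventListener('touchend', () => {
  recognition.stop();
});

// Also stop if the touch is interrupted (touchcancel)
btn.addEventListener('touchcancel', () => recognition.stop());

// Desktop support
btn.addEventListener('mousedown', startListening);
btn.addEventListener('mouseup', () => recognition.stop());
btn.addEventListener('mouseleave', () => recognition.stop());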

// Handle results
recognition.onresult = (event) => {
  for (let i = event.resultIndex; i < event.results.length; ++i) {
    if (event.results[i].isFinal) {
      const text = event.results[i][0].transcript;
      console.log('Recognized:', text);
      // Update the UI here
    }
  }
};

recognition.onerror = (event) => {
  console.warn('Error:', event.error);
};
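
The result handler above logs only final results, even though `interimResults` is enabled. Below is a minimal sketch of separating the two so the UI can show partial text while the user is still speaking. `collectTranscripts` is a hypothetical helper name; it reads only `resultIndex`, `results[i].isFinal`, and `results[i][0].transcript`, the same fields the handler above already uses.

```javascript
// Hypothetical helper: split one `result` event into final and interim text.
function collectTranscripts(event) {
  let finalText = '';
  let interimText = '';
  // Start at resultIndex so results reported by earlier events are skipped.
  for (let i = event.resultIndex; i < event.results.length; ++i) {
    const result = event.results[i];
    const transcript = result[0].transcript; // best alternative
    if (result.isFinal) {
      finalText += transcript;
    } else {
      interimText += transcript;
    }
  }
  return { finalText, interimText };
}

// Possible use in the browser (assumes the `recognition` singleton above):
// recognition.onresult = (event) => {
//   const { finalText, interimText } = collectTranscripts(event);
//   // append finalText to the chat input, show interimText as a preview
// };
```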

Countermeasures for First-Recognition Failure

Warm up the mic in advance

async function warmupMic() {
  try {
    const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
    stream.getTracks().forEach(track => track.stop());
  } catch (e) {
    console.warn('Could not access the microphone (permission may be required):', e.name);
  }
}

// Call on the first user gesture
document.body.addEventListener('click', () => {
  warmupMic();
}, { once: true });

Unlock AudioContext

function unlockAudio() {
  const ctx = new (window.AudioContext || window.webkitAudioContext)();
  const buf = ctx.createBuffer(1, 1, 22050);
  const src = ctx.createBufferSource();
  src.buffer = buf;
  src.connect(ctx.destination);
  src.start(0);
  ctx.resume();
}

Prime recognition with an empty run

function preloadRecognition() {
  recognition.start();
  setTimeout(() => recognition.stop(), 100);
}

Visual Feedback: “You can start talking”

Starting the mic takes a moment, so communicate the wait to the user.

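// Use these in place of the push-to-talk touch handlers from the
// implementation example; registering both sets would call start() twice.
let pressed = false;

btn.addEventListener('touchstart', async (e) => {
  e.preventDefault();
  pressed = true;
  try {
    recognition.start();
  } catch (err) {
    console.warn('Recognition already running:', err.name);
  }
  await new Promise(r => setTimeout(r, 300)); // give the mic time to start
  if (pressed) btn.classList.add('ready'); // show "you can talk now"
});

btn.addEventListener('touchend', () => {
  pressed = false;
  recognition.stop();
  btn.classList.remove('ready');
});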
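
The fixed 300 ms wait is a guess about how long the mic takes to start. The Web Speech API specification defines an `audiostart` event that fires when audio capture begins, so one option is to show the "ready" state when that event arrives and keep the timer only as a fallback. The sketch below assumes iOS Safari fires `audiostart` promptly, which I have not verified; `notifyWhenListening` is a hypothetical helper name.

```javascript
// Hypothetical helper: call onReady once, either when the recognizer reports
// that audio capture has started or after fallbackMs, whichever comes first.
// Returns a cancel function for the case where the user releases early.
function notifyWhenListening(recognition, onReady, fallbackMs = 300) {
  let settled = false;
  let timer = null;

  const cleanup = () => {
    settled = true;
    clearTimeout(timer);
    recognition.removeEventListener('audiostart', onAudioStart);
  };
  const finish = (reason) => {
    if (settled) return;
    cleanup();
    onReady(reason);
  };
  function onAudioStart() {
    finish('audiostart');
  }

  recognition.addEventListener('audiostart', onAudioStart);
  timer = setTimeout(() => finish('timeout'), fallbackMs);
  return cleanup;
}

// Possible use in the browser (assumes `recognition` and `btn` from above):
// let cancelReady = null;
// on touchstart: cancelReady = notifyWhenListening(recognition,
//                  () => btn.classList.add('ready'));
// on touchend:   if (cancelReady) cancelReady();
```

Cancelling on release avoids the "ready" state appearing after the user has already let go of the button.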

continuous: true vs false

Mode	Pros	Cons
`continuous: false` + auto-restart	Tends to be stable on iOS	Brief gap when restarting
`continuous: true` + singleton	Less choppy, fewer sounds	Risk of buffer clogging on iOS

With push-to-talk, continuous: true is fine. If you want it to keep listening automatically, use this hybrid:

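// iPadOS 13+ reports a Mac user agent, so also treat a touch-capable "Mac" as iOS
const isIOS = /iPad|iPhone|iPod/.test(navigator.userAgent) ||
  (navigator.platform === 'MacIntel' && navigator.maxTouchPoints > 1);
recognition.continuous = !isIOS;

// Set to true while listening should continue, false when the user stops it
let shouldBeListening = false;

recognition.onend = () => {
  if (isIOS && shouldBeListening) {
    setTimeout(() => {
      try {
        recognition.start(); // throws if recognition is already running
      } catch (err) {
        console.warn('Restart skipped:', err.name);
      }
    }, 200);
  }
};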
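
The snippet above leaves open where `shouldBeListening` changes, and it would restart forever if recognition keeps ending immediately. Here is a minimal sketch of one way to wrap that logic, assuming a cap on consecutive restarts is acceptable. `createAutoRestart`, `maxRestarts`, and `schedule` are hypothetical names; `not-allowed` and `service-not-allowed` are error codes defined by the Web Speech API specification.

```javascript
// Hypothetical helper: restart recognition after each `end` event while
// listening is wanted, giving up after maxRestarts consecutive restarts
// that produced no result.
function createAutoRestart(recognition, options = {}) {
  const { delayMs = 200, maxRestarts = 5, schedule = setTimeout } = options;
  let wanted = false;
  let restarts = 0;

  const tryStart = () => {
    try {
      recognition.start();
    } catch (err) {
      // start() throws InvalidStateError when recognition is already running.
    }
  };

  // Any recognized speech counts as progress and resets the counter.
  recognition.addEventListener('result', () => { restarts = 0; });

  // A denied permission will not fix itself, so stop retrying.
  recognition.addEventListener('error', (event) => {
    if (event.error === 'not-allowed' || event.error === 'service-not-allowed') {
      wanted = false;
    }
  });

  recognition.addEventListener('end', () => {
    if (!wanted || restarts >= maxRestarts) return;
    restarts += 1;
    schedule(() => { if (wanted) tryStart(); }, delayMs);
  });

  return {
    start() { wanted = true; restarts = 0; tryStart(); },
    stop() { wanted = false; recognition.stop(); },
  };
}
```

The returned `start()` and `stop()` would be wired to the UI controls, so the flag is only ever changed in one place.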

Background Handling

Recognition tends to die when the page goes to the background, so add a guard.

document.addEventListener('visibilitychange', () => {
  if (document.hidden) {
    recognition.stop();
  }
});

window.addEventListener('focus', () => {
  // Resume here if needed
});
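
If listening should pick up again when the user comes back, the resume step can be sketched as follows. Note that this swaps the `focus` listener for the visible side of `visibilitychange`, and it assumes a `controls` object with `isListening()`, `start()`, and `stop()`; both that object and `createVisibilityGuard` are hypothetical. Whether iOS permits `start()` here without a fresh user gesture is also an assumption, so a "tap to resume" prompt may be needed instead.

```javascript
// Hypothetical helper: pause recognition while the page is hidden and resume
// on return, but only if it was listening at the moment the page was hidden.
function createVisibilityGuard(doc, controls) {
  let resumeOnReturn = false;

  doc.addEventListener('visibilitychange', () => {
    if (doc.hidden) {
      resumeOnReturn = controls.isListening();
      if (resumeOnReturn) controls.stop();
    } else if (resumeOnReturn) {
      resumeOnReturn = false;
      controls.start();
    }
  });
}

// Possible use in the browser:
// createVisibilityGuard(document, { isListening, start, stop });
```

With push-to-talk this is rarely needed, since the user must press the button again anyway; it matters mainly for the always-listening mode.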

Conclusion

Perfect reliability is unrealistic; that’s just how the Web Speech API behaves on iOS.
Push-to-talk + singleton + mic warm-up is the pragmatic answer.
For production use, consider paid services such as the Whisper API.