Tech 6 min read

Could I Build a Better Karaoke Scorer Now Than I Did Back Then?

Years ago, I built something at work — a karaoke scoring simulation that ran in the browser. It compared the user’s voice input to the original vocal track and calculated a similarity score. This ranks among my personal “three great ‘you’re building that in a browser?!’ projects.”

The result was… mediocre. A timing offset of just 0.2 seconds at the start was enough to tank the score. Far from practical.

Now I’m wondering: could I build something better today?

Looking Back at the Original Implementation

I still have the spec doc from back then, so let me revisit it.

Technologies Used

  • Media Capture and Streams API to capture microphone input
  • FFT (Fast Fourier Transform) for frequency decomposition
  • Correlation coefficient for similarity scoring

At the time, iOS wasn’t supported, so it only worked on Chrome for desktop and Android. Things should be a bit better now.

Processing Flow

  1. Capture audio from microphone
  2. Run FFT every 0.1–0.3 seconds → store frequency intensities as arrays
  3. Prepare FFT-preprocessed data for the original audio in the same way
  4. Compare the time evolution of each frequency band between input and original
  5. Calculate the correlation coefficient as the similarity score
  6. Compute correlation coefficients across the whole analyzed range (20 Hz–22 kHz, the upper bin being the Nyquist limit of 44.1 kHz audio) and sum them up

What Is FFT?

FFT (Fast Fourier Transform) converts an audio waveform into “how much of each frequency is present.”

An audio waveform looks complex in the time domain, but it’s actually a sum of sine waves at different frequencies. FFT lets you see the breakdown.

X(k) = \sum_{n=0}^{N-1} x(n) \cdot e^{-i \frac{2\pi kn}{N}}

  • x(n): input signal (time domain)
  • X(k): output (frequency domain)
  • N: number of samples

The math looks intimidating, but it’s essentially waveform → frequency spectrum.
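To make that concrete, the sum above transcribes almost line for line into a naive DFT. This is a sketch for intuition only: it's O(N²), whereas the FFT computes the identical result in O(N log N).

```javascript
// Naive DFT: a direct transcription of the formula above.
// O(N^2) — real code uses the FFT, which gives the same result in O(N log N).
function dft(x) {
  const N = x.length;
  const magnitudes = [];
  for (let k = 0; k < N; k++) {
    let re = 0;
    let im = 0;
    for (let n = 0; n < N; n++) {
      const angle = (-2 * Math.PI * k * n) / N;
      re += x[n] * Math.cos(angle);
      im += x[n] * Math.sin(angle);
    }
    magnitudes.push(Math.sqrt(re * re + im * im));
  }
  return magnitudes;
}

// A sine wave with exactly 4 cycles over 64 samples should peak at bin 4.
const N = 64;
const signal = Array.from({ length: N }, (_, n) => Math.sin((2 * Math.PI * 4 * n) / N));
const spectrum = dft(signal);
const firstHalf = spectrum.slice(0, N / 2); // bins above N/2 mirror the lower ones
const peakBin = firstHalf.indexOf(Math.max(...firstHalf));
console.log(peakBin); // → 4
```

You would never ship this loop; it's only here to show that the intimidating sum is a couple of lines of arithmetic.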

Why use FFT for audio comparison?

Comparing raw waveforms is too sensitive to phase differences (the starting position of the wave). FFT decomposes audio into frequency components, enabling comparison by “how much of each pitch is present.”

Waveform: ~~~∿∿∿~~~  →  FFT  →  Frequency spectrum: [440 Hz: 0.8, 880 Hz: 0.3, ...]

For karaoke scoring, this means comparing things like “is the A (440 Hz) note actually being sung?”

Can you do FFT in the browser?

Yes. Back then I used the Media Capture and Streams API, but now the Web Audio API’s AnalyserNode makes it much simpler. More on that later.

Scoring with Correlation Coefficients

Correlation coefficients range from -1 to 1, where values closer to 1 mean “changing in the same way.”

\rho_{AB} = \frac{\text{Cov}(A, B)}{s(A) \cdot s(B)}

  • \text{Cov}(A, B): covariance of A and B
  • s(A), s(B): standard deviations of each

Calculate this coefficient for each frequency band and sum them all up. If the two audio clips are identical, every frequency band has a correlation coefficient of 1, giving the maximum score.

I also reformulated the calculation to use just the inner product of deviation vectors to reduce computation. I derived it myself back then, and looking back at it 10 years later — it checks out. Glad about that.
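As a sketch of that reformulation: subtract each series' mean to get deviation vectors a and b, and the correlation coefficient is just their normalized inner product, ρ = (a·b) / (‖a‖‖b‖), i.e. the cosine similarity of the deviations.

```javascript
// Pearson correlation as a normalized inner product of deviation vectors:
// rho = (a . b) / (|a| * |b|), where a_i = A_i - mean(A), b_i = B_i - mean(B).
function correlation(A, B) {
  const n = A.length;
  const meanA = A.reduce((s, v) => s + v, 0) / n;
  const meanB = B.reduce((s, v) => s + v, 0) / n;
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < n; i++) {
    const a = A[i] - meanA;
    const b = B[i] - meanB;
    dot += a * b;
    normA += a * a;
    normB += b * b;
  }
  return dot / Math.sqrt(normA * normB);
}

console.log(correlation([1, 2, 3], [2, 4, 6])); // → 1  (moves the same way)
console.log(correlation([1, 2, 3], [3, 2, 1])); // → -1 (moves oppositely)
```

Summing this over every frequency band gives the score described above: identical clips score 1 in every band.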

What Went Wrong

It was too sensitive to temporal misalignment.

The correlation coefficient compares values at the same point in time. So:

  • Singing start is 0.2 seconds late → every frame comparison is offset → score collapses
  • Tempo slightly fast or slow mid-song → offset accumulates toward the end

Real karaoke machines handle humans singing with some timing flexibility. My implementation couldn’t.

What I’d Do Today

1. Absorb Timing Offsets with DTW

DTW (Dynamic Time Warping) handles temporal misalignment by stretching and compressing the time axis to find the best match between two sequences. It’s widely used in speech recognition and handwriting recognition.

// Conceptual DTW code
function dtw(seq1, seq2) {
  const n = seq1.length;
  const m = seq2.length;
  const dp = Array(n + 1).fill(null).map(() => Array(m + 1).fill(Infinity));
  dp[0][0] = 0;

  for (let i = 1; i <= n; i++) {
    for (let j = 1; j <= m; j++) {
      const cost = Math.abs(seq1[i - 1] - seq2[j - 1]);
      dp[i][j] = cost + Math.min(
        dp[i - 1][j],     // insertion
        dp[i][j - 1],     // deletion
        dp[i - 1][j - 1]  // match
      );
    }
  }

  return dp[n][m];
}

With DTW, offsets like “singing started 0.2 seconds late” can be absorbed.
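A quick illustration of that claim, repeating the conceptual dtw() from above so the snippet runs standalone: compare a toy pitch contour against a copy of itself delayed by one frame.

```javascript
// Same conceptual DTW as above, repeated so this snippet is self-contained.
function dtw(seq1, seq2) {
  const n = seq1.length;
  const m = seq2.length;
  const dp = Array.from({ length: n + 1 }, () => Array(m + 1).fill(Infinity));
  dp[0][0] = 0;
  for (let i = 1; i <= n; i++) {
    for (let j = 1; j <= m; j++) {
      const cost = Math.abs(seq1[i - 1] - seq2[j - 1]);
      dp[i][j] = cost + Math.min(dp[i - 1][j], dp[i][j - 1], dp[i - 1][j - 1]);
    }
  }
  return dp[n][m];
}

// A toy pitch contour vs. a copy of itself delayed by one frame.
const original = [1, 2, 3, 4, 5, 5, 4, 3, 2, 1];
const delayed  = [1, 1, 2, 3, 4, 5, 5, 4, 3, 2];

// Frame-by-frame comparison: the one-frame shift makes almost every pair differ.
let frameByFrame = 0;
for (let i = 0; i < original.length; i++) {
  frameByFrame += Math.abs(original[i] - delayed[i]);
}
console.log(frameByFrame); // → 8

// DTW warps the time axis, so only the truncated final frame costs anything.
console.log(dtw(original, delayed)); // → 1
```

The correlation-based scorer behaves like the frame-by-frame sum; DTW finds the alignment a human listener hears.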

The downside is that it compares every frame from both sequences in a brute-force fashion, so computation explodes as audio gets longer. Comparing two 1-minute clips requires millions of calculations (O(n×m) complexity). FastDTW, an approximation algorithm, brings this down significantly.
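FastDTW needs a library, but there is a cheap middle ground you can hand-roll: restrict the search to a band of width w around the diagonal (the Sakoe–Chiba band), which cuts the work from O(n×m) to roughly O(n×w). A sketch, not FastDTW itself:

```javascript
// DTW constrained to a Sakoe–Chiba band: only cells within w of the diagonal
// are filled. Requires w >= |n - m| so the band still reaches the final cell.
function dtwBanded(seq1, seq2, w) {
  const n = seq1.length;
  const m = seq2.length;
  const dp = Array.from({ length: n + 1 }, () => Array(m + 1).fill(Infinity));
  dp[0][0] = 0;
  for (let i = 1; i <= n; i++) {
    const jStart = Math.max(1, i - w);
    const jEnd = Math.min(m, i + w);
    for (let j = jStart; j <= jEnd; j++) {
      const cost = Math.abs(seq1[i - 1] - seq2[j - 1]);
      dp[i][j] = cost + Math.min(dp[i - 1][j], dp[i][j - 1], dp[i - 1][j - 1]);
    }
  }
  return dp[n][m];
}

console.log(dtwBanded([1, 2, 3, 4], [1, 2, 3, 4], 1)); // → 0
```

For karaoke this constraint is also musically sensible: a singer may drift, but rarely by more than a few seconds.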

2. Automatically Detect Note Onsets for Sync

Another approach: use onset detection to automatically find the start of notes and sync from there.

Onset detection identifies points where audio energy spikes suddenly (i.e., where a note begins).

function detectOnsets(audioData, ratioThreshold = 1.5, minEnergy = 0.01) {
  const onsets = [];
  const windowSize = 1024;

  for (let i = windowSize; i + windowSize <= audioData.length; i += windowSize) {
    const prevEnergy = calculateEnergy(audioData.slice(i - windowSize, i));
    const currEnergy = calculateEnergy(audioData.slice(i, i + windowSize));

    // Detect a sudden energy increase, ignoring near-silent windows
    if (currEnergy > prevEnergy * ratioThreshold && currEnergy > minEnergy) {
      onsets.push(i);
    }
  }

  return onsets;
}

// RMS energy of one window
function calculateEnergy(window) {
  const sumSquares = window.reduce((sum, s) => sum + s * s, 0);
  return Math.sqrt(sumSquares / window.length);
}

Detect onsets in both the original and input audio, then align the first onsets from each — that eliminates the starting offset.
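One way to wire that up, as a hypothetical helper (not from the original spec): treat the gap between the two first onsets as the start offset and trim it off the input.

```javascript
// Hypothetical sync helper: line up the first detected onsets by trimming
// the input's leading samples. Onsets are sample indices from detectOnsets().
function alignByFirstOnset(originalOnsets, inputOnsets, inputAudio) {
  if (originalOnsets.length === 0 || inputOnsets.length === 0) return inputAudio;
  const offset = inputOnsets[0] - originalOnsets[0];
  // Positive offset: the singer came in late — drop the extra lead-in.
  // (A fuller version would also pad when the singer starts early.)
  return offset > 0 ? inputAudio.slice(offset) : inputAudio;
}

// Original's first note at sample 1000, singer's at 3000 → trim 2000 samples.
const input = new Array(10000).fill(0);
console.log(alignByFirstOnset([1000], [3000], input).length); // → 8000
```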

3. Improve Pitch Detection

Back then I compared the full frequency spectrum via FFT, but for karaoke scoring, comparing just the pitch (fundamental frequency) might make more sense.

Pitch detection methods:

  • Autocorrelation: estimate pitch from the waveform’s autocorrelation
  • YIN: an improved version of autocorrelation with higher accuracy
  • FFT peak detection: take the dominant peak from FFT output (simple, but rough)

// YIN algorithm concept (helper functions elided; threshold is typically ~0.1)
function yin(audioData, sampleRate, threshold = 0.1) {
  // Calculate difference function
  const diff = calculateDifference(audioData);

  // Cumulative mean normalized difference function
  const cmndf = cumulativeMeanNormalizedDifference(diff);

  // Find first valley below threshold
  const tau = findFirstValley(cmndf, threshold);

  // Calculate pitch
  return sampleRate / tau;
}
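Of the three methods, plain autocorrelation is the easiest to sketch end to end. This is the textbook brute-force form; the min/max frequency bounds are my assumption, chosen to keep the lag search within singing range.

```javascript
// Brute-force autocorrelation pitch detection: find the lag at which the
// signal best matches a shifted copy of itself, then convert lag → frequency.
function autocorrelationPitch(audioData, sampleRate, minFreq = 80, maxFreq = 1000) {
  const minLag = Math.floor(sampleRate / maxFreq);
  const maxLag = Math.floor(sampleRate / minFreq);
  let bestLag = minLag;
  let bestCorr = -Infinity;
  for (let lag = minLag; lag <= maxLag; lag++) {
    let corr = 0;
    for (let i = 0; i + lag < audioData.length; i++) {
      corr += audioData[i] * audioData[i + lag];
    }
    if (corr > bestCorr) {
      bestCorr = corr;
      bestLag = lag;
    }
  }
  return sampleRate / bestLag;
}

// A 440 Hz sine at a 44.1 kHz sample rate should come out near 440 Hz
// (not exact: the true period, ~100.2 samples, falls between integer lags).
const sampleRate = 44100;
const samples = Array.from({ length: 4096 }, (_, n) =>
  Math.sin((2 * Math.PI * 440 * n) / sampleRate));
const pitch = autocorrelationPitch(samples, sampleRate);
console.log(pitch); // ≈ 440
```

YIN's difference function is essentially this idea plus a normalization step that suppresses the octave errors autocorrelation is prone to.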

4. Web Audio API Has Made Implementation Much Easier

The spec doc from back then noted “FFT with the Web Audio API is too cumbersome.” Today, the AnalyserNode makes FFT trivial.

const audioContext = new AudioContext();
const analyser = audioContext.createAnalyser();
analyser.fftSize = 2048;

const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
const source = audioContext.createMediaStreamSource(stream);
source.connect(analyser);

// Get FFT output
const frequencyData = new Uint8Array(analyser.frequencyBinCount);
analyser.getByteFrequencyData(frequencyData);

// Get waveform data (the time-domain buffer is fftSize samples long)
const waveformData = new Uint8Array(analyser.fftSize);
analyser.getByteTimeDomainData(waveformData);

That’s all it takes to get FFT frequency data. Much easier than before.

Redesign Proposal

Here’s what I’d build today:

Processing Flow

Input audio    ──→ Onset detection ──→ Pitch sequence extraction ──┐
                                                                   ├──→ DTW matching ──→ Score calculation
Original audio ──→ Onset detection ──→ Pitch sequence extraction ──┘

Scoring

  1. Extract pitch sequences from both audio clips
  2. Find the optimal alignment with DTW
  3. Convert the average pitch difference after alignment into a score

// Conceptual scoring (dtw() as sketched earlier; estimateMaxDistance is a
// placeholder normalization you'd tune by hand)
function calculateScore(pitches1, pitches2) {
  // Align with DTW and take the total warped distance
  const distance = dtw(pitches1, pitches2);

  // Convert normalized distance to a score (0-100)
  const maxDistance = estimateMaxDistance(pitches1.length);
  const score = Math.max(0, 100 - (distance / maxDistance) * 100);

  return score;
}

The original problems:

  • Too sensitive to timing offsets (correlation coefficient compares same-time values)
  • Score collapses if the singing start drifts

What I’d do now:

  • DTW to absorb timing offsets
  • Onset detection to auto-sync the start of singing
  • Focus on pitch detection rather than full frequency spectrum
  • Web Audio API AnalyserNode for much simpler implementation

Whether I’ll actually build the remake… maybe if I feel like it. Could be fun as a Lab tool.