
Making an Explainer Video with Remotion + VOICEPEAK

I put the research from my previous article into practice. I turned the Search Data Structures post into a video.

The Finished Video

11 scenes, about 2 minutes and 23 seconds. It covers Trie, Double Array, Inverted Index, Suffix Array, Suffix Tree, BK-tree, N-gram, Bloom Filter, B+ Tree, and LSM Tree.

Setup

# Create Remotion project
npx degit remotion-dev/template movies-video
cd movies-video
npm install

# For audio analysis
npm install @remotion/media-utils

The VOICEPEAK CLI runs at /Applications/voicepeak.app/Contents/MacOS/voicepeak.

Generating Audio

Generate narration with VOICEPEAK.

# -s is the narration script (here: "This time, I'll introduce
# data structures that speed up search.")
/Applications/voicepeak.app/Contents/MacOS/voicepeak \
  -s "今回は、検索を高速化するためのデータ構造を紹介します。" \
  -n "Japanese Female 1" \
  -o public/audio/intro.wav

List available narrators with --list-narrator.

/Applications/voicepeak.app/Contents/MacOS/voicepeak --list-narrator
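With 11 scenes, generating each file by hand gets tedious. One option is a small Node script that builds one VOICEPEAK invocation per scene — a sketch only: the scene IDs and script text below are illustrative, and `buildCommand` is my own helper; only the `-s`/`-n`/`-o` flags come from the commands above.

```typescript
// Build one VOICEPEAK command per scene. Path and flags match the
// commands shown above; the scene list here is a made-up subset.
const VOICEPEAK = "/Applications/voicepeak.app/Contents/MacOS/voicepeak";
const NARRATOR = "Japanese Female 1";

const scripts: Record<string, string> = {
  intro: "今回は、検索を高速化するためのデータ構造を紹介します。",
  trie: "まずはトライ木です。", // "First up is the trie."
};

function buildCommand(id: string, script: string): string {
  return `"${VOICEPEAK}" -s "${script}" -n "${NARRATOR}" -o public/audio/${id}.wav`;
}

// Print the commands; pipe the output to `sh` to actually run them.
for (const [id, script] of Object.entries(scripts)) {
  console.log(buildCommand(id, script));
}
```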

Implementing Lip Sync

The mechanism switches the character’s mouth image based on audio volume: visualizeAudio from @remotion/media-utils provides per-frame audio levels, and the mouth opens when the average crosses a threshold.

import { useAudioData, visualizeAudio } from "@remotion/media-utils";
import { useCurrentFrame, useVideoConfig } from "remotion";

// Mouth patterns (subtle opening looks more natural)
const openMouths = ["kana_i.png", "kana_u.png", "kana_e.png"];
const closedMouth = "kana_n.png";

// Get volume for the current frame
const frame = useCurrentFrame();
const { fps } = useVideoConfig();
const audioData = useAudioData(audioSrc); // null until the audio has loaded
const visualization = audioData
  ? visualizeAudio({ fps, frame, audioData, numberOfSamples: 32 })
  : [];
const avgVolume =
  visualization.reduce((a, b) => a + b, 0) / (visualization.length || 1);

// Open mouth when above threshold
const isSpeaking = avgVolume > 0.01;

Tuning Points

  • Mouth switching too fast → Switch every few frames (6-frame interval)
  • Opening too wide looks unnatural → Use only i, u, e — exclude a and o
  • Volume smoothing → Average over 4 frames
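The three rules can be condensed into one helper. This is a standalone sketch: `pickMouth` and the per-frame `volumes` history are my own names, and in the actual component the volume comes from visualizeAudio each frame as shown earlier.

```typescript
// Sketch of the three tuning rules above.
const openMouths = ["kana_i.png", "kana_u.png", "kana_e.png"];
const closedMouth = "kana_n.png";
const THRESHOLD = 0.01;
const SWITCH_INTERVAL = 6; // hold each mouth shape for 6 frames
const SMOOTH_FRAMES = 4;   // average volume over the last 4 frames

function pickMouth(frame: number, volumes: number[]): string {
  // Smooth: average the last SMOOTH_FRAMES volume samples
  const recent = volumes.slice(Math.max(0, frame - SMOOTH_FRAMES + 1), frame + 1);
  const avg = recent.reduce((a, b) => a + b, 0) / recent.length;
  if (avg <= THRESHOLD) return closedMouth;
  // Only the subtle shapes (i, u, e), cycled every SWITCH_INTERVAL frames
  return openMouths[Math.floor(frame / SWITCH_INTERVAL) % openMouths.length];
}
```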

Diagram Components

I implemented the data structure diagrams in React/SVG. No need to create slide images.

// Trie example
<svg>
  <circle cx={100} cy={50} r={20} fill="#3b82f6" />
  <text x={100} y={55} fill="#fff" textAnchor="middle">c</text>
  <line x1={100} y1={70} x2={60} y2={120} stroke="#666" />
</svg>

Animation is controlled with useCurrentFrame() and interpolate().

import { interpolate, useCurrentFrame } from "remotion";

const frame = useCurrentFrame();
const opacity = interpolate(frame, [0, 15], [0, 1], { extrapolateRight: "clamp" });
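For intuition, that interpolate call is just a linear map from the frame range [0, 15] onto [0, 1], clamped on the right; a dependency-free model of the same mapping (the helper name fadeIn is mine):

```typescript
// Plain-TS model of interpolate(frame, [0, 15], [0, 1],
// { extrapolateRight: "clamp" }): linear, clamped only on the right.
function fadeIn(frame: number): number {
  return Math.min(1, frame / 15);
}
```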

Scene Composition

The Series component plays scenes sequentially.

const scenes = [
  { title: "Search Data Structures", audioFile: "audio/intro.wav", durationInFrames: 340 },
  { title: "Trie", audioFile: "audio/trie.wav", durationInFrames: 530, slide: <TrieVisualization /> },
  // ...
];

// Sequential playback with Series
<Series>
  {scenes.map((scene, i) => (
    <Series.Sequence key={i} durationInFrames={scene.durationInFrames}>
      <Scene title={scene.title} audioFile={scene.audioFile} slide={scene.slide} />
    </Series.Sequence>
  ))}
</Series>
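One detail the snippet glosses over: the Composition's durationInFrames has to cover the sum of all scene durations, or Series cuts off the tail. A sketch with an abbreviated scene list (the SceneDef type name is mine):

```typescript
// The composition must be as long as all scenes combined.
type SceneDef = { title: string; audioFile: string; durationInFrames: number };

const scenes: SceneDef[] = [
  { title: "Search Data Structures", audioFile: "audio/intro.wav", durationInFrames: 340 },
  { title: "Trie", audioFile: "audio/trie.wav", durationInFrames: 530 },
];

const totalDuration = scenes.reduce((sum, s) => sum + s.durationInFrames, 0);
// Pass this to <Composition durationInFrames={totalDuration} ... />
```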

Character Placement

Waist-up framing (cropped at the waist) feels more settled than bust-up. Bust-up fills the screen too aggressively.

<div style={{
  position: "absolute",
  right: 0,
  bottom: 0,
  width: 500,
  height: 620,
  overflow: "hidden",  // Crop the lower body
}}>
  <Img src={staticFile(mouthImage)} style={{ position: "absolute", top: 0, height: 1100 }} />
</div>

Rendering

# Full version (1080p)
npx remotion render SearchDataStructures out/video.mp4 --codec h264

# Web embed version (720p, lightweight)
npx remotion render SearchDataStructures-web out/video-web.mp4 --codec h264 --crf 30

Version   Resolution   Size
Full      1920x1080    27 MB
Web       1280x720     7.8 MB

Takeaways

  • No slide images needed: SVG/React is more than enough for visualizing data structures
  • Lip sync tuning matters: don’t open the mouth too wide, and keep the switching rate low
  • Now I understand why “Yukkuri”-style character videos are so popular: bust-up framing is just too in-your-face

With better assets, you could generate videos just by writing the outline.