# Making an Explainer Video with Remotion + VOICEPEAK
I put the research from my previous article into practice. I turned the Search Data Structures post into a video.
## The Finished Video
11 scenes, about 2 minutes and 23 seconds. It covers Trie, Double Array, Inverted Index, Suffix Array, Suffix Tree, BK-tree, N-gram, Bloom Filter, B+ Tree, and LSM Tree.
## Setup
```sh
# Create Remotion project
npx degit remotion-dev/template movies-video
cd movies-video
npm install

# For audio analysis
npm install @remotion/media-utils
```
The VOICEPEAK CLI lives at `/Applications/voicepeak.app/Contents/MacOS/voicepeak`.
## Generating Audio
Generate the narration with VOICEPEAK:
```sh
/Applications/voicepeak.app/Contents/MacOS/voicepeak \
  -s "今回は、検索を高速化するためのデータ構造を紹介します。" \
  -n "Japanese Female 1" \
  -o public/audio/intro.wav
```
List the available narrators with `--list-narrator`:

```sh
/Applications/voicepeak.app/Contents/MacOS/voicepeak --list-narrator
```
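Rather than running that command once per scene by hand, the invocation can be scripted. A minimal Node sketch using the flags shown above; `buildVoicepeakArgs` and the `lines` table are my own scaffolding, not part of the VOICEPEAK CLI:

```typescript
// Batch-generate scene narrations by shelling out to the VOICEPEAK CLI.
import { execFileSync } from "node:child_process";
import { existsSync } from "node:fs";

const VOICEPEAK = "/Applications/voicepeak.app/Contents/MacOS/voicepeak";

// Assemble the -s / -n / -o argument list for one narration line.
function buildVoicepeakArgs(text: string, narrator: string, outFile: string): string[] {
  return ["-s", text, "-n", narrator, "-o", outFile];
}

// One entry per scene: [narration text, output file]
const lines: Array<[string, string]> = [
  ["今回は、検索を高速化するためのデータ構造を紹介します。", "public/audio/intro.wav"],
  // ...
];

// Only attempt generation when the CLI is actually installed.
if (existsSync(VOICEPEAK)) {
  for (const [text, out] of lines) {
    execFileSync(VOICEPEAK, buildVoicepeakArgs(text, "Japanese Female 1", out));
  }
}
```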
## Implementing Lip Sync
The mechanism switches the character's mouth image based on audio volume: `visualizeAudio` from `@remotion/media-utils` returns the volume level, and the mouth opens when it crosses a threshold.
```ts
import { useAudioData, visualizeAudio } from "@remotion/media-utils";

// Mouth patterns (subtle opening looks more natural)
const openMouths = ["kana_i.png", "kana_u.png", "kana_e.png"];
const closedMouth = "kana_n.png";

// Get volume
const audioData = useAudioData(audioSrc);
const visualization = visualizeAudio({ fps, frame, audioData, numberOfSamples: 32 });
const avgVolume = visualization.reduce((a, b) => a + b, 0) / visualization.length;

// Open mouth when above threshold
const isSpeaking = avgVolume > 0.01;
```
### Tuning Points

- Mouth switching too fast → switch only every few frames (a 6-frame interval)
- Opening the mouth too wide looks unnatural → use only `i`, `u`, `e` and exclude `a` and `o`
- Volume smoothing → average over 4 frames
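The tuning points above can be sketched as two small pure functions. `smoothVolume` and `pickMouth` are hypothetical helper names, not Remotion APIs; the 0.01 threshold, 6-frame interval, and 4-frame window come from the values in this article:

```typescript
// Sketch of the lip-sync tuning: smooth the volume, then pick a mouth
// image, only changing the open shape every few frames.
const OPEN_MOUTHS = ["kana_i.png", "kana_u.png", "kana_e.png"];
const CLOSED_MOUTH = "kana_n.png";
const THRESHOLD = 0.01;
const SWITCH_INTERVAL = 6; // change the open mouth shape every 6 frames
const SMOOTH_WINDOW = 4;   // average volume over the last 4 frames

// Average the most recent per-frame volumes to suppress flicker.
function smoothVolume(history: number[]): number {
  const recent = history.slice(-SMOOTH_WINDOW);
  return recent.reduce((a, b) => a + b, 0) / recent.length;
}

// Closed mouth below the threshold; otherwise cycle through the open
// shapes, advancing only once per SWITCH_INTERVAL frames.
function pickMouth(frame: number, smoothedVolume: number): string {
  if (smoothedVolume <= THRESHOLD) return CLOSED_MOUTH;
  const step = Math.floor(frame / SWITCH_INTERVAL);
  return OPEN_MOUTHS[step % OPEN_MOUTHS.length];
}
```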
## Diagram Components
I implemented the data structure diagrams directly in React/SVG, so there was no need to create slide images.
```tsx
// Trie example
<svg>
  <circle cx={100} cy={50} r={20} fill="#3b82f6" />
  <text x={100} y={55} textAnchor="middle" fill="#fff">c</text>
  <line x1={100} y1={70} x2={60} y2={120} stroke="#666" />
</svg>
```
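Hand-placing each `<circle>` and `<line>` gets tedious past a few nodes, so the coordinates can be computed from the tree itself. A minimal sketch of a level-based layout; `TrieNode`, `layout`, and the spacing constants are my own assumptions, not from the article's components:

```typescript
// Compute (x, y) positions for trie nodes: leaves get evenly spaced
// x slots, parents are centered over their children, y grows by depth.
type TrieNode = { label: string; children: TrieNode[] };
type Placed = { label: string; x: number; y: number };

function layout(root: TrieNode, xGap = 80, yGap = 70): Placed[] {
  const out: Placed[] = [];
  let nextLeafX = 0;
  const visit = (n: TrieNode, depth: number): number => {
    let x: number;
    if (n.children.length === 0) {
      x = nextLeafX;          // next free leaf slot
      nextLeafX += xGap;
    } else {
      const xs = n.children.map((c) => visit(c, depth + 1));
      x = (xs[0] + xs[xs.length - 1]) / 2; // center over children
    }
    // 100 / 50 offset the tree into the SVG viewport
    out.push({ label: n.label, x: x + 100, y: depth * yGap + 50 });
    return x;
  };
  visit(root, 0);
  return out;
}
```

Each returned entry maps straight onto a `<circle>`/`<text>` pair like the one above.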
Animation is controlled with `useCurrentFrame()` and `interpolate()`.
```ts
import { interpolate, useCurrentFrame } from "remotion";

const frame = useCurrentFrame();
const opacity = interpolate(frame, [0, 15], [0, 1], { extrapolateRight: "clamp" });
```
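For intuition, here is a two-point re-implementation of what that call computes. Remotion's real `interpolate()` supports multi-segment ranges and separate left/right extrapolation options; this sketch simply clamps both ends:

```typescript
// Linear interpolation with clamping: map `frame` from [f0, f1]
// onto [v0, v1], holding the end values outside the range.
function lerpClamp(
  frame: number,
  [f0, f1]: [number, number],
  [v0, v1]: [number, number]
): number {
  const t = (frame - f0) / (f1 - f0);     // progress through the range
  const clamped = Math.min(Math.max(t, 0), 1);
  return v0 + (v1 - v0) * clamped;
}
```

So a 15-frame fade-in at 30 fps reaches full opacity after half a second and stays there.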
## Scene Composition
The `Series` component plays the scenes sequentially.
```tsx
const scenes = [
  { title: "Search Data Structures", audioFile: "audio/intro.wav", durationInFrames: 340 },
  { title: "Trie", audioFile: "audio/trie.wav", durationInFrames: 530, slide: <TrieVisualization /> },
  // ...
];

// Sequential playback with Series
<Series>
  {scenes.map((scene, i) => (
    <Series.Sequence key={i} durationInFrames={scene.durationInFrames}>
      <Scene title={scene.title} audioFile={scene.audioFile} slide={scene.slide} />
    </Series.Sequence>
  ))}
</Series>
```
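The hard-coded `durationInFrames` values (340, 530, …) can also be derived from each narration's audio length. A sketch assuming 30 fps and a short tail pause; both helper names are my own:

```typescript
// Derive scene durations from narration length instead of hard-coding.
const FPS = 30;
const TAIL_PADDING = 15; // half a second of silence after the narration

// Round the audio length up to whole frames and add the pause.
function secondsToFrames(audioSeconds: number, fps: number = FPS): number {
  return Math.ceil(audioSeconds * fps) + TAIL_PADDING;
}

// The composition's total length is just the sum of the scene durations.
function totalDurationInFrames(scenes: { durationInFrames: number }[]): number {
  return scenes.reduce((sum, s) => sum + s.durationInFrames, 0);
}
```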
## Character Placement
Waist-up framing (cropped at the waist) feels more settled than bust-up, which fills the screen too aggressively.
```tsx
<div
  style={{
    position: "absolute",
    right: 0,
    bottom: 0,
    width: 500,
    height: 620,
    overflow: "hidden", // Crop the lower body
  }}
>
  <Img src={staticFile(mouthImage)} style={{ position: "absolute", top: 0, height: 1100 }} />
</div>
```
## Rendering
```sh
# Full version (1080p)
npx remotion render SearchDataStructures out/video.mp4 --codec h264

# Web embed version (720p, lightweight)
npx remotion render SearchDataStructures-web out/video-web.mp4 --codec h264 --crf 30
```
| Version | Resolution | Size |
|---|---|---|
| Full | 1920x1080 | 27 MB |
| Web | 1280x720 | 7.8 MB |
## Takeaways

- No slide images needed: SVG/React is more than enough for visualizing data structures
- Lip sync tuning matters: don't open the mouth too wide, and keep the switching rate low
- Now I understand why "Yukkuri"-style character videos are so popular: bust-up framing is just too in-your-face
With better assets, you could generate videos just by writing the outline.