
NII's 48,000-Hour Audio Dataset Is Raw Material for TTS


NII/LLMC released CC Audio and Archive.org Audio Dataset as a Japanese “audio corpus.”
Open it up, though, and it reads less like a dataset and more like an index plus retrieval procedure for getting at 48,000+ hours of Japanese audio resources.

NII’s press release frames this as enabling “R&D for next-generation speech generation AI and speech recognition AI.”
The natural question is whether you can build TTS (Text-to-Speech) from this, or whether it’s for audio embedding analysis.

Getting Audio from URL Lists

The core deliverable is a list of audio URLs, metadata, and a downloader.
The actual audio files are not in the repository.

The workflow for using this looks like:

flowchart TD
    A[URL list and metadata] --> B[Fetch audio via downloader]
    B --> C[Check terms of use and reachability]
    C --> D[Segment and normalize audio]
    D --> E[Filter by language and sound type]
    E --> F[Training data for ASR, TTS, or audio models]

Because the bulky audio binaries aren't redistributed, the repository stays small.
On the other hand, link rot, download failures, upstream terms of use, format inconsistencies, duplicates, and noise all land on your plate.
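
In practice the fetch step is a resumable loop with failure logging. A minimal sketch, assuming the URL list has been exported to an entries.jsonl file with an audio_url field (the file name and the .mp3 extension are my assumptions; the released downloader handles this properly):

# Minimal fetch loop over a URL list -- a sketch, not the released downloader.
# Assumes entries were exported to entries.jsonl with an "audio_url" field;
# the fixed .mp3 extension is also an assumption.
import json
import pathlib
import urllib.request

out_dir = pathlib.Path("audio")
out_dir.mkdir(exist_ok=True)

with open("entries.jsonl", encoding="utf-8") as f:
    for i, line in enumerate(f):
        entry = json.loads(line)
        dest = out_dir / f"{i:06d}.mp3"
        if dest.exists():
            continue  # resume-friendly: skip already-fetched files
        try:
            with urllib.request.urlopen(entry["audio_url"], timeout=30) as resp:
                dest.write_bytes(resp.read())
        except Exception as e:
            # Link rot and server errors are expected; log and move on.
            print(f"skip {entry['audio_url']}: {e}")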

NII’s announcement explicitly states that the audio files themselves are not included in the repository and that responsibility for complying with each original source’s terms of use lies with the user.
The metadata is licensed for information analysis purposes as defined by Article 30-4 of Japan’s Copyright Act.

No Pre-Computed Embeddings

For anyone wondering “what about the vectors?”—this release does not distribute pre-computed embeddings.
What NII describes in the press release is audio URLs, metadata, a downloader, and Whisper-AT-based acoustic event classification in 10-second segments.

Embeddings are feature representations produced by running audio through a model.
Even for the same audio file, the right vectorization depends on what you're trying to do:

| Purpose | Representation | Use case |
| --- | --- | --- |
| Speech recognition | Frame-level phoneme/word features | Audio to text |
| Speaker recognition | Speaker identity embeddings | Same-speaker matching, speaker clustering |
| TTS | Speech tokens, speaker embeddings, prosody features | Text to speech |
| Acoustic event classification | Sound-type features | Classifying music, speech, ambient noise |
| Search | Whole-audio semantic vectors | Similar audio retrieval, multimodal RAG |
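
To make the layer difference concrete: the same file yields different vectors depending on which part of a model you read out. A sketch using Whisper's encoder via Hugging Face transformers, where the model choice and the mean pooling are illustrative assumptions rather than anything this release prescribes:

# One audio file, two representations -- a sketch using Whisper's encoder.
# Model choice and mean pooling are illustrative, not part of the NII release.
import torch
import librosa
from transformers import WhisperProcessor, WhisperModel

processor = WhisperProcessor.from_pretrained("openai/whisper-small")
model = WhisperModel.from_pretrained("openai/whisper-small")

audio, sr = librosa.load("sample.mp3", sr=16000)  # Whisper expects 16 kHz
inputs = processor(audio, sampling_rate=sr, return_tensors="pt")

with torch.no_grad():
    frames = model.encoder(inputs.input_features).last_hidden_state

frame_level = frames[0]          # (time, dim): recognition-style features
whole_audio = frames[0].mean(0)  # (dim,): one vector for similarity search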

When I looked at Sentence Transformers v5.4’s multimodal embedding, the design of feeding audio into the same encode() call came up there too.
That was about “the model/API that turns inputs into vectors.” NII’s dataset is about “the audio resources that become those inputs.”
Different layer entirely.

You Can Build TTS from This, but It’s Not TTS-Ready

Looking at 48,000 hours of audio, it seems like you could train a TTS model directly.
But what TTS needs isn’t just audio volume.

TTS training requires text-audio alignment.
For Japanese specifically, readings, pitch accents, numeral/symbol pronunciation rules, speaker IDs, recording quality, silence segments, BGM contamination, and multi-speaker separation all matter.
Podcast recordings and archive audio don’t come with any of that sorted out.
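
The readings-and-accents part alone is a pipeline stage of its own. A sketch of the G2P step with pyopenjtalk, one common choice for Japanese front ends (nothing like this ships with the dataset):

# Japanese G2P for the TTS front end -- a sketch with pyopenjtalk, one
# common choice; the dataset itself ships nothing like this.
import pyopenjtalk

text = "2024年4月1日に発表された。"
phonemes = pyopenjtalk.g2p(text)                # phoneme string, numerals read out
labels = pyopenjtalk.extract_fullcontext(text)  # full-context labels incl. accent info
print(phonemes)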

MioTTS built a codec that splits audio into content tokens and a global embedding, then trained an LLM-based TTS on top.
sarashina2.2-tts extracts speech tokens, zero-shot embeddings, and acoustic features from reference audio to condition the model.
Building a TTS model requires a substantial audio processing pipeline before the dataset even enters the picture.

This dataset adds to the raw material supply.
You run ASR (Automatic Speech Recognition) to generate transcripts, VAD (Voice Activity Detection) to isolate speech segments, speaker diarization to separate speakers, and quality filters to strip BGM and noise.
Only after all that does it become practical to feed the data into TTS or voice dialogue model training.
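
As one concrete take on the ASR-plus-VAD portion of that chain, here is a sketch with faster-whisper, whose vad_filter option runs Silero VAD before transcription. Diarization and BGM removal would be additional stages, and none of this ships with the release:

# ASR + VAD pass -- a sketch with faster-whisper (vad_filter uses Silero VAD
# under the hood). Diarization and BGM filtering would be separate stages.
from faster_whisper import WhisperModel

model = WhisperModel("large-v3", device="cuda", compute_type="float16")

# vad_filter drops non-speech spans before transcription.
segments, info = model.transcribe("episode.mp3", language="ja", vad_filter=True)

clips = [
    {"start": seg.start, "end": seg.end, "text": seg.text}
    for seg in segments
    if seg.no_speech_prob < 0.5  # crude quality gate on Whisper's own estimate
]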

Training Material for LLM-jp-Moshi

NII states it will use this data to improve LLM-jp-Moshi.
LLM-jp-Moshi is a Japanese simultaneous bidirectional voice dialogue model, built by fine-tuning Moshi—an English 7B full-duplex voice dialogue model—with Japanese voice dialogue data.

“Simultaneous bidirectional” means the model listens and responds while the user is still talking.
This is different from the serial pipeline of “run ASR, feed into LLM, read back with TTS.” Turn-taking and interruptions are built into the model architecture itself.

The voice API survey I wrote earlier on this blog was about how to combine STT, LLM, and TTS from the application layer.
LLM-jp-Moshi operates at a lower level, learning voice dialogue as a model rather than orchestrating components.
At that level, you need more than clean read-aloud audio—conversations, pauses, backchannels, non-verbal vocalizations, and noisy real-world recordings all contribute.

The Grunt category in NII’s Whisper-AT classification covers non-verbal human vocalizations like throat clearing and breathing.
In standalone TTS, these are often noise to be removed. In voice dialogue models, they contribute to making conversation feel human.

Archive.org Side Is Heavy on Music and Environmental Sound

CC Audio draws on RSS feeds from the Common Crawl 2025-18 snapshot, yielding audio URLs, metadata, and a downloader.
Podcast audio dominates, spanning English, Spanish, German, Japanese, and 20+ other languages; the Japanese subset alone covers roughly 24,000 hours.

Archive.org Audio Dataset also covers roughly 24,000 hours, but targets Japanese content exclusively.
The content isn’t all speech, though.
NII’s announcement reports that music accounts for about 50%, speech about 7%, with the rest including animal sounds, vehicle noise, and other audio.

If you’re expecting TTS material, the music and environmental sounds on the Archive.org side look like a detour.
But audio AI broadly needs the ability to distinguish and recognize music, environmental sound, and non-verbal vocalizations.
Video generation AI and audio-visual generation are also moving toward handling dialogue, environmental sound, and effects simultaneously—a direction that connects to the challenges seen in MOVA’s simultaneous video-audio generation.

What’s Actually in There

Each CC Audio entry contains an audio URL, title, description, language code, and source page URL.
It also includes Whisper-generated transcript segments (start/end timestamps and text), so there’s a rough text-audio alignment out of the box.
Japanese entries number about 56,700.

Looking at Whisper-AT classification for Japanese samples: None (unclassifiable) 44.1%, Music 17.9%, Speech 14.5%, Grunt 13.3%, Animal 4.9%.
Speech at only 14.5% makes sense because even podcast audio contains BGM, jingles, and silence segments.
Grunt at 13.3% represents breaths, backchannels, and hums—nearly as much as classified Speech.
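
Reproducing that kind of breakdown on a subset is a simple tally over the 10-second segment labels. Note that the audio_tags and label field names below are my assumptions about the metadata schema, not verified names:

# Tally top-level Whisper-AT labels across 10-second segments -- a sketch.
# The "audio_tags" and "label" field names are assumed, not verified.
import json
from collections import Counter

counts = Counter()
with open("cc_audio_ja.jsonl", encoding="utf-8") as f:
    for line in f:
        entry = json.loads(line)
        counts.update(seg["label"] for seg in entry.get("audio_tags", []))

total = sum(counts.values())
for label, n in counts.most_common():
    print(f"{label:10s} {n / total:6.1%}")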

The Archive.org side has about 245,000 entries, with identifier, title, description, and language in the metadata.
Unlike CC Audio, there are no Whisper transcriptions.

Sorting Archive.org’s Japanese audio by download count, the top entries include game soundtracks (AKIRA, Final Fantasy X/XIII, Persona 4/5, Super Mario Galaxy), anime soundtracks (Urusei Yatsura, Ultraman), LibriVox readings (Hyakunin Isshu, Oku no Hosomichi, Natsume Soseki’s “Kokoro,” Dazai Osamu’s “Good-Bye”), Japanese language textbooks (Minna no Nihongo), Vocaloid compilations, and classical koto recordings.
Game and anime soundtracks make up a thick layer, which tracks with the 50% Music figure from the previous section.
The 7% Speech is mostly readings and language textbooks—almost no raw conversational recordings.

Whisper-AT’s AudioSet taxonomy itself has 527 categories.
Human voice alone splits into male, female, child, whisper, shout, laughter, and crying. Instruments break down from strings, keyboards, percussion, and wind to individual instrument names. Animals go from dogs and cats to birds and frogs.
In this dataset, however, only the top-level categories (Music, Speech, Grunt, Animal, Vehicle, etc.) are used, not all 527.

Looking at actual Archive.org entries with their identifier gives a more concrete picture of the dataset’s contents.
The identifier maps to https://archive.org/details/{identifier} for the original page.

| identifier | Content | Downloads |
| --- | --- | --- |
| 65-dai-18-ka-mondai-2 | Minna no Nihongo I 2nd Ed. Audio CD | 355K |
| hyakunin_isshu_librivox | Hyakunin Isshu reading (LibriVox) | 226K |
| oku_no_hosomichi_librivox | Oku no Hosomichi reading (LibriVox) | 197K |
| furusato_1308_librivox | Furusato reading (Shimazaki Toson, LibriVox) | 129K |
| AKIRAOriginalSoundtrack | AKIRA Soundtrack | 120K |
| hikibiki_podcast | Hiikibiki Podcast | 117K |
| 25_20200430 | JLPT N4 Sentence Patterns & Examples | 99K |
| john_22 | Japanese Bible Old/New Testament MP3 | 82K |
| kokoro_natsume_um_librivox | Kokoro reading (Natsume Soseki, LibriVox) | 81K |
| meian_1403_librivox | Meian reading (Natsume Soseki, LibriVox) | 79K |
| smap-singles | SMAP Singles Collection | 63K |
| flying-beagle | Flying Beagle (Himiko Kikuchi) | 53K |
| final-fantasy-x-original-soundtrack | FF X Soundtrack | 52K |

Language textbooks and LibriVox readings dominate by download count.
These tend to be clean, single-speaker utterances, the most practical layer for ASR and TTS.
Soundtracks and J-POP form the bulk of the Music 50% but require heavy preprocessing for speech model use.
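
The identifiers in the table resolve to more than detail pages: archive.org also exposes a public metadata endpoint, so listing an entry's audio files takes a few lines. A minimal sketch:

# Resolve an Archive.org identifier to its audio files via the public
# metadata endpoint (https://archive.org/metadata/{identifier}).
import json
import urllib.request

identifier = "hyakunin_isshu_librivox"
url = f"https://archive.org/metadata/{identifier}"

with urllib.request.urlopen(url, timeout=30) as resp:
    meta = json.load(resp)

for f in meta["files"]:
    if f["name"].endswith(".mp3"):
        print(f"https://archive.org/download/{identifier}/{f['name']}")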

A single CC Audio entry is structured as podcast metadata with Whisper transcripts attached at the segment level.

{
  "audio_url": "https://example.com/podcast/ep42.mp3",
  "title": "Tech Chat ep.42",
  "description": "Recent trends in LLMs",
  "language": "ja",
  "page_url": "https://example.com/podcast/ep42",
  "transcript_segments": [
    {"start": 0.0, "end": 10.5, "text": "Hello, this is episode 42 of Tech Chat"},
    {"start": 10.5, "end": 22.3, "text": "Today we're going to talk about LLMs"}
  ],
  "transcript": "Hello, this is episode 42 of Tech Chat. Today we're going to talk about LLMs..."
}

The transcript_segments include start/end seconds per segment, so temporal audio-text alignment is available.
Since these are Whisper’s automatic transcriptions, expect noise like misrecognized proper nouns and hallucinated text over BGM segments.
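
Given per-segment timestamps, slicing training clips out of an episode is mechanical. A sketch that shells out to ffmpeg (assumed to be installed and on PATH), reusing the segments above:

# Cut one training clip per transcript segment -- a sketch that shells out
# to ffmpeg (assumed to be installed and on PATH).
import subprocess

segments = [
    {"start": 0.0, "end": 10.5, "text": "Hello, this is episode 42 of Tech Chat"},
    {"start": 10.5, "end": 22.3, "text": "Today we're going to talk about LLMs"},
]

for i, seg in enumerate(segments):
    subprocess.run([
        "ffmpeg", "-y",
        "-i", "ep42.mp3",
        "-ss", str(seg["start"]), "-to", str(seg["end"]),
        "-ar", "16000", "-ac", "1",  # 16 kHz mono, a common ASR/TTS format
        f"clip_{i:03d}.wav",
    ], check=True)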

A Large Index for Research

This release doesn’t give you a ready-made TTS API.
But as a URL index of Japanese audio and acoustic data, it’s substantial.

Looking at training data scales for large audio models, tens of thousands of hours is no longer an extraordinary number.
NII’s paper “Construction of a Large-Scale Audio Dataset Using Common Crawl” cites Whisper at 680,000 hours, Moshi at 7 million hours, and Kimi-Audio at over 13 million hours.
Having access to 48,000 hours of Japanese audio isn’t world-record scale on its own—it’s a practical foothold for filling gaps in Japanese-language data.

For an individual to download the full set and train on it, the storage, bandwidth, licensing terms, cleaning effort, and compute requirements are heavy.
The natural audience is labs and companies building Japanese speech recognition, voice dialogue, acoustic event classification, audio embeddings, or preprocessed TTS corpora.
