MimikaStudio - a local TTS app that unifies multiple engines in one GUI
Running TTS locally usually means writing a separate Python script for Qwen3-TTS, setting up a different environment for Chatterbox, and handling per-engine setup one engine at a time. MimikaStudio pulls all of that into a single GUI app that covers everything from voice cloning to audiobook creation.
- Repository: BoltzmannEntropy/MimikaStudio
- Official site: mimikastudio.github.io
- License: BSL-1.1 for source code / proprietary license for binaries
- Version: 2026.02
What it can do
| Feature | Description |
|---|---|
| Voice cloning | Clone a voice from 3 seconds of reference audio. Share voice libraries across engines |
| Text-to-speech | Generate TTS with preset or custom voices, including style instructions |
| PDF reading | Read aloud while highlighting each sentence |
| Audiobook creation | Convert PDF, EPUB, TXT, Markdown, and DOCX to WAV, MP3, or M4B with chapter markers |
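To make "chapter markers" concrete: one common way to attach chapters to an M4B file is an ffmpeg metadata file. The sketch below builds such a file from chapter titles and durations. It illustrates the general technique only and is not taken from MimikaStudio's code; the function names and the sample chapters are hypothetical.

```python
def _escape(value):
    """Backslash-escape the characters that are special in FFMETADATA files."""
    for ch in ("\\", "=", ";", "#"):
        value = value.replace(ch, "\\" + ch)
    return value.replace("\n", "\\\n")


def build_ffmetadata(chapters):
    """Build FFMETADATA text from (title, duration_seconds) pairs.

    Chapters are laid out back to back, starting at 0 ms.
    """
    lines = [";FFMETADATA1"]
    start_ms = 0
    for title, duration_s in chapters:
        end_ms = start_ms + int(round(duration_s * 1000))
        lines += [
            "[CHAPTER]",
            "TIMEBASE=1/1000",  # START/END below are in milliseconds
            f"START={start_ms}",
            f"END={end_ms}",
            f"title={_escape(title)}",
        ]
        start_ms = end_ms
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical chapter list: (title, duration in seconds)
    print(build_ffmetadata([("Chapter 1", 12.5), ("Chapter 2", 30.0)]))
```

ffmpeg can read a file like this as an extra input and merge it into the audio file through its `-map_metadata` option.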
Built-in engines
MimikaStudio includes four TTS engines.
Kokoro (82M parameters)
A lightweight and fast English TTS. It ships with 21 UK and US English voices. On Apple Silicon, it can run on the Metal GPU with latency below 200 ms. Japanese is not supported.
Qwen3-TTS (0.6B / 1.7B)
As I wrote in an earlier article, this is the open-source TTS developed by Alibaba’s Qwen team. In MimikaStudio, you can use both the Base model for voice cloning and the CustomVoice model for preset voices.
- Voice cloning: clone a voice from 3 seconds of reference audio, with support for 10 languages
- CustomVoice: 9 preset speakers such as Ryan, Aiden, Vivian, and Ono Anna
- Style instructions: control emotion and delivery with prompts such as “whisper softly”
Chatterbox
A multilingual TTS developed by Resemble AI. It supports 23 languages, the broadest language coverage of the four engines.
Supported languages: Arabic, Chinese, Danish, English, Finnish, French, German, Greek, Hebrew, Hindi, Italian, Japanese, Korean, Malay, Dutch, Norwegian, Polish, Portuguese, Russian, Spanish, Swedish, Turkish, and Swahili
On Apple Silicon it runs on the CPU. Hebrew TTS requires the Dicta ONNX model, which is about 1.1GB and is downloaded automatically during installation.
IndexTTS-2
A zero-shot TTS model. It reports state-of-the-art scores for word error rate (WER), speaker similarity, and emotional expressiveness. It suits use cases such as video dubbing, where precise duration control matters. The model is large, at around 24GB.
Engine comparison
| Engine | Parameters | Japanese | Voice cloning | Languages | Strength |
|---|---|---|---|---|---|
| Kokoro | 82M | ❌ | ❌ | 1 (English only) | Fast and light |
| Qwen3-TTS | 0.6B / 1.7B | ✅ | ✅ | 10 | Balanced |
| Chatterbox | - | ✅ | ✅ | 23 | Multilingual |
| IndexTTS-2 | - (model size ~24GB) | - | ✅ | - | High quality, large |
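If you want to pick an engine programmatically, the comparison table can be written down as data. This is a minimal sketch that mirrors the table above; the `Engine` class and `pick_engines` function are hypothetical names of mine, not part of MimikaStudio. Cells the table leaves blank are stored as `None` and treated as "not confirmed".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Engine:
    name: str
    japanese: Optional[bool]  # None = the table does not say
    cloning: bool
    languages: Optional[int]  # None = the table does not say


# Values copied from the comparison table.
ENGINES = [
    Engine("Kokoro", False, False, 1),
    Engine("Qwen3-TTS", True, True, 10),
    Engine("Chatterbox", True, True, 23),
    Engine("IndexTTS-2", None, True, None),
]


def pick_engines(need_japanese=False, need_cloning=False):
    """Return the names of engines known to satisfy the requirements."""
    result = []
    for engine in ENGINES:
        if need_japanese and engine.japanese is not True:
            continue  # unknown support is treated as unsupported
        if need_cloning and not engine.cloning:
            continue
        result.append(engine.name)
    return result


if __name__ == "__main__":
    print(pick_engines(need_japanese=True, need_cloning=True))
```

For Japanese voice cloning, this narrows the choice to Qwen3-TTS and Chatterbox.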
Architecture
The app is built from three processes.
┌─────────────────┐
│   Flutter UI    │ ← desktop app or browser (:5173)
└────────┬────────┘
         │ REST API
┌────────▼────────┐
│ FastAPI Backend │ ← port 8000, 60+ endpoints
│    (Python)     │   TTS inference, voice management, audiobook generation
└────────┬────────┘
         │
┌────────▼────────┐
│   MCP Server    │ ← port 8010, 50+ tools
│                 │   controllable from Claude Code and similar clients
└─────────────────┘
- Backend: FastAPI (Python, about 8,500 lines). Handles engine wrappers, audio file management, and audiobook generation
- Frontend: Flutter/Dart (about 10,100 lines). Available as a desktop app or in the browser
- MCP Server: compatible with Model Context Protocol. Claude Code and other MCP clients can generate TTS and manage voices
- Database: SQLite for voice libraries and project data
The whole codebase is about 18,600 lines.
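Because the three processes listen on fixed ports, a quick way to see which of them are running is a plain TCP connection check. The sketch below uses only the port numbers given above. The host `127.0.0.1` is my assumption (a local install), and the check says nothing about the HTTP routes, which are not covered here.

```python
import socket

# Ports from the architecture description; the host is an assumption.
SERVICES = {
    "Flutter web UI": 5173,
    "FastAPI backend": 8000,
    "MCP server": 8010,
}


def port_open(port, host="127.0.0.1", timeout=0.5):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def service_status(services=SERVICES, host="127.0.0.1"):
    """Map each service name to whether its port is reachable."""
    return {name: port_open(port, host) for name, port in services.items()}


if __name__ == "__main__":
    for name, is_up in service_status().items():
        print(f"{name}: {'up' if is_up else 'down'}")
```

Port 5173 only matters when the UI is served in a browser, so it showing as down while the desktop app is in use would be expected.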
MCP server integration
The MimikaStudio MCP server exposes more than 50 tools, so Claude Code can call TTS generation and voice management directly.
Tool categories include:
- TTS generation on all engines
- Voice sample management, including upload, deletion, and preview
- Audiobook generation and progress monitoring
- System information and real-time monitoring
- Model status checks and downloads
The combination of AI coding tools and TTS is useful for workflows where you write a script and turn it into audio immediately.
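For a feel of what an MCP client sends on the wire: MCP messages are JSON-RPC 2.0, tools are discovered with the `tools/list` method, and a tool is invoked with `tools/call`. The sketch below only builds those request bodies. The tool name `tts_generate` and its arguments are hypothetical placeholders, since the actual tool names are not listed in this article, and a real client also performs an initialization handshake first (Claude Code handles all of this for you).

```python
import itertools
import json

_request_ids = itertools.count(1)  # JSON-RPC requests need unique ids


def jsonrpc_request(method, params=None):
    """Build a JSON-RPC 2.0 request object."""
    message = {"jsonrpc": "2.0", "id": next(_request_ids), "method": method}
    if params is not None:
        message["params"] = params
    return message


def list_tools_request():
    """Request asking the server which tools it exposes."""
    return jsonrpc_request("tools/list")


def call_tool_request(tool_name, arguments):
    """Request invoking one tool with a dict of arguments."""
    return jsonrpc_request("tools/call", {"name": tool_name, "arguments": arguments})


if __name__ == "__main__":
    # "tts_generate" and its arguments are hypothetical, for illustration only.
    request = call_tool_request("tts_generate", {"text": "Hello world"})
    print(json.dumps(request, indent=2))
```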
Hardware requirements
| Item | Requirement |
|---|---|
| OS | macOS 13+ (Ventura or later) |
| Chip | Apple Silicon (M1/M2/M3/M4) |
| RAM | 8GB minimum, 16GB recommended |
| Storage | 5-10GB for models (IndexTTS-2 alone is around 24GB) |
| Python | 3.10+ |
| Flutter | 3.x with desktop support enabled |
Intel Macs are not supported. The Windows code path has CUDA support, but no prebuilt binary is available yet.
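The table translates into a short pre-flight check. This is a sketch under my own assumptions, not something shipped with MimikaStudio: the function names are hypothetical, and it only covers the OS, chip, and Python rows, because RAM and free disk space need platform-specific calls.

```python
import platform
import sys


def parse_version(text):
    """Turn '13.4.1' into (13, 4, 1); an empty string becomes ()."""
    return tuple(int(part) for part in text.split(".") if part.isdigit())


def unmet_requirements(system, machine, mac_version, python_version):
    """Return a list of human-readable problems; empty means all checks pass."""
    problems = []
    if system != "Darwin":
        problems.append("macOS is required")
    elif parse_version(mac_version)[:1] < (13,):
        problems.append("macOS 13 (Ventura) or later is required")
    if machine != "arm64":
        problems.append("Apple Silicon (arm64) is required")
    if tuple(python_version[:2]) < (3, 10):
        problems.append("Python 3.10 or later is required")
    return problems


if __name__ == "__main__":
    issues = unmet_requirements(
        platform.system(),
        platform.machine(),
        platform.mac_ver()[0],
        sys.version_info,
    )
    print("OK" if not issues else "\n".join(issues))
```

One caveat: a Python interpreter running under Rosetta reports `x86_64` even on Apple Silicon, so a failed chip check may point at the interpreter rather than the hardware.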
Installation
git clone https://github.com/BoltzmannEntropy/MimikaStudio.git
cd MimikaStudio
./install.sh
install.sh checks and installs Homebrew, Python, espeak-ng, and ffmpeg, creates the venv, installs dependencies, initializes the SQLite database, and configures Flutter. The models are downloaded automatically on first use, for a total of about 3GB.
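If the script stops partway, it helps to know which of those tools are already on your PATH. Here is a small sketch for that. The command names (`brew`, `python3`, `flutter`, and so on) are my assumptions about how each tool is invoked, and this is not a reimplementation of install.sh.

```python
import shutil

# Assumed command names for the tools mentioned above.
REQUIRED_TOOLS = ["brew", "python3", "espeak-ng", "ffmpeg", "flutter"]


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All required tools found")
```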
Launching
source venv/bin/activate
# Desktop app (backend + MCP + Flutter)
./bin/mimikactl up
# Use in a web browser
./bin/mimikactl up --web
# -> http://127.0.0.1:5173
# Backend + MCP only (no GUI, API access only)
./bin/mimikactl up --no-flutter
CLI tools
You can also use it from the command line without the GUI.
# English TTS with Kokoro
./bin/mimika kokoro "Hello world" --voice bf_emma --output hello.wav
# Qwen3-TTS preset voice
./bin/mimika qwen3 "こんにちは" --speaker Ono_Anna --style "calmly"
# Qwen3-TTS voice cloning
./bin/mimika qwen3 "test" --clone --reference voice.wav
It also supports file input (TXT, PDF, EPUB, DOCX), which is handy for batch processing.
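For batch jobs, the CLI is easy to drive from a script. The sketch below turns a list of text lines into one Kokoro command per line, using only the `--voice` and `--output` options from the example above. The helper names and the `line_001.wav` naming scheme are hypothetical, and I have deliberately left out file input because the exact flag for it is not shown here.

```python
import subprocess
from pathlib import Path

MIMIKA = "./bin/mimika"  # path as used in the CLI examples


def kokoro_command(text, output, voice="bf_emma"):
    """Build the argument list for one Kokoro TTS call."""
    return [MIMIKA, "kokoro", text, "--voice", voice, "--output", str(output)]


def batch_commands(lines, out_dir, voice="bf_emma"):
    """Build one command per non-empty line, with numbered output files."""
    texts = [line.strip() for line in lines if line.strip()]
    return [
        kokoro_command(text, Path(out_dir) / f"line_{index:03d}.wav", voice)
        for index, text in enumerate(texts, start=1)
    ]


def run_all(commands):
    """Run the commands in order; stops at the first failure."""
    for command in commands:
        subprocess.run(command, check=True)


if __name__ == "__main__":
    # Only prints the commands; call run_all() to actually execute them.
    for command in batch_commands(["Hello world", "Second line"], "out"):
        print(command)
```

Passing an argument list to `subprocess.run` avoids shell quoting problems with text that contains quotes or non-ASCII characters.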
Compared with using each engine directly
If you only want Qwen3-TTS, pip install qwen-tts is enough. Chatterbox is also available through pip. So why use MimikaStudio?
Reasons to use MimikaStudio:
- It manages voice libraries across multiple engines, so cloned voices can be reused
- It can turn PDFs, EPUBs, and similar documents directly into audiobooks
- You can tweak parameters such as temperature, top_p, and top_k through the GUI
- It can connect to AI tools through the MCP server
- Model downloads are handled entirely in the GUI
Cases where direct engine use is better:
- You only use one engine
- You want to embed it into a Python script
- You want to build a custom pipeline
- You need to run it outside macOS
Licensing
The source code is licensed under BSL-1.1 (Business Source License 1.1). Personal and internal use is free, but BSL-1.1 is not an open-source license in the way MIT or Apache 2.0 are. The plan is to move to GPL-2.0 after a certain period. Binaries are covered by a separate license, the Mimika Binary Distribution License. Commercial use requires a separate agreement.
The licenses of the bundled engines are separate. Qwen3-TTS is Apache 2.0, and Chatterbox is MIT. BSL-1.1 applies only to MimikaStudio’s own code.