
MimikaStudio - a local TTS app that unifies multiple engines in one GUI

Ikesan

When you try to do local TTS, you often end up writing separate Python scripts for Qwen3-TTS, setting up a different environment for Chatterbox, and handling per-engine setup one by one. MimikaStudio is a project that pulls all of that into one GUI app, from voice cloning through audiobook creation.

What it can do

| Feature | Description |
| --- | --- |
| Voice cloning | Clone a voice from 3 seconds of reference audio; share voice libraries across engines |
| Text-to-speech | Generate TTS with preset or custom voices, including style instructions |
| PDF reading | Read aloud while highlighting each sentence |
| Audiobook creation | Convert PDF, EPUB, TXT, Markdown, and DOCX to WAV, MP3, or M4B with chapter markers |

Built-in engines

MimikaStudio includes four TTS engines.

Kokoro (82M parameters)

A lightweight and fast English TTS. It ships with 21 UK and US English voices. On Apple Silicon, it can run on the Metal GPU with latency below 200 ms. Japanese is not supported.

Qwen3-TTS (0.6B / 1.7B)

As I wrote in an earlier article, this is the open-source TTS developed by Alibaba’s Qwen team. In MimikaStudio, you can use both the Base model for voice cloning and the CustomVoice model for preset voices.

  • Voice cloning: clone a voice from 3 seconds of reference audio, with support for 10 languages
  • CustomVoice: 9 preset speakers such as Ryan, Aiden, Vivian, and Ono Anna
  • Style instructions: control emotion and delivery with prompts such as “whisper softly”

Chatterbox

A multilingual TTS developed by Resemble AI. It supports 23 languages, which makes it the broadest in the engine lineup.

Supported languages: Arabic, Chinese, Danish, English, Finnish, French, German, Greek, Hebrew, Hindi, Italian, Japanese, Korean, Malay, Dutch, Norwegian, Polish, Portuguese, Russian, Spanish, Swedish, Turkish, and Swahili

On Apple Silicon it runs on the CPU. Hebrew TTS requires the Dicta ONNX model, which is about 1.1GB and is downloaded automatically during installation.

IndexTTS-2

A zero-shot TTS model. It reports SOTA-level scores in WER, speaker similarity, and emotion expression. It is strong for use cases like dubbing video, where precise duration control matters. The model is large, at around 24GB.

Engine comparison

| Engine | Parameters | Japanese | Voice cloning | Languages | Strength |
| --- | --- | --- | --- | --- | --- |
| Kokoro | 82M | ✗ | ✗ | 1 (English only) | Fast and light |
| Qwen3-TTS | 0.6B / 1.7B | ✓ | ✓ | 10 | Balanced |
| Chatterbox | - | ✓ | ✓ | 23 | Multilingual |
| IndexTTS-2 | ~24GB | - | ✓ | - | High quality, large |

Architecture

The app is built from three processes.

┌─────────────────┐
│  Flutter UI      │ ← desktop app or browser (:5173)
└────────┬────────┘
         │ REST API
┌────────▼────────┐
│  FastAPI Backend │ ← port 8000, 60+ endpoints
│  (Python)        │    TTS inference, voice management, audiobook generation
└────────┬────────┘
         │
┌────────▼────────┐
│  MCP Server      │ ← port 8010, 50+ tools
│                  │    controllable from Claude Code and similar clients
└─────────────────┘
  • Backend: FastAPI (Python, about 8,500 lines). Handles engine wrappers, audio file management, and audiobook generation
  • Frontend: Flutter/Dart (about 10,100 lines). Available as a desktop app or in the browser
  • MCP Server: compatible with Model Context Protocol. Claude Code and other MCP clients can generate TTS and manage voices
  • Database: SQLite for voice libraries and project data

The whole codebase is about 18,600 lines.
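Since everything goes through the FastAPI backend, you can also script it directly over REST. Below is a minimal sketch; the endpoint path "/tts" and the payload fields are my assumptions for illustration, not confirmed routes — FastAPI serves interactive API docs, so check http://127.0.0.1:8000/docs for the real ones.

```python
import json
import urllib.request

# Base URL of the FastAPI backend (port 8000, per the architecture diagram).
BASE_URL = "http://127.0.0.1:8000"

def build_tts_request(text: str, engine: str = "kokoro",
                      voice: str = "bf_emma") -> urllib.request.Request:
    """Build a POST request for a hypothetical /tts endpoint.

    The route and field names are assumptions; consult the backend's
    OpenAPI docs for the actual schema.
    """
    payload = json.dumps({"text": text, "engine": engine, "voice": voice})
    return urllib.request.Request(
        f"{BASE_URL}/tts",
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_tts_request("Hello from the backend")
# urllib.request.urlopen(req) would then return the server's response
# (presumably the synthesized audio) once the backend is running.
```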

MCP server integration

The MimikaStudio MCP server exposes more than 50 tools, so Claude Code can call TTS generation and voice management directly.

Tool categories include:

  • TTS generation on all engines
  • Voice sample management, including upload, deletion, and preview
  • Audiobook generation and progress monitoring
  • System information and real-time monitoring
  • Model status checks and downloads

The combination of AI coding tools and TTS is useful for workflows where you write a script and turn it into audio immediately.
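To wire this up, you register the server with your MCP client. A sketch of a project-level .mcp.json entry for Claude Code, assuming the server speaks HTTP on port 8010 — the server name and the /mcp path are my placeholders, so check the project README for the actual transport and URL:

```json
{
  "mcpServers": {
    "mimikastudio": {
      "type": "http",
      "url": "http://127.0.0.1:8010/mcp"
    }
  }
}
```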

Hardware requirements

| Item | Requirement |
| --- | --- |
| OS | macOS 13+ (Ventura or later) |
| Chip | Apple Silicon (M1/M2/M3/M4) |
| RAM | 8GB minimum, 16GB recommended |
| Storage | 5-10GB for models |
| Python | 3.10+ |
| Flutter | 3.x with desktop support enabled |

Intel Macs are not supported. The Windows code path has CUDA support, but no prebuilt binary is available yet.

Installation

git clone https://github.com/BoltzmannEntropy/MimikaStudio.git
cd MimikaStudio
./install.sh

install.sh checks and installs Homebrew, Python, espeak-ng, and ffmpeg, creates the venv, installs dependencies, initializes the SQLite database, and configures Flutter. The models are downloaded automatically on first use, for a total of about 3GB.

Launching

source venv/bin/activate

# Desktop app (backend + MCP + Flutter)
./bin/mimikactl up

# Use in a web browser
./bin/mimikactl up --web
# -> http://127.0.0.1:5173

# Backend + MCP only (no GUI, API access only)
./bin/mimikactl up --no-flutter

CLI tools

You can also use it from the command line without the GUI.

# English TTS with Kokoro
./bin/mimika kokoro "Hello world" --voice bf_emma --output hello.wav

# Qwen3-TTS preset voice
./bin/mimika qwen3 "こんにちは" --speaker Ono_Anna --style "calmly"

# Qwen3-TTS voice cloning
./bin/mimika qwen3 "test" --clone --reference voice.wav

It also supports file input (TXT, PDF, EPUB, DOCX), which is handy for batch processing.
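For batch jobs, the CLI slots naturally into a small shell loop. A dry-run sketch, assuming you run it from the MimikaStudio root — the directory names and voice are placeholders, and the actual mimika call is left commented out so the loop can be tested without the app installed:

```shell
#!/bin/sh
# Plan one WAV per chapter text file (dry run).
mkdir -p chapters audio
printf 'Hello world\n' > chapters/ch1.txt   # stand-in input for the demo

for f in chapters/*.txt; do
  out="audio/$(basename "$f" .txt).wav"
  echo "plan: $f -> $out"
  # Uncomment to actually synthesize with the Kokoro engine:
  # ./bin/mimika kokoro "$(cat "$f")" --voice bf_emma --output "$out"
done
```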

Compared with using each engine directly

If you only want Qwen3-TTS, pip install qwen-tts is enough. Chatterbox is also available through pip. So why use MimikaStudio?

Reasons to use MimikaStudio:

  • It manages voice libraries across multiple engines, so cloned voices can be reused
  • It can turn PDFs, EPUBs, and similar documents directly into audiobooks
  • You can tweak parameters such as temperature, top_p, and top_k through the GUI
  • It can connect to AI tools through the MCP server
  • Model downloads are handled entirely in the GUI

Cases where direct engine use is better:

  • You only use one engine
  • You want to embed it into a Python script
  • You want to build a custom pipeline
  • You need to run it outside macOS

Licensing

The source code uses BSL-1.1 (Business Source License 1.1). Personal and internal use are free, but it is not fully open source like MIT or Apache 2.0. The plan is to move to GPL-2.0 after a certain period. Binary distribution has a separate license, the Mimika Binary Distribution License. Commercial use requires a separate agreement.

The licenses of the bundled engines are separate. Qwen3-TTS is Apache 2.0, and Chatterbox is MIT. BSL-1.1 applies only to MimikaStudio’s own code.