
MimikaStudio - a local TTS app that unifies multiple engines in one GUI

Ikesan

When you try to do local TTS, you often end up writing separate Python scripts for Qwen3-TTS, setting up a different environment for Chatterbox, and handling per-engine setup one by one. MimikaStudio is a project that pulls all of that into one GUI app, from voice cloning through audiobook creation.

What it can do

| Feature | Description |
| --- | --- |
| Voice cloning | Clone a voice from 3 seconds of reference audio; share voice libraries across engines |
| Text-to-speech | Generate TTS with preset or custom voices, including style instructions |
| PDF reading | Read aloud while highlighting each sentence |
| Audiobook creation | Convert PDF, EPUB, TXT, Markdown, and DOCX to WAV, MP3, or M4B with chapter markers |

Built-in engines

MimikaStudio includes four TTS engines.

Kokoro (82M parameters)

A lightweight and fast English TTS. It ships with 21 UK and US English voices. On Apple Silicon, it can run on the Metal GPU with latency below 200 ms. Japanese is not supported.

Qwen3-TTS (0.6B / 1.7B)

As I wrote in an earlier article, this is the open-source TTS developed by Alibaba’s Qwen team. In MimikaStudio, you can use both the Base model for voice cloning and the CustomVoice model for preset voices.

  • Voice cloning: clone a voice from 3 seconds of reference audio, with support for 10 languages
  • CustomVoice: 9 preset speakers such as Ryan, Aiden, Vivian, and Ono Anna
  • Style instructions: control emotion and delivery with prompts such as “whisper softly”

Chatterbox

A multilingual TTS developed by Resemble AI. It supports 23 languages, which makes it the broadest in the engine lineup.

Supported languages: Arabic, Chinese, Danish, English, Finnish, French, German, Greek, Hebrew, Hindi, Italian, Japanese, Korean, Malay, Dutch, Norwegian, Polish, Portuguese, Russian, Spanish, Swedish, Turkish, and Swahili

On Apple Silicon it runs on the CPU. Hebrew TTS requires the Dicta ONNX model, which is about 1.1GB and is downloaded automatically during installation.

IndexTTS-2

A zero-shot TTS model. It reports SOTA-level scores in WER, speaker similarity, and emotion expression. It is strong for use cases like dubbing video, where precise duration control matters. The model is large, at around 24GB.

Engine comparison

| Engine | Parameters | Japanese | Voice cloning | Languages | Strength |
| --- | --- | --- | --- | --- | --- |
| Kokoro | 82M | ✗ | ✗ | 1 (English only) | Fast and light |
| Qwen3-TTS | 0.6B / 1.7B | ✓ | ✓ | 10 | Balanced |
| Chatterbox | - | ✓ | ✓ | 23 | Multilingual |
| IndexTTS-2 | ~24GB | - | ✓ | - | High quality, large |

Architecture

The app is built from three processes.

┌─────────────────┐
│  Flutter UI      │ ← desktop app or browser (:5173)
└────────┬────────┘
         │ REST API
┌────────▼────────┐
│  FastAPI Backend │ ← port 8000, 60+ endpoints
│  (Python)        │    TTS inference, voice management, audiobook generation
└────────┬────────┘
         │
┌────────▼────────┐
│  MCP Server      │ ← port 8010, 50+ tools
│                  │    controllable from Claude Code and similar clients
└─────────────────┘
  • Backend: FastAPI (Python, about 8,500 lines). Handles engine wrappers, audio file management, and audiobook generation
  • Frontend: Flutter/Dart (about 10,100 lines). Available as a desktop app or in the browser
  • MCP Server: compatible with Model Context Protocol. Claude Code and other MCP clients can generate TTS and manage voices
  • Database: SQLite for voice libraries and project data

The whole codebase is about 18,600 lines.
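Since everything goes through the FastAPI backend, you can also script it directly over REST. Below is a minimal sketch; the endpoint path "/tts" and the payload fields are my assumptions for illustration, not confirmed routes — FastAPI serves interactive API docs, so check http://127.0.0.1:8000/docs for the real ones.

```python
import json
import urllib.request

# Base URL of the FastAPI backend (port 8000, per the architecture diagram).
BASE_URL = "http://127.0.0.1:8000"

def build_tts_request(text: str, engine: str = "kokoro",
                      voice: str = "bf_emma") -> urllib.request.Request:
    """Build a POST request for a hypothetical /tts endpoint.

    The route and field names are assumptions; consult the backend's
    OpenAPI docs for the actual schema.
    """
    payload = json.dumps({"text": text, "engine": engine, "voice": voice})
    return urllib.request.Request(
        f"{BASE_URL}/tts",
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_tts_request("Hello from the backend")
# urllib.request.urlopen(req) would then return the server's response
# (presumably the synthesized audio) once the backend is running.
```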

MCP server integration

The MimikaStudio MCP server exposes more than 50 tools, so Claude Code can call TTS generation and voice management directly.

Tool categories include:

  • TTS generation on all engines
  • Voice sample management, including upload, deletion, and preview
  • Audiobook generation and progress monitoring
  • System information and real-time monitoring
  • Model status checks and downloads

The combination of AI coding tools and TTS is useful for workflows where you write a script and turn it into audio immediately.
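To wire this up, you register the server with your MCP client. A sketch of a project-level .mcp.json entry for Claude Code, assuming the server speaks HTTP on port 8010 — the server name and the /mcp path are my placeholders, so check the project README for the actual transport and URL:

```json
{
  "mcpServers": {
    "mimikastudio": {
      "type": "http",
      "url": "http://127.0.0.1:8010/mcp"
    }
  }
}
```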

Hardware requirements

| Item | Requirement |
| --- | --- |
| OS | macOS 13+ (Ventura or later) |
| Chip | Apple Silicon (M1/M2/M3/M4) |
| RAM | 8GB minimum, 16GB recommended |
| Storage | 5-10GB for models |
| Python | 3.10+ |
| Flutter | 3.x with desktop support enabled |

Intel Macs are not supported. The Windows code path has CUDA support, but no prebuilt binary is available yet.

Installation

git clone https://github.com/BoltzmannEntropy/MimikaStudio.git
cd MimikaStudio
./install.sh

install.sh checks and installs Homebrew, Python, espeak-ng, and ffmpeg, creates the venv, installs dependencies, initializes the SQLite database, and configures Flutter. The models are downloaded automatically on first use, for a total of about 3GB.

Launching

source venv/bin/activate

# Desktop app (backend + MCP + Flutter)
./bin/mimikactl up

# Use in a web browser
./bin/mimikactl up --web
# -> http://127.0.0.1:5173

# Backend + MCP only (no GUI, API access only)
./bin/mimikactl up --no-flutter

CLI tools

You can also use it from the command line without the GUI.

# English TTS with Kokoro
./bin/mimika kokoro "Hello world" --voice bf_emma --output hello.wav

# Qwen3-TTS preset voice
./bin/mimika qwen3 "こんにちは" --speaker Ono_Anna --style "calmly"

# Qwen3-TTS voice cloning
./bin/mimika qwen3 "test" --clone --reference voice.wav

It also supports file input (TXT, PDF, EPUB, DOCX), which is handy for batch processing.
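For batch jobs, the CLI slots naturally into a small shell loop. A dry-run sketch, assuming you run it from the MimikaStudio root — the directory names and voice are placeholders, and the actual mimika call is left commented out so the loop can be tested without the app installed:

```shell
#!/bin/sh
# Plan one WAV per chapter text file (dry run).
mkdir -p chapters audio
printf 'Hello world\n' > chapters/ch1.txt   # stand-in input for the demo

for f in chapters/*.txt; do
  out="audio/$(basename "$f" .txt).wav"
  echo "plan: $f -> $out"
  # Uncomment to actually synthesize with the Kokoro engine:
  # ./bin/mimika kokoro "$(cat "$f")" --voice bf_emma --output "$out"
done
```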

Compared with using each engine directly

If you only want Qwen3-TTS, pip install qwen-tts is enough. Chatterbox is also available through pip. So why use MimikaStudio?

Reasons to use MimikaStudio:

  • It manages voice libraries across multiple engines, so cloned voices can be reused
  • It can turn PDFs, EPUBs, and similar documents directly into audiobooks
  • You can tweak parameters such as temperature, top_p, and top_k through the GUI
  • It can connect to AI tools through the MCP server
  • Model downloads are handled entirely in the GUI

Cases where direct engine use is better:

  • You only use one engine
  • You want to embed it into a Python script
  • You want to build a custom pipeline
  • You need to run it outside macOS

Licensing

The source code uses BSL-1.1 (Business Source License 1.1). Personal and internal use are free, but it is not fully open source like MIT or Apache 2.0. The plan is to move to GPL-2.0 after a certain period. Binary distribution has a separate license, the Mimika Binary Distribution License. Commercial use requires a separate agreement.

The licenses of the bundled engines are separate. Qwen3-TTS is Apache 2.0, and Chatterbox is MIT. BSL-1.1 applies only to MimikaStudio’s own code.