It’s finally here! We’ve open sourced both Unmute and Kyutai TTS, the text-to-speech model that gives Unmute its voice.
The response to Unmute has been amazing and we can’t wait to see what you can do with it. The ideas people have mentioned to us include personal voice assistants, accessibility tools, voice-based coding, and even digital dungeon masters.
Check out the shiny new project page to get started.
Kyutai TTS
Unmute’s character comes from our text-to-speech model, Kyutai TTS. It’s a natural and fast text-to-speech model based on the delayed streams modeling paradigm developed at Kyutai.
It can be batched efficiently, generating up to 32 simultaneous streams on a single L40S GPU, each faster than real time with a latency of 350 ms.
It comes with many other goodies such as support for French, word-level timestamps, streaming in text, efficient inference, and an MLX implementation.
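To illustrate what “streaming in text” with word-level timestamps means in practice, here is a toy sketch, not the actual Kyutai TTS API: text fragments arrive incrementally (say, from an LLM), and audio plus a timestamp is emitted for each word as soon as it is complete. The `synthesize_word` callback and the frame duration are hypothetical stand-ins for a real model call.

```python
def stream_tts(text_stream, synthesize_word, frame_duration=0.08):
    """Toy streaming TTS loop (illustrative only, not the Kyutai API).

    Consumes incoming text fragments and yields
    (word, start_time_seconds, n_audio_frames) as soon as each word
    is complete, so playback can begin before the full text is known.
    `synthesize_word` stands in for a real model call and returns the
    number of audio frames generated for the word.
    """
    buffer = ""
    t = 0.0
    for fragment in text_stream:
        buffer += fragment
        # Flush every completed word as soon as a space delimits it.
        while " " in buffer:
            word, buffer = buffer.split(" ", 1)
            if word:
                n_frames = synthesize_word(word)
                yield (word, round(t, 3), n_frames)
                t += n_frames * frame_duration
    # Flush whatever remains once the text stream ends.
    if buffer:
        yield (buffer, round(t, 3), synthesize_word(buffer))
```

The point of the sketch is the interface shape: because audio is produced word by word, the start time of each word falls out of the generation loop for free, which is how word-level timestamps pair naturally with streaming input.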
Kyutai TTS bases its voice on a 10-second sample. We provide a dataset of hundreds of voices drawn from datasets such as Expresso and VCTK. To prevent abuse, we do not release the voice cloning capabilities directly. If you’d like more voices, you can help by donating your voice or suggesting permissively-licensed voice datasets to add by opening an issue.
Modern text-to-speech models often provide no benchmark numbers at all. To fix this and aid future research, we quantitatively compare Kyutai TTS to other systems, setting the state of the art on word error rate in English as well as speaker similarity in French.
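For readers unfamiliar with the metric: word error rate (WER) is the standard way to score how intelligibly a TTS system renders its input, by transcribing the generated audio with a speech recognizer and comparing the transcript to the original text. A minimal sketch of the standard definition, word-level edit distance (substitutions + insertions + deletions) divided by reference length:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Standard WER: Levenshtein distance over words, normalized by
    the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # One-row dynamic-programming table for the edit distance.
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            cur = min(
                d[j] + 1,          # deletion
                d[j - 1] + 1,      # insertion
                prev + (r != h),   # substitution (free if words match)
            )
            prev, d[j] = d[j], cur
    return d[-1] / len(ref)
```

A perfect transcript scores 0.0; each extra, missing, or swapped word adds 1/len(reference).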

We’re excited to see what you can create with our growing collection of open-source models. Don’t hesitate to open an issue on the GitHub repo of Kyutai TTS or Unmute if something doesn’t work for you.