A text-to-speech model optimized for real-time usage.
Kyutai text-to-speech started as an internal tool we used during the development of Moshi. As part of our commitment to open science, we are now releasing an improved version of the TTS to the public: kyutai/tts-1.6b-en_fr, a 1.6B-parameter model. It contains several innovations that make it particularly useful for real-time usage.
There are many more voices available than the ones shown in the demo. Check out the voices repository.
To try out the TTS in an interactive real-time way, check out Unmute.
Kyutai TTS sets a new state of the art in text-to-speech. Word error rate (WER) measures how often the TTS fails to adhere to the script. Speaker similarity measures how close the generated audio sounds to the original sample when doing voice cloning.
We compare Kyutai TTS to other models on 15 news articles in English and 15 in French from NTREX. All models except Kyutai TTS and ElevenLabs generate sentence by sentence, since we observed that this yields the best results for them.
| Model | WER (EN) ↓ | Speaker sim. (EN, %) ↑ | WER (FR) ↓ | Speaker sim. (FR, %) ↑ |
| --- | --- | --- | --- | --- |
| Kyutai TTS | 2.82 | 77.1 | 3.29 | 78.7 |
| ElevenLabs Flash v2.5 | 4.05 | 48.3 | 4.64 | 53.1 |
| ElevenLabs Multilingual v2 | 3.10 | 57.4 | 2.75 | 63.8 |
| Chatterbox | 5.47 | 70.3 | - | - |
| Dia | 4.16 | 69.7 | 14.69 | 61.1 |
| CSM | 5.47 | 82.5 | - | - |
| Orpheus | 4.66 | 34.6 | - | - |
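As a rough illustration of the metric (not necessarily the exact evaluation pipeline used for the numbers above), WER can be computed by transcribing the generated audio with an ASR model and comparing the transcript against the original script, for instance with the jiwer package:

```python
# Illustrative only: WER compares an ASR transcript of the generated audio
# against the original script. The normalization here is an assumption,
# not the exact evaluation setup used for the table above.
import jiwer

script = "Kyutai releases a new text-to-speech model for English and French."
transcript = "kyutai released a new text to speech model for english and french"

def normalize(text: str) -> str:
    # Lowercase, split hyphens, and drop punctuation so formatting
    # differences don't count as errors.
    text = text.lower().replace("-", " ")
    return "".join(c for c in text if c.isalnum() or c.isspace())

# One substitution over 12 reference words -> WER of 1/12, about 8.3%.
wer = jiwer.wer(normalize(script), normalize(transcript))
print(f"WER: {wer:.2%}")
```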
Kyutai TTS doesn't need to know the whole text in advance and has a latency of 220ms from receiving the first text token to generating the first chunk of audio. On the Unmute.sh deployment, we use batching to serve up to 32 requests simultaneously and we observe a latency of 350ms on an L40S GPU.
Other TTS models described as streaming are only streaming in audio: you still need to know the full text in advance.
Kyutai TTS is the first text-to-speech model that is also streaming in text. You can pipe in text as it's being generated by an LLM and Kyutai TTS will already start processing it, leading to ultra-low latency.
This is especially useful when the LLM generation takes a long time, such as in low-resource environments or when generating long chunks of text.
Streaming in text is made possible by delayed streams modeling, described in more detail below.
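To make the pattern concrete, here is a minimal sketch of piping an LLM's output into the TTS as it is generated. The `TTSClient` and `fake_llm_stream` below are stand-ins for illustration, not the actual Kyutai API; the point is simply that words are forwarded the moment the LLM emits them, instead of waiting for the full answer.

```python
# Sketch of text-streaming usage with stand-in components.
import asyncio


async def fake_llm_stream():
    """Stand-in for an LLM that yields words as it generates them."""
    for word in "The answer to your question is forty-two .".split():
        await asyncio.sleep(0.05)  # simulated generation delay
        yield word


class TTSClient:
    """Stand-in TTS client: audio can start as soon as words arrive."""

    async def send_word(self, word: str) -> None:
        print(f"TTS received: {word!r}")

    async def flush(self) -> None:
        print("TTS: end of text, finishing audio")


async def main():
    tts = TTSClient()
    async for word in fake_llm_stream():
        # No need to wait for the full LLM answer before speaking.
        await tts.send_word(word)
    await tts.flush()


asyncio.run(main())
```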
To specify the voice, we use a 10-second audio sample. The TTS matches the voice, intonation, mannerisms, and recording quality of the source audio.
To ensure people's voices are only cloned consensually, we do not release the voice embedding model directly. Instead, we provide a repository of voices based on samples from datasets such as Expresso and VCTK. You can help us add more voices by anonymously donating your voice.
Most transformer-based TTS models are optimized for generating <30 seconds of audio, and either do not support longer generation at all or break down on longer samples. Kyutai TTS has no issues generating much longer audio:
Kyutai TTS ships with a robust Rust server that provides streaming access to the model over websockets. We use this server to run Unmute. On an L40S GPU, we can serve 16 simultaneous connections at a real-time factor of over 2x.
We provide a Dockerfile that you can use to reproducibly run the server without worrying about dependencies.
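As a sketch of what talking to such a server can look like from Python, here is a minimal websocket client using the `websockets` package. The endpoint URL and the message schema below are assumptions for illustration only; see the repository for the server's actual protocol.

```python
# Minimal websocket client sketch. The URL and JSON message schema are
# assumptions, not the server's actual protocol.
import asyncio
import json

import websockets


async def synthesize(words, url="ws://localhost:8080/api/tts_streaming"):
    audio_chunks = []
    async with websockets.connect(url) as ws:
        for word in words:
            # Stream the text in word by word, as it becomes available.
            await ws.send(json.dumps({"type": "text", "text": word}))
        await ws.send(json.dumps({"type": "eos"}))  # end of text stream

        async for message in ws:
            if isinstance(message, bytes):
                audio_chunks.append(message)  # raw audio frame
            else:
                event = json.loads(message)
                if event.get("type") == "end":
                    break
    return b"".join(audio_chunks)


audio = asyncio.run(synthesize("Hello from a streaming client .".split()))
print(f"received {len(audio)} bytes of audio")
```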
Along with the audio itself, Kyutai TTS outputs the exact timestamps of the words that it's generating. This can be useful for providing real-time subtitles for the TTS.
In Unmute, we use this function to handle cases when the user interrupts the AI. If you interrupt mid-way through an explanation to ask a follow-up question, Unmute will know exactly where it got interrupted and which part of the explanation still remains to be said later.
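A small sketch of the interruption logic this enables, assuming timestamps arrive as (word, start, end) tuples in seconds (the exact output format may differ):

```python
# Sketch of interruption handling with word timestamps. The
# (word, start, end) format is an assumption for illustration.

def split_at_interruption(timestamps, interrupted_at):
    """Split the script into what was actually said and what remains."""
    said, remaining = [], []
    for word, start, end in timestamps:
        if end <= interrupted_at:
            said.append(word)
        else:
            remaining.append(word)
    return " ".join(said), " ".join(remaining)


timestamps = [
    ("Photosynthesis", 0.00, 0.85),
    ("converts", 0.90, 1.30),
    ("light", 1.35, 1.60),
    ("into", 1.65, 1.85),
    ("chemical", 1.90, 2.40),
    ("energy", 2.45, 2.90),
]

said, remaining = split_at_interruption(timestamps, interrupted_at=1.5)
print("said:     ", said)       # Photosynthesis converts
print("remaining:", remaining)  # light into chemical energy
```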
Kyutai TTS supports English and French. We are exploring ideas on how to add support for more languages. Our LLM, Helium 1, already supports all 24 official languages of the EU.
Kyutai TTS' unique capabilities such as streaming in text, batching, and timestamps are enabled by a technique developed at Kyutai called delayed streams modeling, which we pioneered with Moshi.
The usual way of doing text-to-speech with a language model is to train on concatenations of the input text and the tokenized audio output:
That means that these models are streaming, but only in audio: given the full text, they start generating audio and allow you to access their partial results while they're running. However, the full text still needs to be known in advance.
In Kyutai TTS, we instead model the problem as a time-aligned stream of text and audio. Essentially, the audio and text are "next to" each other rather than after one another. We just delay the audio stream by a few frames to allow the text-to-speech some lookahead:
This means we can start streaming the audio output as soon as we know the first few text tokens, no matter how long the final text ends up being.
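Here is a toy sketch of what "delaying the audio stream by a few frames" looks like at the sequence level. The frame rate, delay, and token symbols are made up for illustration; the real model works on audio codec tokens with several codebooks per frame.

```python
# Toy illustration of delayed streams: text and audio run in parallel,
# frame by frame, with the audio shifted by a fixed delay so the model
# has a little lookahead on the text. Symbols and delay are made up.

PAD = "_"          # padding token used to time-align the text stream
DELAY = 2          # audio lags the text by this many frames

# Text stream already time-aligned to the audio frame rate (one entry per
# frame; PAD fills the frames where a word is still being pronounced).
text_stream = ["Hello", PAD, PAD, "world", PAD, PAD, PAD]
# One audio token per frame (in reality several codebooks per frame).
audio_stream = ["a0", "a1", "a2", "a3", "a4", "a5", "a6"]

# Shift the audio right by DELAY frames; at frame t the model sees
# text_stream[t] and predicts audio_stream[t - DELAY].
delayed_audio = [PAD] * DELAY + audio_stream

for t in range(len(text_stream) + DELAY):
    text = text_stream[t] if t < len(text_stream) else PAD
    audio = delayed_audio[t] if t < len(delayed_audio) else PAD
    print(f"frame {t}: text={text:<6} audio={audio}")
```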
On the input side, we receive a stream of words without timing information, but the model needs the text stream to be aligned to the audio using padding tokens. This is the role of the action stream: when it predicts [word], it means "I'm done pronouncing the current word, give me another one", at which point we feed the next word into the text stream.
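A toy sketch of that word-feeding loop at inference time (the model is mocked here, and the token names are made up): whenever the action stream predicts [word], the next word from the input is pushed into the text stream; otherwise a padding token is used.

```python
# Toy sketch of the action-stream loop at inference time. The "model" is
# mocked to emit [word] every few frames; in the real system this
# decision comes from the TTS model itself.
import itertools

PAD = "_"
WORD = "[word]"

def mock_action_stream():
    """Stand-in for the model's action predictions, one per frame."""
    # Ask for a new word on frames 0, 3, 6, ... just for illustration.
    for frame in itertools.count():
        yield WORD if frame % 3 == 0 else PAD

incoming_words = iter(["Streaming", "text", "is", "neat"])
text_stream = []

for frame, action in zip(range(12), mock_action_stream()):
    if action == WORD:
        # "I'm done pronouncing the current word, give me another one."
        text_stream.append(next(incoming_words, PAD))
    else:
        text_stream.append(PAD)

print(text_stream)
# ['Streaming', '_', '_', 'text', '_', '_', 'is', '_', '_', 'neat', '_', '_']
```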
Another neat property of this approach is its symmetry. We can get a speech-to-text model by delaying the text stream instead of the audio stream, keeping the audio fixed (teacher forcing), and predicting the text instead:
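Continuing the toy example above (same made-up symbols and delay), the symmetric setup simply swaps which stream is delayed and which one is given:

```python
# Toy illustration of the symmetric setup: for speech-to-text, the text
# stream is the one delayed, and the audio stream is the fixed input.
PAD = "_"
DELAY = 2

audio_stream = ["a0", "a1", "a2", "a3", "a4", "a5", "a6"]  # given (teacher-forced)
text_stream = ["Hello", PAD, PAD, "world", PAD, PAD, PAD]  # to be predicted

delayed_text = [PAD] * DELAY + text_stream
for t, audio in enumerate(audio_stream):
    print(f"frame {t}: audio={audio} -> predict text={delayed_text[t]}")
```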
We are working on a paper that will explain both models in full detail.
Kyutai STT, Kyutai TTS, and Unmute were created by Alexandre Défossez, Edouard Grave, Eugene Kharitonov, Laurent Mazare, Gabriel de Marmiesse, Emmanuel Orsini, Patrick Perez, Václav Volhejn, and Neil Zeghidour, with support from the rest of the Kyutai team.