Kyutai STT

A speech-to-text model optimized for real-time usage.

Kyutai STT is a streaming speech-to-text model architecture, providing an unmatched trade-off between latency and accuracy, perfect for interactive applications. Its support for batching allows for processing hundreds of concurrent conversations on a single GPU. We release two models: kyutai/stt-1b-en_fr, a low-latency model for English and French, and kyutai/stt-2.6b-en, a larger English-only model.

Try out kyutai/stt-1b-en_fr here:

Streaming and accurate

WER chart
Word error rate, lower is better.

Kyutai STT is a streaming model, meaning it transcribes audio in real time as it receives it rather than assuming that the entire audio is known in advance. This makes it well-suited for real-time applications such as Unmute.

It outputs well-formatted transcripts with punctuation, and comes with word-level timestamps as well.
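The exact output format depends on the implementation, but as a rough illustration (the field names below are made up for this example, not the actual API), a transcript with word-level timestamps might look like this:

```python
# Illustrative only: field names and structure are hypothetical, not the
# actual output of any of the Kyutai STT implementations.
transcript = {
    "text": "Hello, how are you today?",
    "words": [
        {"word": "Hello,", "start": 0.32, "end": 0.61},
        {"word": "how", "start": 0.74, "end": 0.85},
        {"word": "are", "start": 0.88, "end": 0.97},
        {"word": "you", "start": 1.00, "end": 1.14},
        {"word": "today?", "start": 1.20, "end": 1.55},
    ],
}
```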

In terms of accuracy, it still performs on par with state-of-the-art non-streaming models that have access to the whole audio at once.

Semantic voice activity detector

For cascaded voice chat applications like Unmute, you need a way to determine when the user is done talking so that a response can be generated.

The most common way of doing this is to have a separate voice activity detection (VAD) model that determines whether the user is currently speaking, and to wait for a fixed amount of silence after they stop before responding.

In practice, it's impossible to find a single waiting time that fits all cases. People often make long pauses in the middle of their sentences, and the naive approach mistakes these pauses for the end of the turn.

To solve this, Kyutai STT predicts not only the text but also the probability that the user is done talking. The delay for the pause prediction adapts based on the content and intonation of what the user is saying.

You can play around with this in the demo above. Look for "End of speech detected".

The semantic VAD is available in the Rust server but not yet in the other implementations.
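To make the contrast with a fixed-timeout VAD concrete, here is a minimal sketch of the two decision rules. The frame size, the threshold, and the pause-probability signal itself are assumptions for illustration, not the interface exposed by the Rust server:

```python
# Minimal sketch contrasting a fixed-timeout VAD with a semantic end-of-turn
# signal. The 80 ms frame size, the threshold, and the pause probabilities are
# illustrative assumptions, not the actual server interface.

FRAME_SECONDS = 0.080  # assumed duration of one audio frame

def naive_end_of_turn(speech_flags, silence_timeout_s=1.0):
    """Fixed-timeout baseline: declare end of turn after a fixed stretch of silence."""
    silence_s = 0.0
    for is_speech in speech_flags:  # one boolean per frame
        silence_s = 0.0 if is_speech else silence_s + FRAME_SECONDS
        if silence_s >= silence_timeout_s:
            return True
    return False

def semantic_end_of_turn(pause_probabilities, threshold=0.6):
    """Semantic variant: act as soon as the model itself is confident the user
    is done talking, so the effective wait adapts to content and intonation."""
    return any(p >= threshold for p in pause_probabilities)
```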

Low latency

The delay of kyutai/stt-1b-en_fr is set to 500ms, meaning words will be transcribed 500ms after they are said. For kyutai/stt-2.6b-en, the delay is 2.5 seconds.

In Unmute, we use what we call the "flush trick" to reduce the response latency further. Normally, once the voice activity detector predicts that the user is done talking, we would still have to wait an additional 500ms (the delay of the STT model) to make sure we don't cut off the end of the transcript.

To reduce this delay, we exploit the fact that the speech-to-text server is able to process audio faster than real-time. When we detect the end of speech, we ask the STT server to process the audio we've already sent as fast as it can. The server runs at around 4x real time, so it can process this audio in around 125ms = 500ms/4. This way, we "warp time" and we only have to wait these 125ms to be certain that we've transcribed everything.
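The arithmetic behind the flush trick, using the numbers quoted above:

```python
# Flush-trick latency, using the figures quoted above.
model_delay_s = 0.500     # kyutai/stt-1b-en_fr transcribes a word 500 ms after it is said
real_time_factor = 4.0    # the STT server processes audio roughly 4x faster than real time

# Without the trick: wait out the full model delay in real time.
wait_without_flush_s = model_delay_s                   # 0.500 s
# With the trick: the already-buffered 500 ms of audio is processed at ~4x speed.
wait_with_flush_s = model_delay_s / real_time_factor   # 0.125 s

print(f"extra wait without flush: {wait_without_flush_s * 1000:.0f} ms")
print(f"extra wait with flush:    {wait_with_flush_s * 1000:.0f} ms")
```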

High throughput

Kyutai STT is well-suited to production settings: on an H100, it can transcribe 400 real-time audio streams simultaneously.

This is thanks to the delayed streams modeling architecture (see below), which lets us run the model at a high batch size without needing any "glue code" on top of the model to make it stream.

Single-stream speech-to-text

For comparison, turning Whisper into a streaming model required a whole separate research project, Whisper-Streaming. The system repeatedly runs Whisper on the last few seconds of audio and stitches the overlapping transcripts together.

Whisper-Streaming is technically impressive but does not support batching, leading to a much lower throughput. For a lower target delay, its throughput decreases further, because it needs to re-run Whisper more often.

Implementations

We provide different implementations of Kyutai STT, depending on your use case. Instructions for running all of these are available on GitHub.

PyTorch: for research and tinkering

If you want to call the model from Python for research or experimentation, use our PyTorch implementation.

Rust: for production

If you want to serve Kyutai STT in a production setting, use the Rust implementation. This is what we use in Unmute.

Our robust Rust server provides streaming access to the model over websockets. See the delayed-streams-modeling repo for instructions on how to run it. We use this server to run Unmute; on an L40S GPU, we can serve 64 simultaneous connections at a real-time factor of 3x.
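As a rough sketch of what streaming over websockets looks like from the client side (the URL, audio format, and message handling below are placeholders, not the server's actual protocol; see the delayed-streams-modeling repo for the real client code):

```python
# Hypothetical websocket client sketch. The endpoint, audio encoding, and
# message format are placeholders; consult the delayed-streams-modeling repo
# for the actual protocol used by the Rust server.
import asyncio
import websockets  # pip install websockets

SERVER_URL = "ws://localhost:8080/api/asr-streaming"  # placeholder URL
CHUNK_BYTES = 3840  # 80 ms of 24 kHz 16-bit mono PCM (assumed format)

async def stream_file(path: str) -> None:
    async with websockets.connect(SERVER_URL) as ws:
        async def send_audio():
            with open(path, "rb") as f:
                while chunk := f.read(CHUNK_BYTES):
                    await ws.send(chunk)        # raw PCM chunk (placeholder format)
                    await asyncio.sleep(0.080)  # pace roughly at real time

        async def receive_text():
            async for message in ws:            # server pushes transcript updates
                print(message)                  # runs until the server closes the connection

        await asyncio.gather(send_audio(), receive_text())

if __name__ == "__main__":
    asyncio.run(stream_file("audio.pcm"))
```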

MLX: for on-device inference on iPhone and Mac

MLX is Apple's ML framework that allows you to use hardware acceleration on Apple silicon.

If you want to run the model on a Mac or an iPhone, choose the MLX implementation.

Delayed streams modeling

The main innovation of Kyutai STT is a technique developed at Kyutai called delayed streams modeling, which we pioneered with Moshi.

The usual way of doing speech-to-text with a language model is to feed it the whole audio at once as input and have it generate the text step by step (autoregressively). For instance, this is what Whisper does, using an encoder-decoder transformer.

Single-stream speech-to-text

In Kyutai STT, we instead represent the data as time-aligned streams of text and audio. Essentially, the audio and text are "next to" each other rather than one after the other. The text stream is padded so that the timing of the text stays aligned with the audio, and we delay the text stream by a few frames to give the speech-to-text model some lookahead.
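Here is a minimal sketch of this time-aligned representation. The frame rate, delay, and tokenization are illustrative assumptions rather than the model's actual configuration:

```python
# Sketch of the delayed-streams data layout: one text-token slot per audio
# frame, pad tokens wherever no word starts, and the whole text stream shifted
# right by a small delay to give the model lookahead. The 12.5 Hz frame rate
# and 6-frame delay are assumptions for illustration.

FRAME_RATE_HZ = 12.5   # assumed audio frame rate
DELAY_FRAMES = 6       # assumed text-stream delay ("a few frames" of lookahead)
PAD = "<pad>"

def align_text_stream(words, n_frames):
    """words: list of (word, start_time_in_seconds); returns one token per frame."""
    stream = [PAD] * n_frames
    for word, start_s in words:
        frame = int(start_s * FRAME_RATE_HZ)
        if frame < n_frames:
            stream[frame] = word
    # Shift the text stream right by the delay, keeping the total length fixed.
    return [PAD] * DELAY_FRAMES + stream[: n_frames - DELAY_FRAMES]

words = [("Hello,", 0.3), ("how", 0.8), ("are", 1.0), ("you?", 1.2)]
text_stream = align_text_stream(words, n_frames=25)  # 25 frames = 2 seconds of audio
# The audio stream (codec tokens) occupies the same 25 frame positions; during
# training the model learns to predict both streams, frame by frame.
```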

Kyutai STT diagram

We train Kyutai STT on text-audio data represented this way, teaching it to model both the audio stream and the text stream. During inference, we keep the audio stream fixed, feed in the input audio as we receive it, and use the model to predict the text stream.
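At inference time this becomes a simple per-frame loop: the audio stream is forced to the incoming audio and only the text stream is predicted. The codec and model calls below are hypothetical stand-ins, not the real interfaces:

```python
# Streaming inference sketch: the audio stream is fixed to the incoming audio
# and only the text stream is predicted, one frame at a time.
# `codec.encode_frame` and `model.predict_text_token` are hypothetical names,
# not the actual interfaces of the released implementations.

def transcribe_stream(audio_frames, model, codec, pad_token="<pad>"):
    for frame in audio_frames:                               # audio arrives in real time
        audio_tokens = codec.encode_frame(frame)             # audio stream: forced to the input
        text_token = model.predict_text_token(audio_tokens)  # text stream: predicted
        if text_token != pad_token:                          # pads fill frames with no new word
            yield text_token                                 # words appear ~the model delay after being said
```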

Another neat property of this approach is its symmetry. We can get a text-to-speech model by delaying the audio stream instead of the text stream, then keeping the text fixed (teacher forcing) and predicting the audio instead. A bit of trickery is required to allow the model to predict the padding tokens that align the timing of the text stream with the audio stream. We'll provide more details once we open-source the text-to-speech model.

Kyutai TTS diagram

We are working on a paper that will explain both models in full detail.

Credits

Kyutai STT, Kyutai TTS, and Unmute were created by Alexandre Défossez, Edouard Grave, Eugene Kharitonov, Laurent Mazare, Gabriel de Marmiesse, Emmanuel Orsini, Patrick Perez, Václav Volhejn, and Neil Zeghidour, with support from the rest of the Kyutai team.