Make any LLM listen and speak
using Kyutai's speech-to-text and text-to-speech.
A text LLM generates the response.
Works with any LLM you like.
The key strength of Unmute is its modularity. Since the LLM that generates the text is independent of the speech-to-text and text-to-speech that give it a voice, you can directly leverage all capabilities of your favorite language models, such as strong reasoning and connection to external tools. As LLMs improve, so will Unmute.
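This modularity can be pictured as three decoupled stages. The sketch below is purely illustrative — the function names and placeholder return values are assumptions, not Unmute's actual API — but it shows why swapping in a different LLM never touches the speech components:

```python
# Illustrative sketch of a cascaded voice pipeline: speech-to-text,
# then any text LLM, then text-to-speech. All names and return values
# here are hypothetical stand-ins, not Unmute's real interface.

def speech_to_text(audio: bytes) -> str:
    """Stand-in for an STT model: transcribe the user's speech."""
    return "What's the weather like today?"  # placeholder transcript

def generate_reply(prompt: str) -> str:
    """Stand-in for any text LLM; swap in your favorite model here."""
    return f"You asked: {prompt!r}. Let me check."

def text_to_speech(text: str) -> bytes:
    """Stand-in for a TTS model: synthesize audio for the reply."""
    return text.encode("utf-8")  # placeholder "audio"

def respond(audio_in: bytes) -> bytes:
    # Because the three stages only exchange text, the LLM in the
    # middle can be replaced without changing STT or TTS at all.
    transcript = speech_to_text(audio_in)
    reply = generate_reply(transcript)
    return text_to_speech(reply)
```

Replacing `generate_reply` with a call to a stronger model upgrades the whole system — the "as LLMs improve, so will Unmute" property falls directly out of this separation.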
In the demo at unmute.sh, we demonstrate a few examples of function calling: if the model decides to say "bye", Unmute will recognize it as a command and actually hang up the call. The "Dev (news)" character pulls up-to-date news from the internet via an API.
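The "bye" behavior can be sketched as a small dispatch step between the LLM and the TTS: scan the model's text reply for a farewell and end the call instead of speaking. This is a minimal sketch of the idea, assuming a keyword match — the names and matching rule are illustrative, not Unmute's real implementation:

```python
# Hedged sketch of the hang-up command described above: before the
# reply is synthesized, check whether the LLM decided to say goodbye.
# The function names and the regex are hypothetical, for illustration.
import re

FAREWELLS = re.compile(r"\b(bye|goodbye)\b", re.IGNORECASE)

def handle_reply(reply: str) -> str:
    """Decide what to do with an LLM reply: speak it, or hang up."""
    if FAREWELLS.search(reply):
        return "hang_up"  # treat the farewell as a command to end the call
    return "speak"        # otherwise, forward the text to TTS as usual
```

The same pattern generalizes: the "Dev (news)" character is the reverse direction, where a recognized intent triggers an external API call whose result is fed back into the LLM's context before it answers.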
These are simple examples, but they illustrate the point: being able to talk to an intelligent system with your voice is powerful. By putting this technology into the hands of all developers for free, we hope to enable new products and use cases that we could never have thought of ourselves. Here are some ideas.
Personal assistants: Deliver on the original promise of smart assistants such as Siri, Google Assistant, and Alexa.
Accessibility tools: Help people with disabilities to interact with technology using their voice.
Customer support: AI customer support hotline, without the frustrating latency of older chatbots.
Voice-based programming: Vibe code an app into existence without ever touching a keyboard.
Digital dungeon master: For tabletop RPGs or games like Werewolf.
Role-playing as historical characters: Would a digital Alan Turing pass his own test?
In 2024 we unveiled Moshi, the first audio-native model. While Moshi provides unmatched latency and naturalness, it doesn’t yet match the extended abilities of text models, such as function calling, stronger reasoning, and in-context learning. Unmute allows us to directly bring all of these from text to real-time voice conversations.
Unmute, like all cascaded systems, does have limitations: since the LLM "brain" communicates using text, it loses access to emotion, emphasis, intonation, hesitation, and tone of voice – important elements of human communication.
We strongly believe that the future of human-machine interaction lies in end-to-end speech models, coupled with customization and extended abilities. Stay tuned for what’s to come!
Kyutai STT, Kyutai TTS, and Unmute were created by Alexandre Défossez, Edouard Grave, Eugene Kharitonov, Laurent Mazare, Gabriel de Marmiesse, Emmanuel Orsini, Patrick Perez, Václav Volhejn, and Neil Zeghidour, with support from the rest of the Kyutai team.
Graphic design of unmute.sh was done by Kristýna Hrabánková (Instagram, LinkedIn).