
Meet Moshi, the first real-time voice AI

Author: Kyutai Team

We don’t speak like we write, and we don’t read like we listen. Every day, in our exchanges with one another, we experience the fundamental differences between written and oral communication: two complementary ways of using language. The former is often more precise, compact and efficient, but slower to produce and hardly interactive. The latter is the realm of spontaneity, speed, interactivity and emotion. Speech can hesitate, be cut off, picked up again and handed back, and it conveys a wealth of information beyond words. Equipping AI with genuine spoken abilities is therefore crucial, but technically challenging.

Today, we invite the world to discover Moshi, our experimental voice AI. Moshi is the lowest-latency conversational AI ever released. Moshi is not an assistant, but rather a prototype for advancing real-time interaction with machines. It can chit-chat, discuss facts and make recommendations, but its more groundbreaking abilities are its expressivity and spontaneity, which make it possible to engage in fun roleplay.

More fundamentally, Moshi is an audio language model that can listen and speak continuously, with no need to explicitly model speaker turns or interruptions. When talking to Moshi, you will notice that the UI displays a transcript of its speech. This transcript does not come from an ASR, nor is it the input to a TTS; it is part of Moshi’s integrated multimodal modeling.
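To make the idea of integrated multimodal modeling more concrete, here is a minimal, purely illustrative sketch of how a single autoregressive decoder could predict an audio-token stream and a time-aligned text-token stream together, so that the transcript emerges from the same model that produces the speech rather than from a separate ASR or TTS stage. Every name, dimension and the single audio codebook below are assumptions made for illustration, not a description of Moshi’s actual architecture.

```python
import torch
import torch.nn as nn

# Purely illustrative toy model, not Moshi's architecture.
# Hypothetical vocabulary sizes for one audio codebook and a text tokenizer.
AUDIO_VOCAB, TEXT_VOCAB, D_MODEL = 2048, 32000, 512


class JointAudioTextDecoder(nn.Module):
    """At each step, embed the previous audio and text tokens together, run a
    causal Transformer, and predict the next token of both streams at once."""

    def __init__(self) -> None:
        super().__init__()
        self.audio_emb = nn.Embedding(AUDIO_VOCAB, D_MODEL)
        self.text_emb = nn.Embedding(TEXT_VOCAB, D_MODEL)
        layer = nn.TransformerEncoderLayer(D_MODEL, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)
        self.audio_head = nn.Linear(D_MODEL, AUDIO_VOCAB)
        self.text_head = nn.Linear(D_MODEL, TEXT_VOCAB)

    def forward(self, audio_tokens: torch.Tensor, text_tokens: torch.Tensor):
        # audio_tokens, text_tokens: (batch, time) integer ids, time-aligned.
        x = self.audio_emb(audio_tokens) + self.text_emb(text_tokens)
        causal = nn.Transformer.generate_square_subsequent_mask(x.shape[1])
        h = self.backbone(x, mask=causal)
        # The same hidden state feeds both heads: the speech and its transcript
        # are produced jointly, step by step, by one model.
        return self.audio_head(h), self.text_head(h)


if __name__ == "__main__":
    model = JointAudioTextDecoder()
    audio = torch.randint(0, AUDIO_VOCAB, (1, 10))
    text = torch.randint(0, TEXT_VOCAB, (1, 10))
    audio_logits, text_logits = model(audio, text)
    print(audio_logits.shape, text_logits.shape)  # (1, 10, 2048) (1, 10, 32000)
```

In such a setup the on-screen transcript is simply the decoded text stream, generated in lockstep with the audio tokens rather than transcribed after the fact.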

Developing Moshi required significant contributions to audio codecs, multimodal LLMs, multimodal instruction-tuning and much more. We worked hard for six months to get there, and we will shortly share all of Moshi’s secrets, models and code. In the meantime, more information is already available here, along with a live demonstration on stage.

Photo credit: Agence Oblique - Cyril Marcilhacy