Today, we release several Moshi artifacts: a long technical report with all the details behind our model, weights for Moshi and its Mimi codec, and streaming inference code in PyTorch, Rust and MLX.
- Online demo: moshi.chat.
- Technical report: arXiv.
- GitHub repo: kyutai-labs/moshi.
- HuggingFace: kyutai/moshiko-pytorch-bf16.
Moshi is made of three main components: Helium, a 7B language model trained on 2.1T tokens; Mimi, a neural audio codec that models semantic and acoustic information; and a new multi-stream architecture that jointly models audio from the user and from Moshi on separate channels.
Mimi is a neural audio codec that improves over SoundStream and Encodec by jointly modeling semantic and acoustic information through distillation, inspired by SpeechTokenizer. Not only do its improved architecture and adversarial training make it outperform SpeechTokenizer, RVQGAN and SemantiCodec, but we also designed Mimi specifically to work with LLMs: it operates at 12.5 Hz and 1.1 kbps while being fully causal, thus providing ideal tokens for a streaming Transformer.
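As a quick sanity check on those numbers, 12.5 Hz and roughly 1.1 kbps are consistent with 8 residual codebooks of 2048 entries each (11 bits per codebook); note that the codebook count and size in the snippet below are our assumptions for illustration, not figures stated above.

```python
import math

# Back-of-the-envelope check of Mimi's bitrate (codebook layout assumed).
frame_rate_hz = 12.5     # token frames per second (stated above)
num_codebooks = 8        # assumed number of RVQ levels
codebook_size = 2048     # assumed entries per codebook -> 11 bits each

bits_per_frame = num_codebooks * math.log2(codebook_size)  # 8 * 11 = 88 bits
bitrate_kbps = frame_rate_hz * bits_per_frame / 1000       # 1.1 kbps

print(f"{bits_per_frame:.0f} bits/frame -> {bitrate_kbps:.2f} kbps")
```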
We then augment Helium with our variant of the RQ-Transformer, an architecture previously proposed for discrete image generation. It models the hierarchy of semantic and acoustic tokens without increasing the sequence length of Helium (the Temporal Transformer), by using a smaller Depth Transformer that generates all the tokens of a given timestep. As a result, we only need 12.5 passes through the 7B backbone per second of audio, which runs in real time on an L4 GPU or an M3 MacBook Pro! Combined with the token delay of MusicGen, this gives state-of-the-art audio language modeling performance.
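To make the hierarchy concrete, here is a toy PyTorch sketch of one generation step, using stand-in modules (a tiny Transformer encoder as the Temporal Transformer and a GRU as the Depth Transformer) and made-up dimensions; the split into 17 sub-streams (1 text + 8 Moshi audio + 8 user audio codebooks) is our assumption about the layout, and this is not the released implementation.

```python
import torch
import torch.nn as nn

# Toy stand-ins for the real components; all dimensions are illustrative only.
D_MODEL, VOCAB, NUM_STREAMS = 64, 2048, 17   # 1 text + 8 Moshi + 8 user streams (assumed)

temporal_tf = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(D_MODEL, nhead=4, batch_first=True), num_layers=2)
depth_tf = nn.GRU(D_MODEL, D_MODEL, batch_first=True)   # toy stand-in for the Depth Transformer
embed = nn.Embedding(VOCAB, D_MODEL)                    # shared toy vocabulary for all streams
head = nn.Linear(D_MODEL, VOCAB)

def generate_frame(history_emb):
    """One 12.5 Hz frame: a single pass through the (large) Temporal Transformer,
    then NUM_STREAMS cheap autoregressive steps along the depth axis."""
    # The big model runs once per frame, i.e. only 12.5 times per second of audio.
    context = temporal_tf(history_emb)[:, -1:]           # (B, 1, D_MODEL)
    tokens, inp, state = [], context, None
    for _ in range(NUM_STREAMS):
        out, state = depth_tf(inp, state)                # small model walks over the streams
        token = torch.distributions.Categorical(logits=head(out[:, -1])).sample()
        tokens.append(token)
        inp = embed(token).unsqueeze(1)                  # condition on the previous stream's token
    return torch.stack(tokens, dim=-1)                   # (B, NUM_STREAMS)

frame = generate_frame(torch.randn(1, 10, D_MODEL))      # 10 past frames of toy context
print(frame.shape)                                       # torch.Size([1, 17])
```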
Our main contributions to generative audio are the following. The first is multi-stream modeling, which stacks the tokens of Moshi and the user at each timestep so that we can model full-duplex conversational dynamics, with overlap, backchannelling, interruptions, etc. No speaker turns anymore! Our second contribution is the Inner Monologue, a new method that drastically improves the quality of generated speech. AudioLM introduced the concept of semantic tokens used as a prefix to acoustic ones; we go a step further by predicting time-aligned text for Moshi’s speech before its semantic tokens, as sketched below. This makes Moshi smarter, while remaining compatible with streaming and still being a speech-to-speech system that understands non-linguistic information!
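As a plain-Python illustration of how the streams line up within one 12.5 Hz frame (the 8-codebook split per audio stream is again an assumption, and token values are made up): the Inner Monologue text token comes first, then Moshi's semantic and acoustic tokens, then the user's audio tokens on their own channels.

```python
# Illustrative layout of one 12.5 Hz frame; values and codebook count are assumed.
def stack_frame(moshi_text, moshi_audio, user_audio):
    """moshi_audio / user_audio: [semantic, acoustic_1, ..., acoustic_7] (8 levels assumed)."""
    return [moshi_text, *moshi_audio, *user_audio]

frame = stack_frame(
    moshi_text=" hel",                                  # Inner Monologue text comes first
    moshi_audio=[412, 7, 93, 580, 12, 44, 901, 233],    # Moshi: semantic, then acoustic
    user_audio=[15, 300, 8, 71, 640, 2, 55, 129],       # user audio on separate channels
)
print(len(frame))   # 17 sub-streams modeled jointly at every timestep, no speaker turns
```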
An interesting byproduct of Inner Monologue is that by delaying the audio tokens by a few seconds relative to the text, we get a streaming TTS system, and by doing the opposite (delaying the text tokens) we get a streaming ASR with alignment!
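Schematically, switching between the two modes is just a per-stream shift; the sketch below uses a hypothetical padding helper and illustrative delay values, not the actual training setup.

```python
# Sketch of per-stream delays (pad token and delay values are illustrative).
PAD = -1

def apply_delays(streams, delays):
    """Right-shift each stream by its delay so the delayed modality is
    generated after the other one at the same wall-clock step."""
    length = max(len(s) + d for s, d in zip(streams, delays))
    return [[PAD] * d + list(s) + [PAD] * (length - d - len(s))
            for s, d in zip(streams, delays)]

text  = ["he", "llo", " the", "re"]
audio = [101, 102, 103, 104]        # a single codebook shown for brevity

# Delay the audio: text leads and audio follows -> streaming TTS.
print(apply_delays([text, audio], delays=[0, 2]))
# Delay the text: audio leads and text follows -> streaming ASR with alignment.
print(apply_delays([text, audio], delays=[2, 0]))
```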
After pretraining on large-scale audio, we create synthetic conversations with our own models: Helium writes scripts, which our multi-stream TTS then converts into full-duplex conversations like the one below. In total, we create 20k hours of data with varying recording conditions and accents for the user, while keeping the voice of Moshi constant. This makes Moshi robust to noisy environments while ensuring it stays in character.
We run thorough evaluations of Helium, Mimi and Moshi, covering quality, audio language modeling and spoken question answering, along with extensive safety and quantization analyses. Moshi strongly outperforms previously published models, while having the unique ability to model full-duplex conversations in a streaming fashion.
We release two Moshi models, adapted from our demo by replacing Moshi’s voice with artificially generated ones, one male and one female. We look forward to hearing what the community will build with them!