
MoshiVis — Teaching Moshi to Converse about Images

Amélie Royer*, Moritz Böhle*, Gabriel de Marmiesse, Laurent Mazaré, Neil Zeghidour, Alexandre Défossez, and Patrick Pérez.
*equal contribution.

Apache License 2.0

With Moshi, we paved a new road towards natural speech-to-speech interactions with large language models. Today, we introduce MoshiVis, an open-source Vision Speech Model (VSM) that retains Moshi's low latency and natural conversation skills while adding the ability to discuss visual inputs. To do so, MoshiVis augments Moshi with lightweight adaptation modules for visual inputs, trained with a data-efficient one-stage training pipeline. Here, we provide additional details and qualitative samples to accompany our preprint on MoshiVis.

Example interaction with MoshiVis — see below for more!


MoshiVis Overview

Our goal is to design an efficient and effective adaptation mechanism that allows Moshi to discuss images whilst maintaining its previous conversational capabilities and low latency. To achieve this, MoshiVis was designed as a perceptually augmented version of Moshi: we extend Moshi with lightweight cross-attention (CA) modules that infuse visual information from an off-the-shelf visual encoder into Moshi's stream of speech tokens. Specifically, the currently released model builds on the frozen vision encoder from PaliGemma2-3B-448.

To ensure that Moshi’s previous conversational abilities are not lost in the process, we also introduce a gating mechanism in the cross-attention modules, allowing the model to “turn off” the visual input stream at every layer; for an overview of MoshiVis, see the above model sketch.

In practice, for input images of size 448px (1024 tokens), MoshiVis only adds ~7ms of latency per inference step on a consumer-grade device (a Mac Mini with an M4 Pro chip), for a total of 55ms per inference step, thus staying well below the 80ms limit for real-time processing with our 12.5Hz Mimi codec. You can try this on our online demo starting today.
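To make the real-time budget explicit (the ~48ms per-step figure for Moshi alone is simply the stated 55ms total minus the ~7ms overhead):

\[
\frac{1}{12.5\,\mathrm{Hz}} = 80\,\mathrm{ms}\ \text{per step},
\qquad
\underbrace{\sim\!48\,\mathrm{ms}}_{\text{Moshi step}} + \underbrace{\sim\!7\,\mathrm{ms}}_{\text{CA overhead}} = 55\,\mathrm{ms} < 80\,\mathrm{ms}.
\]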

Example Interactions

Context Switch

From general to image-specific conversation
From image-specific to general conversation
Text reading and general knowledge
Providing background information

Visual Understanding

Providing a detailed description of the image
Contextualization

Speech Diversity

Speaking with a pirate voice
Whispering

Adaptation Modules

As discussed above, we employ cross-attention (CA) to augment Moshi with visual capabilities. To improve parameter and memory efficiency, the parameters of the cross-attention modules are also shared across all transformer blocks. At every layer \(k\), the CA module (see the sketch after this list):

1) Uses the output of the self-attention module at timestep t and layer k as the query.
2) Uses the output of a frozen, off-the-shelf image encoder to form the keys and values of the cross-attention mechanism.
3) Applies a learned gating mechanism that suppresses the output of the cross-attention module if the model deems it unnecessary.
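For intuition, here is a minimal PyTorch sketch of such a gated cross-attention adapter. It is not the released implementation: the module name, dimensions, and the sigmoid-gate parameterization are assumptions for illustration; only the overall structure (shared adapter, frozen inputs, learned gate added onto the residual stream) follows the description above.

```python
import torch
import torch.nn as nn

class GatedCrossAttention(nn.Module):
    """Minimal sketch of a gated cross-attention adapter.

    A single instance can be shared across all transformer blocks; the
    frozen speech transformer and image encoder are not part of this module.
    """

    def __init__(self, d_model: int, d_image: int, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(
            embed_dim=d_model, num_heads=n_heads,
            kdim=d_image, vdim=d_image, batch_first=True,
        )
        # Input-dependent gate; a negative initial bias keeps it (nearly)
        # closed, so the model starts out behaving like audio-only Moshi.
        self.gate = nn.Linear(d_model, d_model)
        nn.init.zeros_(self.gate.weight)
        nn.init.constant_(self.gate.bias, -4.0)

    def forward(self, x: torch.Tensor, image_tokens: torch.Tensor) -> torch.Tensor:
        # x:            (batch, seq, d_model)   output of the self-attention block
        # image_tokens: (batch, n_img, d_image) frozen image-encoder features
        ca_out, _ = self.attn(query=x, key=image_tokens, value=image_tokens)
        # The gate decides, per position and channel, how much visual
        # information is injected into the speech-token stream.
        g = torch.sigmoid(self.gate(x))
        return x + g * ca_out


if __name__ == "__main__":
    adapter = GatedCrossAttention(d_model=512, d_image=1152)
    speech = torch.randn(2, 16, 512)     # hidden states for 16 timesteps
    image = torch.randn(2, 1024, 1152)   # e.g. 1024 image tokens
    print(adapter(speech, image).shape)  # torch.Size([2, 16, 512])
```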

Learnable Gating: Highlighting image-relevant tokens

MoshiVis Text Stream
Hey, how's your day? I doing well. How about you? That sounds interesting. What exactly is a language model? Sure, my name is Moshi. I was created by Kyutai, a non profit research lab based in Paris. image shows a framed artwork hanging on a white wall, featuring a blue circle with three black dots at its center and two more at the bottom, all within a white border, and the text "2023" and "Kyutai" at the top left. Based on the text 'Kyutai' and '2023', it seems like this poster is promoting an event or organization. The blue circle with three black dots could symbolize something significant, perhaps a launch or a product reveal. The additional two black circles at the bottom might indicate other events or projects related to Kyutai. I' not sure about specific projects, but I know that Kyutai is working on developing large multimodal models for AI applications. You welcome.

In the above example, we illustrate how the gating mechanism yields higher values (orange) on image-relevant tokens, whereas tokens with a small gating value (green) tend to contain more generic textual information.

Training MoshiVis

For this project, we placed a particular focus on developing a low-cost finetuning recipe that makes training models like MoshiVis more easily accessible to the community. In particular, we focused on 1) curating high-quality, diverse, and balanced training data for “visual conversations”, 2) limiting the amount of audio data required for training MoshiVis, and 3) ensuring an efficient training and inference pipeline that caters to low-budget environments, in particular by limiting the number of trainable parameters and the additional compute required at inference time.

1) Curating high-quality data. We develop a fully synthetic data pipeline to create realistic and diverse multi-turn conversations about images based on high-quality image-caption pairs; specifically, we primarily rely on highly detailed, human-curated image captions (DOCCI and PixMo). To ensure diversity in the generated visual conversations, we have two instances of Mistral Nemo with different (and dynamically changing) instructions “talk to each other” for multiple turns. For a detailed description of our synthetic data generation pipeline for visual dialogues, please see our arXiv paper (Appendix E).
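As a rough illustration of this two-speaker setup (this is not our actual pipeline; the prompts, the `chat` helper, and the turn structure below are made-up placeholders), the generation loop can be as simple as alternating between two differently instructed LLM instances, where in this sketch only the answering speaker sees the caption:

```python
import random

def chat(system_prompt: str, history: list[dict]) -> str:
    """Stand-in for a call to an instruction-tuned LLM through any
    chat-completion-style API; returns the next utterance."""
    raise NotImplementedError

def synthesize_dialogue(caption: str, n_turns: int = 6, seed: int = 0) -> list[dict]:
    rng = random.Random(seed)
    # Two differently instructed "speakers": one asks about the image,
    # one answers grounded in the (hidden) human-written caption.
    asker_styles = [
        "You are curious and ask short, casual questions about an image.",
        "You are chatty and drift to general topics before returning to the image.",
    ]
    answerer_prompt = (
        "You can see an image described as follows:\n"
        f"{caption}\n"
        "Answer the other speaker conversationally, grounded in this description."
    )

    history: list[dict] = []
    for turn in range(n_turns):
        if turn % 2 == 0:
            # The asker's instructions change dynamically between turns
            # to keep the generated conversations diverse.
            utterance = chat(rng.choice(asker_styles), history)
            history.append({"role": "user", "content": utterance})
        else:
            utterance = chat(answerer_prompt, history)
            history.append({"role": "assistant", "content": utterance})
    return history
```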

2) Limiting audio data. We explore how to additionally leverage existing text data about images to train Moshi to speak about them in audio form. Interestingly, we find that the “inner monologue” that is used as a scaffolding for Moshi’s thought process lends itself very well to this: specifically, we show that it is possible to train the above-mentioned cross-attention mechanism on such “speechless” data, despite the fact that it follows a different distribution than the speech data.
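One way to picture this: for “speechless” samples, only the inner-monologue text tokens provide supervision, so the loss on the audio-token streams is simply masked out. A minimal sketch under that assumption (the tensor layout and names are illustrative, not Moshi's actual training code):

```python
import torch
import torch.nn.functional as F

def multistream_loss(text_logits, text_targets, audio_logits, audio_targets, has_audio):
    """Cross-entropy over the text stream plus, where available, the audio streams.

    text_logits:   (B, T, V_text)        audio_logits:  (B, T, S, V_audio)
    text_targets:  (B, T)                audio_targets: (B, T, S)
    has_audio:     (B,) bool, False for "speechless" (text-only) samples.
    """
    # The text stream is always supervised, for speechless and audio samples alike.
    text_loss = F.cross_entropy(text_logits.flatten(0, 1), text_targets.flatten())

    # The audio streams are only supervised where real audio tokens exist.
    per_token = F.cross_entropy(
        audio_logits.flatten(0, 2), audio_targets.flatten(), reduction="none"
    ).view(audio_targets.shape)
    mask = has_audio.view(-1, 1, 1).float().expand_as(per_token)
    audio_loss = (per_token * mask).sum() / mask.sum().clamp(min=1.0)

    return text_loss + audio_loss


if __name__ == "__main__":
    B, T, S, Vt, Va = 2, 8, 8, 100, 2048
    loss = multistream_loss(
        torch.randn(B, T, Vt), torch.randint(Vt, (B, T)),
        torch.randn(B, T, S, Va), torch.randint(Va, (B, T, S)),
        has_audio=torch.tensor([True, False]),  # second sample is speechless
    )
    print(loss.item())
```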

3) Efficient training. We opt for a parameter-efficient fine-tuning approach in which we keep the pre-trained, off-the-shelf vision encoder (usually a PaliGemma-based vision encoder) as well as the speech transformer (Moshi) frozen during training, and only learn a cross-attention based adaptation module. At inference time, the keys and values in the cross-attention mechanism—which are based on the input image—can be cached and need to be computed only once, since the image embedding is independent of the conversation.
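A sketch of this setup with placeholder modules (the real vision encoder, speech transformer, and adapter are swapped for toy `nn` layers, so all names and dimensions here are assumptions): only the adapter receives gradients, and the image tokens are computed once per conversation and then reused at every inference step.

```python
import torch
import torch.nn as nn

d_model, d_image = 512, 1152

# Toy stand-ins for the frozen components and the trainable adapter.
vision_encoder = nn.Linear(768, d_image)  # stands in for the frozen image encoder
speech_backbone = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
adapter = nn.MultiheadAttention(d_model, 8, kdim=d_image, vdim=d_image, batch_first=True)

# Freeze everything except the cross-attention adapter.
for frozen in (vision_encoder, speech_backbone):
    frozen.eval()
    for p in frozen.parameters():
        p.requires_grad_(False)

optimizer = torch.optim.AdamW(adapter.parameters(), lr=1e-4)

# At inference time the image is embedded once per conversation...
with torch.no_grad():
    image_tokens = vision_encoder(torch.randn(1, 1024, 768))
# ...and reused at every streaming step, where only the speech hidden state
# x_t changes (in practice the projected keys/values themselves can be cached).
for step in range(3):
    x_t = torch.randn(1, 1, d_model)
    ca_out, _ = adapter(query=x_t, key=image_tokens, value=image_tokens)
```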

Evaluations and Results

Downstream Evaluation

To evaluate how well knowledge transfers from the text-only, speechless data samples to the speech modality in practice, we train MoshiVis on several downstream vision understanding benchmarks with varying amounts of audio data. Specifically, we use OCR-VQA for reading, VQAv2 for detailed visual question answering, and COCO captioning for general image understanding.

We find that even with very little audio data present at training time (\(\leq\)1%), MoshiVis achieves surprisingly strong results when prompted in audio form. With more audio (e.g. 25% of samples), MoshiVis reaches performance comparable to finetuned PaliGemma when prompted in audio, despite the fact that the core LLM backbone and image encoder are kept frozen while training the adaptation modules.

This result has important implications. First, it allows us to train a VSM in a data- and compute-efficient manner: text-only data 1) allows for faster training, since such samples take less time to load and preprocess, 2) significantly reduces the cost of upstream text-to-speech conversion, and 3) requires less storage space. Second, these results highlight an interesting benefit of the “Inner Monologue” text stream inherent to Moshi: while in this project we only investigate the text-to-speech transfer in the context of vision-language alignment, the underlying principle extends well beyond the vision modality, and thus opens a promising path for efficiently adapting Moshi to different downstream tasks, even when task-specific speech data is scarce. We are excited to further explore this aspect in the future and are looking forward to community contributions that leverage it!

Conversational MoshiVis on Academic Benchmarks

In addition to assessing the downstream performance of finetuned MoshiVis models, we also evaluate the conversational model itself on those benchmarks. While we generally observe lower quantitative scores, this can in part be attributed to the model's increased verbosity, as illustrated below by the comparison between MoshiVis and a model trained only for COCO captioning: the details added by the conversational model are highlighted in orange.

In fact, more detailed and descriptive outputs often lead to lower scores under automated metrics like CIDEr, which favor conciseness to match the original COCO ground truths. This highlights an interesting trade-off: high benchmark scores favor the usual one-question, one-answer paradigm with sometimes rigid output formatting, whereas richer descriptions are a better fit for a natural conversational flow.

Input image: Two teddy bears
MoshiVis (full training): two teddy bears in a store, one in a blue Hawaiian shirt with a brown ribbon, the other in a brown shirt with a blue ribbon
MoshiVis (COCO only): Two teddy bears are on display in a store

Input image: Bunch of bananas
MoshiVis (full training): a close-up of a bunch of bananas, with a hand reaching in to pick one, and a blue sticker on one of them
MoshiVis (COCO only): A bunch of bananas that are in a bin

Input image: Boy playing baseball
MoshiVis (full training): a young boy in a baseball uniform, mid-action, with a baseball glove on his right hand
MoshiVis (COCO only): A young boy in a field of grass holding a catchers mitt

Impact of Speechless Data on Audio Quality

As discussed above, we can leverage “speechless” data to teach task-specific skills such as captioning, reading, and general visual question answering to MoshiVis. However, using no audio samples at all (0%) during training leads to incoherent generated speech, as there is no supervision for the output audio tokens. Nevertheless, we find that the quality of the generated speech sharply increases when adding even relatively few audio samples during training (e.g. 5%), as illustrated by the audio samples below.

Audio samples for an example input image. The samples are “unprompted”, i.e. the model speaks freely about the image.

Here, we show a comparison between full audio training on COCO captioning (Ref: 100% Audio) and training with differing amounts of audio data (Trained with 0%, 5%, 10%, 25% audio data).
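A simple way to realize such a data mix (purely illustrative; the sampling scheme and names are assumptions, not our training code) is to draw from the audio and speechless pools with a fixed probability:

```python
import random

def mixed_samples(audio_pool, speechless_pool, audio_fraction=0.05, seed=0):
    """Yield training samples with a target fraction of real audio data.

    With audio_fraction=0.0 the audio streams receive no supervision at all
    (incoherent speech); even a small fraction such as 0.05 restores it.
    """
    rng = random.Random(seed)
    while True:
        if rng.random() < audio_fraction:
            yield rng.choice(audio_pool)       # speech + aligned text stream
        else:
            yield rng.choice(speechless_pool)  # text-only "inner monologue"
```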

Real-time Streaming and Latency

Despite incorporating an additional input modality, MoshiVis maintains real-time latency on consumer-grade hardware, with processing times staying well below the 80ms threshold per inference step (12.5 Hz codec). Comparing the time per inference step of the original Moshi model and MoshiVis on a Mac Mini with an M4 Pro chip, MoshiVis adds only ~7ms of latency per inference step, easily allowing for real-time interactions throughout a 5-minute conversation.

This project is funded by Iliad Group, CMA CGM Group and Schmidt Sciences.