With Moshi, we paved a new road towards natural speech-to-speech interactions with large language models. Today, we introduce MoshiVis, an open-source Vision Speech Model (VSM) with the same low-latency and natural conversation skills as Moshi, with the additional ability to discuss visual inputs. To do so, MoshiVis augments Moshi with lightweight adaptation modules for visual inputs which are trained with a data-efficient one-stage training pipeline. Here, we provide additional details and qualitative samples to accompany our preprint on MoshiVis.

Takeaways:
- MoshiVis is the first real-time, speech-to-speech VSM, allowing users to seamlessly discuss images with Moshi. MoshiVis is obtained by training 206M additional parameters on top of a frozen Moshi, preserving its low latency and natural conversation abilities.
- Given the scarcity of image-aligned speech data, we place a particular focus on data-efficient training for MoshiVis. In particular, we show that it is still possible to leverage “speechless” image-text data to teach Moshi to talk about images.
- To foster future research on VSMs, we not only release our preprint on how we trained MoshiVis, but also the visual speech benchmarks used for evaluation, the inference code for PyTorch, Apple’s MLX, and Rust, and the model weights.
MoshiVis Overview
Our goal is to design an efficient and effective adaptation mechanism that allows Moshi to discuss images whilst maintaining its previous conversational capabilities and low latency. To achieve this, MoshiVis was designed as a perceptually augmented version of Moshi: we extend Moshi with lightweight cross-attention (CA) modules that infuse the visual information from an off-the-shelf vision encoder into Moshi's stream of speech tokens. Specifically, the currently released model builds on the frozen vision encoder from PaliGemma2-3B-448.
To ensure that Moshi’s previous conversational abilities are not lost in the process, we also introduce a gating mechanism in the cross-attention modules, allowing the model to “turn off” the visual input stream at every layer; for an overview of MoshiVis, see the above model sketch.
In practice, for input images of size 448px (1024 tokens), MoshiVis only adds ~7ms of latency per inference step on a consumer-grade device (a Mac Mini with an M4 Pro chip), for a total of 55ms per inference step. This stays well below the 80ms real-time limit of our 12.5Hz Mimi codec (at 12.5Hz, each frame spans 1/12.5s = 80ms). You can try this in our online demo starting today.
Example Interactions
Context Switch




Visual Understanding


Speech Diversity


Adaptation Modules
As discussed above, we employ cross-attention (CA) to augment Moshi with visual capabilities. To improve parameter and memory efficiency, the parameters of the cross-attention modules are shared across all transformer blocks. At every layer k, the CA module:
1) Uses the output of the self-attention module at timestep t and layer k as the query.
2) Uses the output of a frozen, off-the-shelf image encoder to form the keys and values of the cross-attention mechanism.
3) Applies a learned gating mechanism to suppress the output of the cross-attention module if deemed unnecessary by the model.
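To make this concrete, here is a minimal PyTorch sketch of such a gated cross-attention adapter. It illustrates the pattern rather than our released implementation: the module layout, dimensions, and the per-token sigmoid gate predicted from the speech stream are simplifying assumptions (see the paper for the exact design).

```python
import torch
import torch.nn as nn

class GatedCrossAttention(nn.Module):
    """Minimal sketch of a gated cross-attention adapter (illustrative only).

    In MoshiVis, a module of this kind is inserted at every transformer block
    of the frozen speech model, with its parameters shared across blocks; the
    exact gating function may differ from the one assumed here.
    """

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Per-token scalar gate predicted from the speech stream (assumption).
        self.gate = nn.Linear(d_model, 1)

    def forward(self, speech_hidden: torch.Tensor, image_tokens: torch.Tensor) -> torch.Tensor:
        # speech_hidden: (B, T_speech, d_model), output of the frozen
        #   self-attention block, used as queries.
        # image_tokens: (B, T_image, d_model), output of the frozen image
        #   encoder, used as keys and values.
        attn_out, _ = self.cross_attn(speech_hidden, image_tokens, image_tokens)
        g = torch.sigmoid(self.gate(speech_hidden))  # (B, T_speech, 1), in [0, 1]
        # A gate close to 0 effectively "turns off" the visual stream for that
        # token; a gate close to 1 lets visual information flow in.
        return speech_hidden + g * attn_out
```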

MoshiVis Text Stream
Hey, how's your day?
I'm doing well. How about you?
That sounds interesting. What exactly is a language model?
Sure, my name is Moshi. I was created by Kyutai, a non-profit research lab based in Paris.
The image shows a framed artwork hanging on a white wall, featuring a blue circle with three black dots at its center and two more at the bottom, all within a white border, and the text "2023" and "Kyutai" at the top left.
Based on the text 'Kyutai' and '2023', it seems like this poster is promoting an event or organization. The blue circle with three black dots could symbolize something significant, perhaps a launch or a product reveal. The additional two black circles at the bottom might indicate other events or projects related to Kyutai.
I'm not sure about specific projects, but I know that Kyutai is working on developing large multimodal models for AI applications.
You're welcome.
In the above example, we illustrate how the gating mechanism yields higher values (orange) on image-relevant tokens, whereas tokens with a small gating value (green) tend to contain more generic textual information.
Training MoshiVis
For this project, we placed a particular focus on developing a low-cost finetuning recipe that makes training models like MoshiVis more easily accessible to the community. In particular, we focused on 1) curating high-quality, diverse, and balanced training data for “visual conversations”, 2) limiting the amount of audio data required for training MoshiVis, and 3) ensuring an efficient training and inference pipeline that caters to low-budget environments, in particular by limiting the number of trainable parameters and the additional compute required at inference time.
1) Curating high-quality data. We develop a fully synthetic data pipeline to create realistic and diverse multi-turn conversations about images based on high-quality image-caption pairs; specifically, we primarily rely on highly detailed, human-curated image captions (DOCCI and PixMo). To ensure diversity in the generated visual conversations, we have two instances of Mistral Nemo models with different (and dynamically changing) instructions “talk to each other” for multiple turns. For a detailed description of our synthetic data generation pipeline for visual dialogues, please see our arXiv paper (Appendix E).
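As a rough illustration of this idea (not our actual pipeline; the prompts, the `chat` helper, and the turn count below are hypothetical placeholders), two instruction-following models can be alternated as a “questioner” and an “answerer” that both condition on the same image caption:

```python
def generate_visual_dialogue(caption: str, chat, n_turns: int = 4) -> list[dict]:
    """Sketch of a two-agent synthetic dialogue about one image caption.

    `chat(system_prompt, history)` is a hypothetical helper that queries an
    instruction-tuned LLM (e.g. Mistral Nemo) and returns its next message;
    the prompts are illustrative, not the ones used to train MoshiVis.
    """
    questioner_prompt = (
        "You are chatting about an image described as follows: "
        f"{caption}\nAsk natural, varied questions about it."
    )
    answerer_prompt = (
        "You can see an image described as follows: "
        f"{caption}\nAnswer the questions conversationally and concisely."
    )
    history: list[dict] = []
    for _ in range(n_turns):
        question = chat(questioner_prompt, history)
        history.append({"role": "user", "content": question})
        answer = chat(answerer_prompt, history)
        history.append({"role": "assistant", "content": answer})
    return history
```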
2) Limiting audio data. We explore how to additionally leverage existing text data about images to train Moshi to speak about them in audio form. Interestingly, we find that the “inner monologue” used as a scaffolding for Moshi’s thought process lends itself very well to this: specifically, we show that it is possible to train the above-mentioned cross-attention mechanism on such “speechless” data, even though its distribution differs from that of the speech data.
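One way to picture this: during training, “speechless” samples supervise only the text (inner monologue) stream, while the audio token streams are supervised only for samples that actually come with audio. The sketch below conveys that idea with a single audio stream and simple loss masking; it is our simplified reading of the setup, not the released training code.

```python
import torch
import torch.nn.functional as F

def mixed_modality_loss(text_logits, text_targets,
                        audio_logits, audio_targets,
                        has_audio) -> torch.Tensor:
    """Sketch: always supervise the text stream, and supervise the audio
    stream only for samples that have audio (illustrative simplification:
    Moshi has several audio codebook streams, collapsed into one here).

    text_logits:  (B, T, V_text),  text_targets:  (B, T)
    audio_logits: (B, T, V_audio), audio_targets: (B, T)
    has_audio:    (B,) boolean mask, False for "speechless" samples
    """
    text_loss = F.cross_entropy(
        text_logits.transpose(1, 2), text_targets, reduction="none"
    ).mean(dim=1)                                    # per-sample, shape (B,)
    audio_loss = F.cross_entropy(
        audio_logits.transpose(1, 2), audio_targets, reduction="none"
    ).mean(dim=1)                                    # per-sample, shape (B,)
    # Zero out the audio loss where no audio supervision exists.
    audio_loss = audio_loss * has_audio.float()
    return (text_loss + audio_loss).mean()
```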
3) Efficient training. We opt for a parameter-efficient finetuning approach in which we keep the pre-trained, off-the-shelf vision encoder (usually a PaliGemma-based vision encoder) as well as the speech transformer (Moshi) frozen during training, and only learn a cross-attention-based adaptation module. At inference time, the keys and values of the cross-attention mechanism, which are derived from the input image, can be cached and need to be computed only once, since the image embedding is independent of the conversation.
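To illustrate the caching point, the following sketch (with hypothetical names; the released inference code is organized differently) projects the image tokens to keys and values once per image and then reuses them at every decoding step of the conversation:

```python
import torch
import torch.nn as nn

class CachedImageKV:
    """Sketch: precompute cross-attention keys/values for one image and
    reuse them at every inference step (names are illustrative)."""

    def __init__(self, k_proj: nn.Linear, v_proj: nn.Linear):
        self.k_proj, self.v_proj = k_proj, v_proj
        self.keys = None
        self.values = None

    @torch.no_grad()
    def set_image(self, image_tokens: torch.Tensor) -> None:
        # image_tokens: (B, T_image, d_model) from the frozen vision encoder.
        # Since the image does not change during the conversation, this runs
        # exactly once per image.
        self.keys = self.k_proj(image_tokens)
        self.values = self.v_proj(image_tokens)

    def get(self) -> tuple[torch.Tensor, torch.Tensor]:
        # Called by the cross-attention modules at every decoding step.
        return self.keys, self.values
```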
Evaluations and Results
Downstream Evaluation
To evaluate how well knowledge transfers from the text-only “speechless” data samples to the speech modality in practice, we train MoshiVis on several downstream vision understanding benchmarks with varying amounts of audio data. Specifically, we use OCR-VQA for reading, VQAv2 for detailed visual question answering, and COCO captioning for general image understanding.
We find that even with very little audio data present at training time (\(\leq\)1%), MoshiVis achieves surprisingly strong results when prompted in audio form. With more audio (e.g. 25% of samples), MoshiVis prompted in audio reaches performance comparable to a finetuned PaliGemma, despite the fact that the core LLM backbone and image encoder are kept frozen during training of the adaptation modules.
This result has important implications. First, it allows us to train a VSM in a data- and compute-efficient manner: text-only data 1) allows for faster training, since these samples take less time to load and preprocess, 2) significantly reduces the cost of upstream text-to-speech conversion, and 3) requires less storage space. Second, these results highlight an interesting benefit of the “Inner Monologue” text stream inherent to Moshi: while in this project we only investigate text-to-speech transfer in the context of vision-language alignment, the underlying principle extends well beyond the vision modality and thus opens a promising path for efficiently adapting Moshi to different downstream tasks, even when task-specific speech data is scarce. We are excited to further explore this aspect in the future and are looking forward to community contributions that leverage it!
Conversational MoshiVis on Academic Benchmarks
In addition to assessing the downstream performance of finetuned MoshiVis models, we also evaluate the conversational model itself on those benchmarks. While we generally observe lower quantitative scores, this can in part be attributed to the conversational model’s increased verbosity. This is illustrated below through the comparison between MoshiVis and a model trained only for COCO captioning: the details added by the conversational model are highlighted in orange.
In fact, more detailed and descriptive outputs often lead to lower scores on automated metrics like CIDEr, which favor conciseness to match the original COCO ground truths. This highlights an interesting trade-off between achieving high benchmark scores, which favor the usual one-question-one-answer paradigm with sometimes rigid output formatting, and generating richer descriptions that better fit a natural conversational flow.

MoshiVis (conversational): two teddy bears in a store, one in a blue Hawaiian shirt with a brown ribbon, the other in a brown shirt with a blue ribbon
COCO-captioning model: Two teddy bears are on display in a store

MoshiVis (conversational): a close-up of a bunch of bananas, with a hand reaching in to pick one, and a blue sticker on one of them
COCO-captioning model: A bunch of bananas that are in a bin

MoshiVis (conversational): a young boy in a baseball uniform, mid-action, with a baseball glove on his right hand
COCO-captioning model: A young boy in a field of grass holding a catchers mitt
Impact of Speechless Data on Audio Quality
As discussed above, we can leverage “speechless” data to teach task-specific skills such as captioning, reading, and general visual question answering to MoshiVis. However, using absolutely no audio samples (0%) during training leads to incoherent generated speech, as there is no supervision for the output audio tokens. Nevertheless, we find that the quality of the generated speech sharply increases even when adding just relatively few audio samples during training (e.g. 5%), as illustrated in the figure below.
Here, we show a comparison between full audio training on COCO captioning (Ref: 100% Audio) and training with differing amounts of audio data (trained with 0%, 5%, 10%, or 25% audio data). The samples are “unprompted”, i.e. the model speaks freely about the images.
Real-time Streaming and Latency

This project is funded by Iliad Group, CMA CGM Group and Schmidt Sciences.