Thank you for the valuable feedback on the drafts: Chung-Ming Chien, Moritz Boehle, Richard Hladík, Eugene Kharitonov, Patrick Perez, and Tom Sláma. I’d also like to thank the rest of the Kyutai team for the research discussions without which this article could not exist.
The plan: sandwich a language model in an audio encoder/decoder pair (=neural audio codec), allowing it to predict audio continuations.
As of October 2025, speech LLMs suck. Many LLMs have voice interfaces, but they usually work by transcribing your speech, generating the answer in text, and using text-to-speech to read the response out loud. That’s perfectly fine in many cases (see Unmute), but it’s a wrapper, not real speech understanding. The model can’t hear the frustration in your voice and respond with empathy, it can’t emphasize important words in its answer, it cannot sense sarcasm, and so on.
Yes, there are LLMs (Gemini, ChatGPT’s Advanced Voice Mode, Qwen, Moshi) that understand and generate speech natively. But in practice, they’re either not as smart, or they behave like text model wrappers. Try asking any of them “Am I speaking in a low voice or a high voice?” in a high-pitched voice, and they won’t be able to tell you.
Clearly, speech LLMs lag behind text LLMs. But why? For text, we found out a few years ago that if you take a lot of text data, a big Transformer, and a lot of GPUs, you’ll get some pretty damn good text continuation models. Why can’t we just replace text with audio and get pretty damn good speech continuation models?
As a teaser, here’s what happens when you try to do that naively (warning, loud):
We’ll have a look at why audio is harder to model than text and how we can make it easier with neural audio codecs, the de-facto standard way of getting audio into and out of LLMs. With a codec, we can turn audio into a much shorter sequence of coarser discrete tokens, train models to predict continuations of these tokens, and then decode those back into audio: see the animation above.
Kyutai folks have done a lot of work in this space, which is part of the reason I chose to cover this topic. We’ll start from the basics and build up all the way to Mimi, our neural audio codec. It was originally developed for Moshi and later adopted by others for their models, notably Sesame’s CSM.
To tokenize text, everybody uses a technique called byte-pair encoding and rarely changes the tokenizer: OpenAI has been using the same tokenizer since GPT-4o, an ancient model if you count in LLM years.
A random text from Wikipedia tokenized via the GPT-4o tokenizer
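If you want to play with this yourself, OpenAI’s tiktoken library exposes the GPT-4o tokenizer (the o200k_base encoding):

import tiktoken

enc = tiktoken.get_encoding("o200k_base")  # the GPT-4o tokenizer
tokens = enc.encode("As of October 2025, speech LLMs suck.")
print(tokens)              # a short list of integer token ids
print(enc.decode(tokens))  # round-trips back to the original string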
You can even get decent results without tokenizing text at all, just predicting individual characters. One of the first posts that got me excited about machine learning was Andrej Karpathy’s RNN effectiveness blog post from 2015. Karpathy trains a three-layer LSTM on a single GPU and gets it to generate decent-looking code and LaTeX:
Remember this was ten years ago, back when we didn’t even know that attention is all we need. Now compare Karpathy’s results to a sample from WaveNet, a model DeepMind published a year later:
Purely acoustically, the audio sounds good, but it rarely even manages to produce a single correct English word. We can’t be too hard on WaveNet, though. The samples from Karpathy’s RNNs are only a few thousand characters long, but this 10-second audio consists of 160k audio samples, and WaveNet creates it by painstakingly predicting sample-by-sample.
A single second of audio consists of tens of thousands of samples, although it corresponds to just a few words. Animation from the WaveNet blog post.
It’s difficult to build models that stay coherent over time scales this long, and running the model for that many steps also takes a very long time.
So instead of running the model to predict the samples one-by-one directly, we’d like to train a model to compress the audio into a more manageable size. We could compress the audio, use an LLM to predict a continuation in the compressed representation, and then decompress the result.
But first, let’s get a baseline model by generating audio sample by sample, like WaveNet does. The code for all of these experiments is open-source! Check it out here. I forked Andrej Karpathy’s nanoGPT repo, a simple implementation of GPT-2.
Text and audio are kind of the same from the perspective of the language model: it’s just tokens in, tokens out. The only thing we need to do is to quantize the continuous values of the samples into discrete buckets. Like WaveNet, we’ll use the "μ-law algorithm" to get 256 buckets. We’ll treat those as 256 possible tokens.
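Here’s a minimal sketch of μ-law quantization (my own version, not necessarily the repo’s exact code). It compresses the amplitude range logarithmically before bucketing, so quiet sounds keep more resolution than loud ones:

import numpy as np

def mu_law_encode(audio, mu=255):
    # audio: float samples in [-1, 1] -> integer tokens in [0, 255]
    compressed = np.sign(audio) * np.log1p(mu * np.abs(audio)) / np.log1p(mu)
    return np.round((compressed + 1) / 2 * mu).astype(np.int64)

def mu_law_decode(tokens, mu=255):
    # integer tokens in [0, 255] -> float samples in [-1, 1]
    compressed = tokens.astype(np.float64) / mu * 2 - 1
    return np.sign(compressed) * np.expm1(np.abs(compressed) * np.log1p(mu)) / mu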
Let’s train a language model on audio tokenized like this. For the dataset, we’ll use the Libri-Light dataset, following AudioLM (a paper co-authored by Neil Zeghidour and Eugene Kharitonov). Its train split contains 50k hours in total, but we’ll go with 1000 hours for this experiment. With this sample-by-sample tokenization, we end up with a dataset of 53 GB.
We train a small-ish transformer of 151.28M parameters, about the size of the smallest GPT-2 variant. When we sample from the model, it makes babbling sounds (warning, loud at times!):
Often, it goes into a “crackling mode” that it can’t seem to get out of:
I also trained a smaller model, which I teased at the beginning. It’s prone to generate nightmare fuel screeches (loud!):
As you can tell, we’re not AGI yet. It sounds speech-like, but you can’t make out a single word and the voice keeps changing. No wonder: the context size of the model is 2048 tokens, which, for 16 kHz audio, translates to 128 ms, not even the length of one word. Also, these 10-second examples took 30 minutes to generate on an H100, so we’re a few orders of magnitude away from being real-time.
So let’s build a neural audio codec to compress the audio. The hope is that if we reduce the sampling rate 100x, the model will also become “100x more coherent”. An old idea in machine learning is to do this using an autoencoder: a model that takes an input, compresses it into a smaller “latent space”, and then tries to reconstruct the original input.
In our case, we’ll want an autoencoder whose latent space is quantized so that we can feed the latents into a language model and produce continuations. (You can generate continuations with unquantized latents, but it’s trickier – see the Further reading section.)
Bear with me, because we’ll take a detour from audio: let’s build a quantized autoencoder on images from Fashion MNIST. We’ll take a subset with the first three classes: t-shirt/top, trouser, and pullover.
First, let’s train a regular autoencoder to encode the images into two-dimensional space:
Training a regular autoencoder on Fashion MNIST
Each frame shows one batch of training, with some batches skipped. The little images are the autoencoder’s reconstructions for the images in the batch. I’ve added colors for the three classes (t-shirt/top=blue, trousers=green, pullover=yellow), but the autoencoder doesn’t get a class as input – the space just naturally clusters by class. Let's zoom in on a few reconstructions:
Original images (top) and their reconstructed versions (bottom)
As you can tell, the reconstruction quality is not great. The images are blurry and the first two images are reconstructed to nearly the same thing. But we used a tiny network (4 fully connected layers for the encoder and decoder each) and projected into a mere two dimensions, so we can’t expect too much of our model.
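For reference, the architecture is about as simple as it gets – something along these lines (the exact layer widths here are my guesses, not the post’s):

import torch.nn as nn

encoder = nn.Sequential(
    nn.Flatten(),  # 28x28 Fashion MNIST image -> 784 values
    nn.Linear(784, 256), nn.ReLU(),
    nn.Linear(256, 64), nn.ReLU(),
    nn.Linear(64, 16), nn.ReLU(),
    nn.Linear(16, 2),  # the two-dimensional latent space
)
decoder = nn.Sequential(
    nn.Linear(2, 16), nn.ReLU(),
    nn.Linear(16, 64), nn.ReLU(),
    nn.Linear(64, 256), nn.ReLU(),
    nn.Linear(256, 784), nn.Sigmoid(),  # pixel values back in [0, 1]
)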
Now let’s quantize these embeddings using a clustering. We’ll do something like k-means: we’ll maintain a list of the positions of the cluster centers, initialized randomly. For each training batch, we look at which embeddings would be assigned to each cluster (we don’t modify the embeddings, we just compute the assignment). Then we nudge each cluster center towards the average position of these embeddings.
Also, if a center is unused for a while, we teleport it to a random embedding from the batch, because otherwise it has no way to get unstuck from its current position.
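In code, one batch of these updates could look something like this (a sketch; the function name, the learning rate, and the patience threshold are mine):

import torch

def update_centers(embeddings, centers, steps_unused, lr=0.1, patience=10):
    # Assign each embedding to its nearest cluster center (no gradients here).
    assignment = torch.cdist(embeddings, centers).argmin(dim=1)
    for k in range(len(centers)):
        assigned = embeddings[assignment == k]
        if len(assigned) > 0:
            # Nudge the center towards the average of its assigned embeddings.
            centers[k] += lr * (assigned.mean(dim=0) - centers[k])
            steps_unused[k] = 0
        else:
            steps_unused[k] += 1
            if steps_unused[k] >= patience:
                # Teleport a long-unused center to a random embedding.
                centers[k] = embeddings[torch.randint(len(embeddings), (1,)).item()]
                steps_unused[k] = 0
    return assignment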
Quantizing by fitting a clustering on top of the autoencoder
You can see the reconstructions of the cluster centers getting refined over time.
Next, we’ll make the encoder and decoder themselves better at handling quantized embeddings during training, because currently, we’re just fitting the clustering on top of an autoencoder that is not “aware” it’s being quantized. We’d like the autoencoder to adapt to the quantization as we train it. Currently, we’re doing this:
x = get_batch()
z = encoder(x)
x_reconstructed = decoder(z)
loss = reconstruction_loss(x, x_reconstructed)
Instead of feeding the unquantized embedding into the decoder, we’ll first move it to the closest cluster:
x = get_batch()
z = encoder(x)
z_quantized = to_nearest_cluster(z) # 👈
x_reconstructed = decoder(z_quantized) # 👈
loss = reconstruction_loss(x, x_reconstructed)
There is a snag: if we do this, we won’t be able to train the autoencoder any more, because the quantization operation is not differentiable, meaning there is no gradient flowing from the loss to the weights of the encoder. Essentially, we’re no longer able to answer the question: “if I want the loss to decrease a bit, in which direction should I nudge the encoder’s weights?”
We’ll fix this problem by pretending it doesn’t exist. Yes, really. We’ll think of z_quantized as z moved by an arbitrary vector that doesn’t affect the gradient. That will make the gradient of z equal to that of z_quantized, which is why this is also known as the straight-through estimator of the gradient.
x = get_batch()
z = encoder(x)
residual = z - to_nearest_cluster(z)
# .detach() means "forget that this needs a gradient"
z_quantized = z - residual.detach()
x_reconstructed = decoder(z_quantized)
loss = reconstruction_loss(x, x_reconstructed)
In the forward pass, z_quantized is set to the same value as before, but importantly, the gradient of z is now equal to that of z_quantized, rather than being zero because of the non-differentiable to_nearest_cluster(z) operation.
There is a price to pay for this lie. When training, the encoder’s weights will be updated to improve the reconstruction loss, but they’re updated as if the quantization didn’t happen, so they won’t move in the optimal direction. But as long as the embeddings stick close to their cluster centers, the gradient direction will still be mostly correct.
We can actually encourage the encoder to make embeddings that are easily quantizable by adding a commitment loss: a penalty for each point based on how far it is from its cluster center. The gradient of this loss will push the points closer to their cluster centers.
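In the same pseudocode style as above, with beta as a weight we get to choose:

x = get_batch()
z = encoder(x)
residual = z - to_nearest_cluster(z).detach()
z_quantized = z - residual.detach()
x_reconstructed = decoder(z_quantized)
# Penalize embeddings for straying far from their cluster centers
commitment_loss = (residual ** 2).mean()
loss = reconstruction_loss(x, x_reconstructed) + beta * commitment_loss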
By quantizing at training time and adding a commitment loss, it’s no longer just a clustering being fit on top of the embeddings. The model itself is trained to be good for quantization.
An autoencoder trained explicitly to be easy to quantize
You’ll notice that the training dynamics look different: the commitment loss adds a certain “stiffness” that doesn’t allow the embeddings to move around as easily.
Here’s what the reconstructions look like when we use the quantized representations:
Notice how the first two images are reconstructed to exactly the same image. That’s simply because their embeddings got assigned to the same cluster and therefore quantized to the same value.
The model described here is known as a “VQ-VAE”: a vector-quantized variational autoencoder. The word “variational” here is just a vestigial leftover that doesn’t mean anything anymore.
To improve the reconstruction fidelity, we can just increase the number of cluster centers. But keeping track of too many centers can get prohibitively expensive in terms of compute and memory required, so we’ll do a clever trick: if we want 2^20 (~1M) possible values, we won’t create 2^20 clusters directly. Instead, we’ll use two separate quantizers with 2^10=1024 clusters and combine their result. Each embedding will then be quantized to a tuple of two integers in [0..1023], yielding 2^20 possible combinations.
Ok, but how? Well, recall the residual variable we used in the straight-through estimator, defined as z - to_nearest_cluster(z): the shift from the quantized embedding to the unquantized one. It represents the part of the original vector z that we didn’t manage to take into account when quantizing to to_nearest_cluster(z).
So for each embedding in the batch, we have a corresponding residual vector. The solution is obvious: we’ll quantize these residuals exactly the same way we did with the original embeddings, by training another vector quantizer.
This time, the 2D positions for a single quantizer don’t define images because we need to combine the two quantizers, so we’ll just visualize everything as dots:
Two-level quantization by fitting a quantizer on top of the “residuals”, aka the errors of the first quantizer
Each image is then represented as the index of the cluster of the embedding and that of the residual. Let’s try to reconstruct a few images with this two-level quantizer:
Original images (top), one-level reconstruction (middle), two-level reconstruction (bottom). These images are encoded as (4, 3), (4, 5), (16, 21), and (30, 3).
The reconstructions of the first two images are similar, but no longer the exact same: the first image is represented as (4, 3) and the second as (4, 5). In other words, they share the same token for the first level, but differ in how the residual is quantized. The differences are quite subtle, so here’s a comparison between the one-level and two-level reconstructions:
Difference between one-level and two-level reconstructions
I’d like to emphasize that the second quantization level makes modifications to the embedding, not the output pixels directly. This can be seen by the fact that the leftmost and rightmost image are encoded as (4, 3) and (30, 3) respectively. So they have the same residual code, 3, but it modifies the two reconstructed images in different ways.
Clearly, the reconstructions are still not very accurate. The upper bound on the quality is the reconstruction from unquantized embeddings, so if your autoencoder is bad (and ours is), improving the quantization won’t save you.
We’ll stop here, but a natural extension to this idea is to go beyond two levels. Just take the residuals of the two-level reconstruction and quantize those, and so on. This generalized Residual Vector Quantization algorithm looks like this:
def rvq_quantize(z):
    residual = z
    codes = []
    for level in range(levels):
        quantized, cluster_i = to_nearest_cluster(level, residual)
        residual = residual - quantized  # what this level failed to capture
        codes.append(cluster_i)
    return codes
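Decoding goes the other way: each level’s cluster center approximates what the previous levels missed, so summing them recovers (an approximation of) the embedding. Here, cluster_center is a stand-in for a codebook lookup:

def rvq_dequantize(codes):
    # Sum the cluster centers chosen at each level.
    return sum(cluster_center(level, c) for level, c in enumerate(codes))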
Residual vector quantization was first applied to neural audio codecs in SoundStream, but the idea has been around since the 80s.
Applying RVQ to audio is fairly straightforward. As our autoencoder, we’ll use a convolutional neural network (CNN) similar to what Jukebox uses. The details of the architecture aren’t too important here. What’s important is that it’s a network that takes an audio with t samples and converts it to an array of shape (t/128, 32). In other words, it downsamples by a factor of 128 and gives us 32-dimensional float representations. The decoder then takes the (t/128, 32) embeddings and decodes them back into t samples.
audio = get_batch() # shape: [B, T]
z = encoder(audio) # shape: [B, T/128, 32]
audio_reconstructed = decoder(z) # shape: [B, T]
As before, we’ll add an RVQ after the encoder. The only difference from the image case is that for each audio clip, we have t/128 embedding vectors, not just a single one as we did for images. We just quantize these independently (even though the encoder “sees” more audio than what corresponds to that one vector). During training, we also have a batch dimension, so our model now looks like this:
audio = get_batch()                        # [B, T]
z = encoder(audio)                         # [B, T/128, 32]
# Combine the batch and time dimensions
z = rearrange(                             # [B*T/128, 32]
    z, "b t_emb d -> (b t_emb) d"
)
codes = rvq_quantize(z)                    # integers, [B*T/128, levels]
z_quantized = codes_to_embeddings(codes)   # [B*T/128, 32]
z_quantized = rearrange(                   # [B, T/128, 32]
    z_quantized, "(b t_emb) d -> b t_emb d", b=audio.shape[0]
)
audio_reconstructed = decoder(z_quantized) # [B, T]
The last missing piece before we can train our first neural audio codec is a loss function. There’s a whole rabbit hole we could go into about which one to choose, but we’ll avoid it and just use a very simple one. We’ll compute the log amplitude spectrogram of the original and reconstructed audio, and take their difference. The loss is the mean square of this difference between spectrograms.
To make it harder for the model to overfit to this loss, we take the spectrogram with three different parameters for the short-time Fourier transform, and let our loss be the mean of the three sub-losses. This is called the multi-scale spectral loss.
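Here’s a sketch of what this could look like with PyTorch’s STFT (the FFT sizes and the amplitude floor are illustrative, not the exact settings used in these experiments):

import torch

def multiscale_spectral_loss(x, x_rec, fft_sizes=(512, 1024, 2048)):
    losses = []
    for n_fft in fft_sizes:
        window = torch.hann_window(n_fft, device=x.device)
        def log_spec(audio):
            spec = torch.stft(audio, n_fft=n_fft, hop_length=n_fft // 4,
                              window=window, return_complex=True)
            # Log amplitude, floored to avoid log(0)
            return spec.abs().clamp(min=1e-5).log()
        losses.append((log_spec(x) - log_spec(x_rec)).pow(2).mean())
    return sum(losses) / len(losses)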
Image from Evan Radkoff’s excellent blog post about loss functions in audio ML. Check it out if you want to go down the loss function rabbit hole.
Finally, let’s train some codecs! We’ll look at how varying the number of RVQ levels affects the reconstruction quality. As we expected, increasing the number of levels helps, decreasing the spectral loss:
Let’s hear what the codecs sound like. We’ll use the three codecs to reconstruct this audio from the Expresso dataset:
And the reconstructions:
Clearly, the audio gets better as we add more RVQ levels.
Even with 16 levels, there is some crackling, the audio sounds muffled, and there is a constant high-pitched noise. Later we’ll discuss how we could improve the codec further, but for demonstration purposes, this will do.
So now we have a neural audio codec: we can turn audio into LLM-friendly tokens and back. Codec just means a tokenizer for audio, but we say codec because that’s the term used for classic compression like MP3. I’ll be using codec and tokenizer interchangeably.
Let’s come back to what we wanted to do in the first place: modeling audio. Specifically, we’ll make a model that can take an audio prefix and generate a plausible continuation for it.
Just as a reminder, we want to train good audio LLMs so that we have models that understand and produce speech natively, understanding emotion, emphasis, and so on. They could also be fine-tuned into text-to-speech, speech-to-text, or translation models, among others.
So now that you’re convinced that audio LLMs are the path to AGI, let’s train a few.
For our dataset, we’ll use Libri-Light, like we did for our sample-by-sample model earlier. This time we’ll use 10000h of audio instead of 1000h. It’s a dataset of public-domain audiobooks, so if we have a good model for it, maybe we’ll be able to generate more stories. (Don’t get your hopes up too much.) All we need to do is to convert the audio dataset into a sequence of discrete tokens so that we can feed it into an LLM.
We’ll do that using our 8-level RVQ codec. From an audio with t samples, we’ll get an array of tokens of shape (t/128, 8). But now there’s an issue: how to deal with the fact that for each time step, there’s not one but eight tokens? This is not a problem we have to deal with in text LLMs, where we have a single sequence of tokens.
We’ll do the simplest thing possible and just flatten the array into 1D of shape (t/128 * 8), and have our LLM predict the eight levels in separate time steps.
Flattening a three-level RVQ to allow it to be fed into a language model
The big disadvantage is that we lose some of our temporal compression. We downsampled the audio 128x, but now we’re inflating it 8x again by flattening the levels. That makes inference less efficient, and possibly hurts quality because the effective context size decreases. We'll be using the 8-level RVQ codec rather than the 16-level one to avoid making the compression even worse.
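Concretely, the flattening is a single reshape (rvq_encode is a stand-in for our codec’s encoder plus quantizer):

tokens = rvq_encode(audio)   # [T/128, 8]: eight RVQ codes per frame
flat = tokens.reshape(-1)    # [T/128 * 8]: frame 0's codes, then frame 1's, ...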
You could also predict all RVQ levels for a single step at once (the “parallel pattern”), but that makes things harder for the model, because it has to decide on all levels at once. There are a bunch of other schemes people have tried to balance compression and quality. Here are a few tried out in MusicGen:
Figure taken from MusicGen
Interestingly, as of 2025, there is no single solution that “won”: every paper does something different, and the schemes can get quite involved. Just look at this diagram from MiMo-Audio, a model released in September 2025:
Ways to deal with multiple RVQ levels can get quite involved
Time to finally train a codec-wrapped language model! As I’ve mentioned, our code is based on Andrej Karpathy’s nanoGPT codebase for training text LLMs. We just need to modify it to accept audio as input. But that’s easy, because LLMs don’t care about what kind of tokens you’re feeding in – it’s all just numbers. Once we’ve tokenized the dataset and flattened it into a 1D sequence, we’re good to go. Tokenized this way, our 10000 hours of audio take up 134 GB. For comparison, storing this much data as uncompressed audio would take over 1 TB.
We’re going to use the exact same model architecture and hyperparameters as for the sample-by-sample model: the only difference is in the tokenization. We also have a 10x bigger dataset, but the sample-by-sample model can’t even fit the dataset with 1k hours, so more data wouldn’t save it.
I trained the model on 8 H100s for about 5 days. To get some samples, I decided to prompt the model with a Libri-Light recording of two lines from Michael Field’s poem July. (As I learned when working on this, Michael Field is a pen name of Katharine Harris Bradley and Edith Emma Cooper.) Let’s see what kind of poetry we can get from our model:
There are some signs of life, but we don’t have a poet yet. It sounds like somebody speaking behind a curtain. You can’t really make out what it’s saying, but the intonation is there: it sounds like somebody reading from a book, which is indeed what the model was trained on.
It also maintains a coherent voice, until it decides for the last few seconds to switch to a different one. That is also consistent with the data: we sample the training data from a concatenation of all the audiobooks chopped up into segments and mixed together, so the model does encounter boundaries between different speakers.
Our codec was deliberately simplistic, which explains why the results aren't great – but there's been a good amount of research on neural audio codecs in the last four years that we could leverage. We won’t implement all the improvements here, but instead we’ll look at what happens when we use Mimi as the tokenizer.
Mimi is a modern neural audio codec built here at Kyutai for Moshi, our audio language model. It’s since been used as the tokenizer for other models as well, like Sesame CSM, VoXtream, and LFM2-Audio.
Unsurprisingly, Mimi sounds a lot better than the homemade codec we trained earlier.
Instead of the multi-scale spectral loss, Mimi uses an adversarial loss, like a GAN. There’s a discriminator network that tries to classify audios as being original or reconstructed by the codec, and the goal of the codec is to fool this discriminator.
Another improvement Mimi adds is RVQ dropout: it uses 32 RVQ levels, but during training, the reconstruction is sometimes randomly truncated to a lower number of levels. That allows us to run Mimi with a lower number of RVQ levels at inference time and still get decent results, because it doesn’t rely on all levels being present. For our codec, we had to train a separate model for each number of levels.
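The training-time trick is simple; sketched in our earlier pseudocode style (a sketch of the idea, not Mimi’s actual code):

import random

codes = rvq_quantize(z)                 # all 32 levels
n_levels = random.randint(1, 32)        # pick a random truncation for this batch
z_quantized = codes_to_embeddings(codes[:n_levels])
x_reconstructed = decoder(z_quantized)  # must sound decent at any level count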
Let’s hear our example audio reconstructed with Mimi:
Original
For our purposes, a variant with fewer levels might have the advantage of being easier to model because it’s more compressed. Let’s train models with 8- and 32-level Mimi and compare the results.
I trained the exact same model architecture as before; the only thing that changes is the tokenizer. It’s 10k hours from Libri-Light as the dataset, just like when we used our simple codec. Mimi has a sample rate of 24 kHz but Libri-Light uses 16 kHz, which puts a cap on how good the output can sound, since we lose the higher frequencies of the audio.
Mimi downsamples the audio a lot more aggressively, too: its frame rate is 12.5 frames per second, whereas we used 125 frames per second for our codec – 10x higher! This means the dataset is also smaller on disk (12.5 × 32 = 400 tokens per second for Mimi versus 125 × 8 = 1000 for ours). With our codec, it took 134 GB, but for Mimi it’s “just” 54 GB.
Here’s a poem generated with the model trained on Mimi-tokenized data. I prompted it with two lines from the poem, as before:
Here is my best attempt at a transcription:
When the grass is gone
And corn still grassy;
Illness worried in the fur
this and pelan in stones
during the turan’s ciscerey
headforths nepet Paul Twain.
He sees zin in them.
A tad too surrealist for my taste, but maybe Lewis Carroll would like it.
I have a confession to make: I lied to you just now. But just a bit, and for didactic purposes. In fact, the model above was trained on audio from a 31-level Mimi, where I omitted the very first level, which contains the “semantic token”.
The role of this token is to represent semantic information of the audio, without necessarily aiding reconstruction. I won’t go into how these work, but in one sentence, Mimi’s semantic tokens are distilled from WavLM, which you can think of as a BERT for speech.
To get a feeling for what information semantic tokens encode, let’s take this example audio, passed through Mimi:
Now let’s train a language model on the full Mimi, including semantic tokens. We’re going to run the model in a way where we keep the semantic tokens from the original audio but discard the others, and let the model predict them. That means the information from the semantic tokens is fixed (“teacher-forced”), but the model is free to decide the others according to what continuations it finds plausible.
We can get an idea of what information is contained in semantic tokens by keeping them fixed and letting the model regenerate the rest.
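Sketched out, the procedure looks like this (mimi_encode, mimi_decode, and sample_acoustic_levels are stand-ins for the real functions):

import numpy as np

codes = mimi_encode(audio)                 # [T_frames, 32]
generated = np.zeros_like(codes)
for t in range(codes.shape[0]):
    generated[t, 0] = codes[t, 0]          # teacher-force the semantic token
    # Let the model sample the 31 acoustic levels given everything so far
    generated[t, 1:] = sample_acoustic_levels(model, generated[: t + 1])
audio_new_voice = mimi_decode(generated)   # same words, possibly a new voice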
Listen to two different reconstructions we obtain this way:
The voice is completely different, but it’s saying the same thing! This means the semantic tokens encode what the person is saying, but are invariant to the voice. That’s useful because it helps the model focus on what to say, not how to say it. In that regard, they’re closer to text tokens, which also don’t contain information about the voice, intonation, timing, or emotion.
Now let’s take the model trained on semantic Mimi and ask it to complete the poem:
When grass is gone
and corn still grassy;
from the man was nothing moan.
The low death and heart
She came fyde wood.
A finteriest, a fall,
all them.
It still makes up words and the sentences are not too coherent, but clearly, the proportion of real words is much higher; the model is “more semantic”. The acoustic quality is the same, which is what we’d expect.
Let’s listen to a second poem:
When grass is gone
and corn still grassy;
hope won and she
who is just a night in Tatan
in doe ock-ohm?
the whom?
Indeed, the whom?
We can sacrifice some acoustic quality to improve the semantics by reducing the number of RVQ levels. Let’s do 8. That way, we get higher audio compression, and a proportionally higher part of the loss comes from the semantic token, since it’s now 1 token in 8 rather than 1 in 32.
One of the first things I noticed about this model is that it learned to memorize the Librivox notice, so it sometimes generates things like:
Chapter 6 of The Founday, by R. Auclair.
This is a Librivox recording. All Librivox recordings are in the public domain. For information, or to volunteer, please visit librivox.org.
Reading by: Kelvert
Repeating the training data is generally not what you want, but in our case it’s a great sign of life, because the previous models couldn’t even manage that. It also makes up the book, author, and reader, so there is still novelty here.
Now let’s try to make some more poetry:
When grass is gone
and corn still grassy;
When so we could say
that in fairy interesting wife
who lay there and gone
that save the rosy light of life
Jay Dien, the antique mollity
and a mollity the beast of gray failed summonend of poem.
This recording is in the public domain.
[different voice]
So we have formed a float that sent in would rattle down. The piece of opportunity reading and assimila—
This is great. There are several signs of the model being better than the previous ones. I love that it makes up the word “mollity” and then repeats it in the next line. Also, it realizes that it’s reciting a poem and ends the section with “end of poem”. Then it decides it’s the end of the chapter/section and it ends with the “This recording is in the public domain.” disclaimer. After that, it changes the voice and continues talking. That makes sense, since the clips from various audiobooks are just shuffled and concatenated during training, so here the model simulated a clip boundary.
We might get even better results by weighing the loss of the semantic tokens higher than the acoustic tokens, to make the model focus more on the meaning than the sound – in fact, Moshi uses a semantic loss factor of 100x! But we have to stop somewhere.
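A sketch of what that weighting could look like on our flattened 8-level sequences (the 100x factor is Moshi’s; the rest is my own illustration):

ce = cross_entropy(logits, targets, reduction="none")  # one loss value per token
positions = torch.arange(ce.shape[-1])
is_semantic = positions % 8 == 0        # the first of each frame's 8 tokens
loss = (ce * torch.where(is_semantic, 100.0, 1.0)).mean()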
We’ve managed to use neural audio codecs to make an audio language model that generates somewhat coherent speech. Obviously, that’s not where the state of the art is in 2025 (and we’re not trying to reach it here), but keep in mind that using the exact same model without neural audio codecs gives us this:
Of course, still a long way to go to match text models! Currently, there seems to be a trade-off between speech understanding and reasoning abilities. At the beginning, I mentioned that the speech-native models (Gemini, ChatGPT’s Advanced Voice Mode, Qwen, Moshi) aren’t able to tell you whether you’re speaking in a high or low voice, despite the fact that they’re trained to natively understand audio. This is likely because they’re trained on a lot of data generated synthetically with text-to-speech and/or because understanding the tone of the voice (apparently) doesn’t help the models make more accurate predictions.
Kyutai took a stab at creating a voice chat based on an audio language model with Moshi (demo, paper), released in July 2024. Moshi might not be the AI you’d pick to do your homework for you, but cut it some slack: it was the first end-to-end voice AI, released even before OpenAI’s Advanced Voice Mode.
Moshi models an “inner monologue” text stream in parallel with audio streams for itself and the user. The text stream helps it plan what it’s going to say, and ablations showed that this helps the model massively. At the same time, it’s a bit sad: most of the reasoning seems to be delegated to the text stream, and the audio streams are just there to provide an integrated speech-to-text and text-to-speech.
Moshi models two audio streams and a text stream in parallel
It’s not just Moshi: as the “am I speaking in a high voice” experiment shows, this over-reliance on text in favor of audio is an issue for all audio LLMs. And that’s even though the dominant modeling approach is somewhat different than Moshi’s: interleaving text and audio tokens instead of modeling them in parallel streams.
Over a year after Moshi, audio models still lag behind text LLMs. But why? To me, this mysterious unsolved “modality gap” makes audio ML an exciting field to work on.
Thank you for reading! The code for the experiments is here, and for the animations here.
Here are some papers to check out if you'd like to learn more. This list is naturally Kyutai-centric because that's the school of thought I'm exposed to; my goal is not to do a complete review of the field.
van den Oord et al., 2016. WaveNet: A Generative Model for Raw Audio
Mehri et al., 2016. SampleRNN: An Unconditional End-to-End Neural Audio Generation Model
van den Oord et al., 2017. Parallel WaveNet: Fast High-Fidelity Speech Synthesis
Kumar et al., 2019. MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis
Kong et al., 2020. HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis
van den Oord et al., 2017. Neural Discrete Representation Learning
Esser et al., 2020. Taming Transformers for High-Resolution Image Synthesis
Lakhotia et al., 2021. On Generative Spoken Language Modeling from Raw Audio
Zeghidour et al., 2021. SoundStream: An End-to-End Neural Audio Codec
Lee et al., 2022. Autoregressive Image Generation using Residual Quantization
Défossez et al., 2022. High Fidelity Neural Audio Compression
Hsu et al., 2021. HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units
Défossez et al., 2024. Moshi: a speech-text foundation model for real-time dialogue
Dieleman, 2025. Generative modelling in latent space
Peng et al., 2025. VibeVoice Technical Report
Rouard et al., 2025. Continuous Audio Language Models
Here are some modern LLMs (as of October 2025) that natively support audio. Again, I'm not trying to maintain a complete list here, and I'm not including models without any published technical details.
Moshi (Kyutai, 2024): the online demo of Moshi, Kyutai's audio language model – see above.
CSM (Sesame, 2025): a natural-sounding voice chat, based on Llama + Mimi.
Qwen3-Omni (Alibaba, 2025): Alibaba's multimodal LLM. The audio output is created by a "talker" model whose outputs are not fed back into the main model, which, as far as I can tell, basically makes it a text model with an integrated text-to-speech.
MiMo-Audio (Xiaomi, 2025): an audio-only language model that shows promising few-shot capabilities, similar to what GPT-2 did for text.
LFM2-Audio (Liquid AI, 2025): audio/text language model, uses Mimi as the codec.