As part of our ongoing effort to push the boundary of speech-to-speech models, we have released Hibiki, a model for simultaneous, on-device, high-fidelity speech-to-speech translation. We build on the same ideas and architecture underpinning our speech conversational agent Moshi, allowing for frugal training with in-house generated synthetic data and on-device inference on mobile. Hibiki transfers the speaker’s voice and flow, and its quality and naturalness are closer to human interpreters than any existing model. As part of our mission for open science, we open-source Hibiki’s inference code and weights, with all the training details in our research paper. Check out our sample page for examples of applications.
Xavier de la Porte discussing the evolution of AI-based translation, with a live translation from Hibiki. © Arte
What is Hibiki?
Hibiki is a model for streaming speech translation (also known as simultaneous translation). Unlike offline translation, where one waits for the end of the source utterance to start translating, Hibiki adapts its flow to accumulate just enough context to produce a correct translation in real time, chunk by chunk. As the user speaks, Hibiki generates natural speech in the target language, with voice transfer, along with a text translation.
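As a rough illustration of what this chunk-by-chunk behavior looks like from the caller’s side, here is a minimal sketch. The frame size follows from the 12.5 Hz framerate described below, while the sample rate, the `model.step` interface, and the function names are assumptions for illustration, not Hibiki’s actual API.

```python
FRAME_RATE_HZ = 12.5                                   # Hibiki's token framerate
SAMPLE_RATE = 24_000                                   # assumed audio sample rate
SAMPLES_PER_FRAME = int(SAMPLE_RATE / FRAME_RATE_HZ)   # 1920 samples per 80 ms frame

def translate_stream(model, source_frames):
    """Feed 80 ms source frames; yield translated audio and text as soon as they are ready."""
    for frame in source_frames:                        # one array of SAMPLES_PER_FRAME samples
        out_audio, out_text = model.step(frame)        # hypothetical one-frame decoding step
        if out_audio is not None:                      # the model may still be accumulating context
            yield out_audio, out_text                  # play the audio / display the text immediately
```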
Architecture
Hibiki is a decoder-only model for simultaneous speech translation. It leverages the multistream architecture of Moshi to model source and target speech jointly, which allows Hibiki to continuously process the input stream while generating the target speech. Hibiki produces text and audio tokens at a constant framerate of 12.5 Hz, yielding a continuous output audio stream along with a timestamped text translation. Hibiki consists of a main backbone of 2 billion parameters. We also train a mobile version, Hibiki-M, with 1 billion parameters for on-device inference.

Hibiki handles three streams: the audio from the user (input), the translated speech (output), and a text stream matching the output audio.
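To make the three-stream layout concrete, here is a hedged sketch of what the decoder consumes and produces at each 12.5 Hz step. The exact codec tokenization, the number of codebooks, and the `next_frame` call are assumptions for illustration, not the actual implementation.

```python
from dataclasses import dataclass

@dataclass
class HibikiStep:
    """What the decoder sees and emits for one 80 ms frame (12.5 Hz)."""
    source_audio_tokens: list[int]   # codec tokens for the incoming user-audio frame
    target_audio_tokens: list[int]   # codec tokens for the translated-speech frame
    target_text_token: int           # text token aligned with the output audio (timestamped)

def decode_step(backbone, history, new_source_tokens):
    """One autoregressive step: attend over the interleaved history plus the new
    source frame, then emit the next text token and target audio tokens."""
    text_tok, tgt_audio = backbone.next_frame(history, new_source_tokens)  # hypothetical call
    return HibikiStep(new_source_tokens, tgt_audio, text_tok)
```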
How is it trained?
Hibiki relies on supervised training on aligned source speech and target speech and text from the same speaker. Such data does not exist in significant amounts, so we rely on synthetic data generation. Word-level alignments between source and target transcripts are computed with a weakly supervised contextual alignment method that leverages an off-the-shelf MADLAD machine translation system. The derived alignment rule (a word should only appear in the target once it is predictable from the source) is applied either by inserting silences or by synthesizing targets with a voice-controlled, alignment-aware TTS.

On text pairs, we automatically detect when there is sufficient context to predict a word, based on the maximal increase in log-likelihood difference.
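A minimal sketch of this criterion, assuming a hypothetical `word_logprob` scoring interface on top of an off-the-shelf MT model such as MADLAD: each target word is scored under growing source prefixes, and we keep the prefix length at which its log-likelihood jumps the most.

```python
def align_target_word(mt_model, source_words, target_prefix, target_word):
    """Return the number of source words needed before `target_word` becomes predictable."""
    scores = []
    for k in range(1, len(source_words) + 1):
        src_prefix = " ".join(source_words[:k])
        # log p(target_word | source prefix, already generated target prefix) -- hypothetical API
        scores.append(mt_model.word_logprob(src_prefix, target_prefix, target_word))
    if len(scores) < 2:
        return len(scores)
    # The word is considered predictable at the prefix length with the largest log-likelihood jump.
    deltas = [scores[k] - scores[k - 1] for k in range(1, len(scores))]
    return 2 + max(range(len(deltas)), key=deltas.__getitem__)
```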

We generate synthetic aligned data: the output is generated with our TTS, and silences are inserted based on the estimated alignment.
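The silence-insertion variant can be sketched as follows, assuming per-word target waveforms from the TTS and, for each target word, the time at which it becomes predictable from the source. The sample rate and function names are illustrative.

```python
import numpy as np

SAMPLE_RATE = 24_000   # assumed synthesis sample rate

def build_aligned_target(target_word_waveforms, ready_times_sec):
    """Concatenate per-word TTS waveforms, padding with silence so that each word
    only starts once it is predictable from the source (per the contextual alignment)."""
    out = np.zeros(0, dtype=np.float32)
    for wav, ready_time in zip(target_word_waveforms, ready_times_sec):
        start = int(ready_time * SAMPLE_RATE)
        if start > len(out):                           # word not yet predictable: insert silence
            out = np.concatenate([out, np.zeros(start - len(out), dtype=np.float32)])
        out = np.concatenate([out, wav.astype(np.float32)])
    return out
```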
Inference
At inference, Hibiki continuously encodes source speech and produces target speech. Hibiki relies on simple temperature sampling and is thus compatible with batching, unlike models that rely on complex inference policies. Moreover, the fidelity of Hibiki’s voice transfer can be controlled by changing the coefficient of the classifier-free guidance: a larger coefficient increases voice similarity, but excessive coefficients can lead to worse translations. Hibiki currently only supports French-to-English translation. Thanks to its decoder-only approach, Hibiki can be batched with up to 320 parallel translations (160 with CFG) on a single H100 GPU. Its smaller alternative, Hibiki-M, can run locally on smartphone hardware. Current models were trained on sequences of up to 120 seconds and use a context size of 40 seconds. We provide inference code for PyTorch (CUDA), Rust (CUDA), and MLX (iOS and macOS) on our repository.
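The guidance knob combined with temperature sampling can be sketched as below. This is the standard classifier-free guidance formulation rather than Hibiki’s exact implementation, and the default coefficients are placeholders.

```python
import torch

def sample_with_cfg(cond_logits, uncond_logits, cfg_coef=3.0, temperature=0.8):
    """Combine conditioned and unconditioned logits, then temperature-sample one token.
    A larger cfg_coef increases voice similarity; too large a value hurts translation quality."""
    logits = uncond_logits + cfg_coef * (cond_logits - uncond_logits)
    probs = torch.softmax(logits / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1)
```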
Neil Zeghidour showcases live translation running on-device.
Evaluations and results
We perform objective and subjective evaluation of Hibiki for translations from French to English. Listen to some examples on our sample page.
Subjective evaluations
Model | Quality ↑ | Speaker Sim. ↑ | Naturalness ↑ |
---|---|---|---|
Ground-Truth | 4.18 | - | 4.12 |
Seamless | 2.22 | 2.86 | 2.18 |
Hibiki | 3.78 | 3.43 | 3.73 |
We run subjective evaluations on recordings from the European Parliament (VoxPopuli), comparing our model against a baseline (Seamless) and human interpreters. Samples are rated on a scale from 1 (lowest) to 5 (highest) along three axes: quality, speaker similarity, and naturalness. We improve on existing methods by a large margin, while getting closer than ever to human interpreters.
Objective evaluations
Model | BLEU ↑ | ASR-BLEU ↑ | Speaker Sim. ↑ | Avg. Lag (sec.) ↓ |
---|---|---|---|---|
Seamless | 25.4 | 23.9 | 0.43 | 4.2 |
Hibiki-M | 28.2 | 26.0 | 0.39 | 4.9 |
Hibiki | 27.2 | 25.9 | 0.52 | 5.6 |
We also evaluate on a new long-form benchmark based on NTREX, which measures the ability to translate real-world audio made of multiple sentences. We again observe strong improvements in translation accuracy (BLEU and ASR-BLEU) and in speaker similarity, at the expense of a small increase in average lag.
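For reference, BLEU (and ASR-BLEU, after transcribing the generated speech with an ASR system) can be computed with sacrebleu along these lines; the example sentences below are placeholders, not evaluation data.

```python
import sacrebleu

# Placeholder hypotheses/references; in practice these are full test-set translations.
hypotheses = ["the committee approved the proposal this morning"]
references = ["the commission approved the proposal this morning"]

bleu = sacrebleu.corpus_bleu(hypotheses, [references])   # one set of references
print(f"BLEU = {bleu.score:.1f}")
```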
What to expect in the future
We want to extend Hibiki to support many more languages, to deliver a definitive solution for live speech translation.