
Announcing Helium-1 Preview

Author: Kyutai Team

We are excited to release a preview of our new backbone language model, Helium-1. Like the chemical element it is named after, Helium-1 is a lightweight model, with around 2B parameters. Our goal with this model is to enable the development of A.I. systems running on edge and mobile devices. At Kyutai, we believe that latency and privacy are two key elements of personal A.I. systems, and deploying models locally is a good way to achieve both. Helium-1 will be a multilingual language model: the preview version currently supports 6 languages, showing strong capabilities in those languages compared to existing open-weights models. We will add support for more languages in the future.

Takeaways:

- Helium-1 Preview is a lightweight language model with around 2B parameters, aimed at A.I. systems running on edge and mobile devices.
- It currently supports 6 languages (English, French, German, Italian, Portuguese and Spanish), showing strong capabilities compared to existing open-weights models in the 1.5B to 3B parameter range.
- It was trained on 2.5T tokens from publicly available sources only, using token-level distillation of a 7B-parameter model.

How we trained Helium-1 Preview

Helium-1 is based on the transformer architecture and uses standard improvements such as pre-normalization with RMSNorm, rotary position embeddings, and feed-forward layers based on gated linear units with the SiLU activation. Overall, our architecture is almost identical to the one introduced by LLaMA 1, allowing straightforward deployment with existing tools such as MLX, vLLM, ollama or llama.cpp.
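To make these components concrete, here is a minimal sketch (in PyTorch, not our actual training code) of the two pieces named above: RMSNorm pre-normalization and a SiLU-gated feed-forward layer. The class names and dimensions are illustrative assumptions, not Helium-1's actual configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class RMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-5):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Scale by the inverse root mean square of the features,
        # then apply a learned per-dimension gain.
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight


class GatedFeedForward(nn.Module):
    """SiLU-gated linear unit: silu(x W_gate) * (x W_up), projected back down."""

    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden_dim, bias=False)
        self.w_up = nn.Linear(dim, hidden_dim, bias=False)
        self.w_down = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))


# Pre-normalization: the norm is applied inside the residual branch,
# before the sub-layer, rather than after the residual sum.
x = torch.randn(2, 16, 512)  # (batch, sequence, model dim) -- illustrative sizes
x = x + GatedFeedForward(512, 1408)(RMSNorm(512)(x))
```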

For our training dataset, we relied only on publicly available sources, so that our dataset is reproducible. We used curated data sources, including Wikipedia, Stack Exchange and scientific articles (peS2o), as well as filtered webpages from Common Crawl. Our pipeline to filter Common Crawl starts from a WARC dump consisting of HTML webpages. We extract the textual content with resiliparse, apply language identification with fastText, and perform deduplication at the paragraph level. Finally, to filter out low-quality webpages, we train a fastText classifier to distinguish high-quality content from random Common Crawl webpages. This classifier is trained at the line level, and an aggregated score for each document is obtained by averaging the line scores, weighted by line length (see the sketch below). Following previous work, we also apply a data curriculum: for the last 100k steps, we increase the threshold of our quality filter and include a subset of the high-quality Dolmino mix.
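As an illustration of the aggregation step, the sketch below computes a document score as the length-weighted average of per-line classifier scores. The function name, label name and threshold are assumptions for illustration; only the weighting scheme comes from the description above.

```python
from typing import Callable


def document_quality(text: str, line_score: Callable[[str], float]) -> float:
    """Aggregate per-line quality scores into a document score,
    weighting each line by its length in characters."""
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return 0.0
    weights = [len(line) for line in lines]
    scores = [line_score(line) for line in lines]
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)


# Usage with a line-level fastText classifier (model path, label name and
# threshold are assumptions, not the ones used for Helium-1):
#
#   import fasttext
#   model = fasttext.load_model("quality_classifier.bin")
#
#   def line_score(line: str) -> float:
#       labels, probs = model.predict(line)
#       return probs[0] if labels[0] == "__label__hq" else 1.0 - probs[0]
#
#   keep = document_quality(doc_text, line_score) >= 0.5
```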

Our model is trained on 2.5T tokens, with a context size of 4096 tokens and a global batch size of 1024 sequences. We use token-level distillation of a 7B-parameter model to train Helium-1 Preview.
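The sketch below shows what token-level distillation typically looks like: the student is trained to match a frozen teacher's next-token distribution at every position with a KL objective. The temperature and any mixing with a standard cross-entropy term are assumptions; the exact distillation loss used for Helium-1 is not specified here.

```python
import torch
import torch.nn.functional as F


def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 1.0) -> torch.Tensor:
    """KL divergence between the teacher's and the student's next-token
    distributions, averaged over all token positions.

    Both tensors have shape (batch, sequence, vocab); the teacher is frozen.
    """
    vocab = student_logits.size(-1)
    teacher_logprobs = F.log_softmax(teacher_logits.detach() / temperature, dim=-1).view(-1, vocab)
    student_logprobs = F.log_softmax(student_logits / temperature, dim=-1).view(-1, vocab)
    # log_target=True: the target is also given as log-probabilities;
    # "batchmean" averages over the flattened token positions.
    kl = F.kl_div(student_logprobs, teacher_logprobs, log_target=True, reduction="batchmean")
    return kl * temperature ** 2
```

In a training loop, this loss would be computed from the student's and the 7B teacher's logits on the same token sequence, possibly combined with the usual next-token cross-entropy; that combination is an assumption, not a detail stated in this post.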

Helium-1 Preview evaluations

We evaluate Helium-1 Preview on standard benchmarks, including closed-book question answering, common sense reasoning, machine translation, and multiple-choice question answering on high-school and college subjects. We include evaluations in English as well as in the five other languages currently supported by Helium-1 Preview: French, German, Italian, Portuguese and Spanish. We compare Helium-1 to other existing open-weights models with 1.5B to 3B parameters.

English Results

| Bench | Helium-1 Preview (2.2B) | HF SmolLM2 (1.7B) | Gemma-2 (2.6B) | Llama-3.2 (3B) | Qwen2.5 (1.5B) |
|---|---|---|---|---|---|
| MMLU | 51.2 | 50.4 | 53.1 | 56.6 | 61.0 |
| NQ | 17.3 | 15.1 | 17.7 | 22.0 | 13.1 |
| TQA | 47.9 | 45.4 | 49.9 | 53.6 | 35.9 |
| ARC E | 80.9 | 81.8 | 81.1 | 84.6 | 89.7 |
| ARC C | 62.7 | 64.7 | 66.0 | 69.0 | 77.2 |
| OBQA | 63.8 | 61.4 | 64.6 | 68.4 | 73.8 |
| CSQA | 65.6 | 59.0 | 64.4 | 65.4 | 72.4 |
| PIQA | 77.4 | 77.7 | 79.8 | 78.9 | 76.0 |
| SIQA | 64.4 | 57.5 | 61.9 | 63.8 | 68.7 |
| HellaSwag | 69.7 | 73.2 | 74.7 | 76.9 | 67.5 |
| WinoGrande | 66.5 | 65.6 | 71.2 | 72.0 | 64.8 |
| Average | 60.7 | 59.3 | 62.2 | 64.7 | 63.6 |

Multilingual Results

| Language | Benchmark | Helium-1 Preview (2.2B) | HF SmolLM2 (1.7B) | Gemma-2 (2.6B) | Llama-3.2 (3B) | Qwen2.5 (1.5B) |
|---|---|---|---|---|---|---|
| German | MMLU | 45.6 | 35.3 | 45.0 | 47.5 | 49.5 |
| | ARC C | 56.7 | 38.4 | 54.7 | 58.3 | 60.2 |
| | HellaSwag | 53.5 | 33.9 | 53.4 | 53.7 | 42.8 |
| | MKQA | 16.1 | 7.1 | 18.9 | 20.2 | 10.4 |
| | FLORES | 33.9 | 12.2 | 30.7 | 28.2 | 20.8 |
| Spanish | MMLU | 46.5 | 38.9 | 46.2 | 49.6 | 52.8 |
| | ARC C | 58.3 | 43.2 | 58.8 | 60.0 | 68.1 |
| | HellaSwag | 58.6 | 40.8 | 60.5 | 61.1 | 51.4 |
| | MKQA | 16.0 | 7.9 | 18.5 | 20.6 | 10.6 |
| | FLORES | 25.7 | 15.0 | 25.7 | 23.7 | 20.4 |
| French | MMLU | 46.0 | 37.7 | 45.7 | 48.8 | 51.9 |
| | ARC C | 57.9 | 40.6 | 57.5 | 60.1 | 67.4 |
| | HellaSwag | 59.0 | 41.1 | 60.4 | 59.6 | 51.2 |
| | MKQA | 16.8 | 8.4 | 18.4 | 19.6 | 9.7 |
| | FLORES | 44.3 | 20.0 | 43.3 | 39.3 | 31.2 |
| Italian | MMLU | 46.1 | 36.3 | 45.6 | 48.8 | 50.5 |
| | ARC C | 57.4 | 39.1 | 53.9 | 60.1 | 64.6 |
| | HellaSwag | 55.2 | 37.7 | 56.2 | 56.8 | 46.8 |
| | MKQA | 15.3 | 6.3 | 18.0 | 19.0 | 9.9 |
| | FLORES | 25.8 | 10.4 | 25.2 | 23.8 | 16.4 |
| Portuguese | MMLU | 46.2 | 37.7 | 45.6 | 49.2 | 53.0 |
| | ARC C | 56.8 | 40.6 | 57.0 | 62.1 | 66.6 |
| | HellaSwag | 57.3 | 41.0 | 58.7 | 59.1 | 50.9 |
| | MKQA | 14.7 | 6.6 | 16.9 | 19.1 | 9.2 |
| | FLORES | 43.0 | 20.0 | 43.6 | 40.5 | 33.0 |
| | Average | 42.1 | 27.8 | 42.3 | 43.6 | 40.0 |

What to expect in the future

This is a preview version of the 2B-parameter LLM that we are currently developing and plan to release in the coming months. The full version of the model will support more languages and will have better capabilities. We are also planning to release our training codebase, written in JAX, as well as the pre-processing pipeline to reproduce our training dataset. Feel free to reach out if you have any feedback about this model!

Model card

See the HuggingFace repository.