
Helium 1: a modular and multilingual LLM

Author: Kyutai Team

Today, we are thrilled to announce our latest large language model, Helium 1: a lightweight yet powerful model with 2 billion parameters, designed to set a new benchmark within its size category. Helium 1 achieves state-of-the-art performance among models of similar scale when evaluated across a diverse set of tasks in European languages, demonstrating strong multilingual capabilities and generalization.

With its compact architecture, Helium 1 is optimized for edge computing and on-device deployment, enabling fast, efficient, and private AI experiences directly on smartphones, embedded devices, and other resource-constrained environments. This represents a significant step toward democratizing access to AI and reducing reliance on cloud-based systems.

As part of our commitment to open science and transparency, we are also releasing the codebase and tools required to reproduce the training dataset, covering the 24 official languages of the European Union. This not only enhances the reproducibility of our work but also contributes to the broader goal of building equitable and inclusive language technologies that reflect the linguistic diversity of Europe.

Helium 1 is a foundational step in our long-term roadmap to deliver compact and capable language models. We look forward to engaging with the research and developer communities to further refine, apply, and extend the capabilities of Helium 1 in real-world settings.

Takeaways:

Helium 1 is a 2-billion-parameter language model achieving state-of-the-art results among models of similar size across a diverse set of tasks in European languages.

Its compact size makes it well suited to edge computing and on-device deployment.

We release the model weights together with dactory, the codebase and tools needed to reproduce the training dataset covering the 24 official languages of the European Union.

Dactory, the data factory

A critical ingredient in the development of large language models is the training dataset. In recent years, web-crawled data has represented the majority of the training data for these models. In our case, we rely on the corpus of webpages distributed by the Common Crawl project. Because the content and quality of Common Crawl data are very diverse, we developed tools to process and filter the data in order to obtain high-quality datasets suitable for training strong language models. We release these tools in the dactory GitHub repository.

Our pipeline starts from the WARC archives, which contain HTML webpages. The first step is thus to extract the main textual content of each page, using the resiliparse package. Then, we apply language identification with fastText, using the publicly available lid.176 model.
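To make these first stages concrete, here is a minimal Python sketch of text extraction and language identification, assuming the fastwarc/resiliparse packages and the lid.176.bin fastText model are available locally; the function name and record handling are illustrative, not the exact dactory code.

```python
# Sketch of the first pipeline stages: text extraction and language identification.
# Assumes fastwarc/resiliparse and the lid.176.bin fastText model; this is an
# illustration, not the exact dactory implementation.
import fasttext
from fastwarc.warc import ArchiveIterator, WarcRecordType
from resiliparse.extract.html2text import extract_plain_text

lid_model = fasttext.load_model("lid.176.bin")

def iter_documents(warc_path):
    with open(warc_path, "rb") as f:
        for record in ArchiveIterator(f, record_types=WarcRecordType.response):
            html = record.reader.read().decode("utf-8", errors="ignore")
            # Keep only the main textual content of the page.
            text = extract_plain_text(html, main_content=True)
            if not text.strip():
                continue
            # fastText expects a single line of text as input.
            labels, probs = lid_model.predict(text.replace("\n", " "))
            lang = labels[0].removeprefix("__label__")
            yield {"lang": lang, "lang_score": float(probs[0]), "text": text}
```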

Next, we perform deduplication at the paragraph level. More precisely, we use a Bloom filter to detect duplicated lines, and then remove paragraphs that contain more than a certain fraction of duplicated lines (in practice, paragraphs with more than 80% duplicated lines). The motivation for deduplicating at the paragraph level is the following: doing it at the document level would not remove enough duplicated content, while deduplicating at the line level risks removing common lines from the middle of otherwise non-duplicated content, leading to incoherent text. Examples include a common ingredient in a list of ingredients or a common line of code in a program example. To make the deduplication process efficient, we deduplicate each shard of a dump (out of 100) in parallel. Finally, we initialize the Bloom filter with lines that were found at least twice in a random subset of Common Crawl webpages, allowing the deduplication process to remove frequent content the first time it appears.
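As an illustration of this step, below is a minimal sketch of paragraph-level deduplication with a Bloom filter; the filter size, hash scheme and helper names are simplified assumptions, and only the 80% duplicated-line rule comes from the description above.

```python
# Minimal illustration of paragraph-level deduplication with a Bloom filter.
# Filter parameters and hashing are simplified stand-ins for the dactory
# implementation; only the 80% duplicated-line threshold comes from the post.
import hashlib

class BloomFilter:
    def __init__(self, size_bits=1 << 27, num_hashes=7):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item):
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        for i in range(self.num_hashes):
            chunk = int.from_bytes(digest[4 * i: 4 * i + 8], "little")
            yield chunk % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))

def dedup_document(text, bloom, max_dup_ratio=0.8):
    """Drop paragraphs in which more than 80% of the lines were already seen."""
    kept_paragraphs = []
    for paragraph in text.split("\n\n"):
        lines = [l for l in paragraph.split("\n") if l.strip()]
        if not lines:
            continue
        duplicated = sum(1 for l in lines if l in bloom)
        for l in lines:
            bloom.add(l)  # details of when lines are inserted are an assumption
        if duplicated / len(lines) <= max_dup_ratio:
            kept_paragraphs.append(paragraph)
    return "\n\n".join(kept_paragraphs)
```

In practice the filter would be pre-initialized, as described above, with lines already seen at least twice in a random subset of Common Crawl webpages.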

Finally, we perform model-based quality filtering. To do so, we collected textual content from the following high-quality sources: Wikipedia, Stack Exchange, scientific articles from pes2o and textbooks from LibreText and Wikibooks. We then trained a fastText classifier to distinguish lines from random Common Crawl webpages versus lines from our high-quality sources. We use a line-level classifier, instead of a document-level one, because the task of distinguishing high-quality documents from random webpages is too easy, often relying on shallow cues such as formatting; a classifier trained at the document level therefore performs poorly when filtering Common Crawl webpages. A line-level classifier, on the other hand, is significantly better at identifying high-quality lines, even when they come from Common Crawl data. We obtain a document score by computing the average of the line scores, weighted by the length of each line. We train a multi-label classifier with eight labels: wikipedia, textbooks, science, STEM, pop culture, life, humanities and random. The STEM, pop culture, life and humanities labels correspond to subsets of the Stack Exchange dataset. Having access to different labels for the high-quality webpages gives us a better understanding of the content of our training data, and lets us train specialized models on subsets of the data, as we discuss below. To obtain multilingual classifiers, we translate the training data of the fastText classifier with MADLAD, except for the wikipedia and random classes, which are available for all the languages we are interested in.
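The document scoring can be sketched as follows, assuming a fastText multi-label classifier whose label names mirror the eight classes above; the model filename, the exact label strings, and the way a per-line quality probability is aggregated from the labels are assumptions for illustration.

```python
# Sketch of model-based quality scoring: the document score is the length-weighted
# average of line-level fastText scores. Label names and the aggregation of a
# per-line "quality" probability are assumptions, not the exact dactory code.
import fasttext

QUALITY_LABELS = {"__label__wikipedia", "__label__textbooks", "__label__science",
                  "__label__stem", "__label__pop_culture", "__label__life",
                  "__label__humanities"}

classifier = fasttext.load_model("quality_classifier.en.bin")  # hypothetical filename

def document_score(text):
    total_weight = 0.0
    weighted_score = 0.0
    for line in text.split("\n"):
        line = line.strip()
        if not line:
            continue
        labels, probs = classifier.predict(line, k=-1)  # probabilities for all labels
        quality = sum(p for l, p in zip(labels, probs) if l in QUALITY_LABELS)
        weight = len(line)  # weight each line score by its length
        weighted_score += weight * quality
        total_weight += weight
    return weighted_score / total_weight if total_weight else 0.0
```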

As part of this release, we are sharing the fastText models for quality filtering of data in the 24 official EU languages, as well as the Bloom filter initialized from lines found at least twice in a subset of Common Crawl webpages. It takes approximately 4 hours to process one shard (out of 100) of a Common Crawl dump on 8 cores, making it possible to process a full dump in about 4 days on a single machine with 32 cores. The dataset created with dactory is around 770 GB compressed and 2 TB uncompressed, and contains roughly 400M text documents. Around 60% of the documents are in English, 8% in Spanish, 7% in Dutch and 7% in French.
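As a back-of-envelope check that these numbers are consistent: 100 shards at 4 hours each, processed 4 at a time on a 32-core machine, take about 100 hours.

```python
# Back-of-envelope check of the processing time quoted above.
shards_per_dump = 100
hours_per_shard_on_8_cores = 4
cores = 32
parallel_shards = cores // 8                                   # 4 shards at a time
total_hours = shards_per_dump / parallel_shards * hours_per_shard_on_8_cores
print(total_hours, total_hours / 24)                           # 100.0 hours ~= 4.2 days
```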

Helium architecture and training

Helium 1 is based on the transformer architecture. Following recent work, it uses common improvements over the standard transformer architecture, such as pre-normalization with RMSNorm, rotary position embeddings and feed-forward layers based on gated linear units with a SiLU activation. To make inference more efficient, Helium 1 also uses grouped-query attention. Overall, the architecture is very similar to LLaMA 2. In the following table, we give the hyperparameters related to the model and its training. The model is trained with a batch size of 4M tokens for 500,000 steps on 64 H100 GPUs, by distilling Gemma 2 9B (we replaced Gemma's tokenizer with our own, and fine-tuned Gemma for 20,000 steps to adapt it to the new tokenizer). For the first 200k steps, we train on documents with a quality score higher than 0.2. We then increase this threshold to 0.25 for the next 200k steps, and finally to 0.35 for the last 100k steps.
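For reference, here is a brief PyTorch sketch of two of the components named above, RMSNorm pre-normalization and the SiLU-gated feed-forward layer; the dimensions are placeholders, and this is not Helium 1's actual model or training code.

```python
# Sketch of two architectural components mentioned above: RMSNorm and a
# SiLU-gated feed-forward block (SwiGLU). Dimensions are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        # Normalize by the root-mean-square of the activations, then rescale.
        norm = x * torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return norm * self.weight

class SwiGLU(nn.Module):
    """Gated linear unit with SiLU activation, as used in the feed-forward layers."""
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden_dim, bias=False)
        self.w_up = nn.Linear(dim, hidden_dim, bias=False)
        self.w_down = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x):
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

# Grouped-query attention amounts to using fewer key/value heads than query heads,
# e.g. 16 query heads sharing 4 key/value heads (placeholder values, not Helium 1's).
```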

Model soups

A model soup is obtained by combining the parameters of multiple models, trained using different hyper-parameters or subsets of the data, to get a model with better performance or better out-of-distribution generalization. The easiest way to obtain a model soup is to compute a uniform average of the models' parameters; other methods for combining the parameters of different models have also been explored. In the following, we only consider weighted averages of model parameters to obtain the final model. To obtain the elements of the soup, we take the checkpoint of our main training run at step 450k (out of 500k steps), and restart training on subsets of the data for the last 50k steps, with a reduced batch size of 500k tokens. Each subset of the data corresponds to a particular class of our quality filter: we keep documents for which the score of that particular class is higher than a threshold (in practice, we used 0.3). We report the performance of three specialized models, trained on the wiki, books and life subsets respectively, to illustrate the impact of the training data on performance.

Next, we compute model soups by averaging the parameters of individual models. In particular, we consider two soups: the first is the uniform average of the main model and the seven specialized models corresponding to the high-quality classes. The second, which corresponds to Helium 1, is the weighted average of the main training run, the books model, the wiki model and the multilingual model, with the weights [2, 2, 1, 1]. We report the performance of these two models in the following table.
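A weighted soup of this kind can be sketched in a few lines of PyTorch; the checkpoint paths below are placeholders, while the [2, 2, 1, 1] weights are the ones listed above.

```python
# Sketch of a weighted model soup: a weighted average of the parameters of the
# main run and three specialized models. Checkpoint paths are placeholders.
import torch

def weighted_soup(state_dicts, weights):
    total = sum(weights)
    soup = {}
    for key in state_dicts[0]:
        soup[key] = sum(w * sd[key].float() for sd, w in zip(state_dicts, weights)) / total
    return soup

checkpoints = ["main.pt", "books.pt", "wiki.pt", "multilingual.pt"]  # placeholder paths
state_dicts = [torch.load(path, map_location="cpu") for path in checkpoints]
soup = weighted_soup(state_dicts, weights=[2, 2, 1, 1])
torch.save(soup, "helium1_soup.pt")
```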

Multilingual results

Finally, we evaluate our models on diverse multilingual tasks, such as multiple-choice and open-domain question answering, common sense reasoning and translation, covering 23 European languages other than English. More precisely, we evaluate on the translated variants of ARC, MMLU and HellaSwag distributed by the Eurolingua project, on the FLORES dataset and on MKQA.

Models and code

Helium 1 models on HuggingFace

Dactory pipeline on GitHub

Dactory models on HuggingFace