
Announcing Helium-1 Preview

Author: Kyutai Team

We are excited to release a preview of our new backbone language model, Helium-1. Like the chemical element it is named after, Helium-1 is a lightweight model, with around 2B parameters. Our goal with this model is to enable the development of A.I. systems running on edge and mobile devices. At Kyutai, we believe that latency and privacy are two key elements of personal A.I. systems, and deploying models locally is a good way to achieve both. Helium-1 will be a multilingual language model: the preview version currently supports 6 languages, showing strong capabilities in those languages compared to existing open-weights models. We will add support for more languages in the future.

Takeaways:

- Helium-1 Preview is a lightweight language model with around 2B parameters, aimed at A.I. systems running on edge and mobile devices.
- It currently supports 6 languages (English, French, German, Italian, Portuguese and Spanish), showing strong capabilities compared to existing open-weights models in the 1.5B to 3B parameter range.
- It was trained on 2.5T tokens from publicly available sources only, using token-level distillation of a 7B-parameter model.

How we trained Helium-1 Preview

Helium-1 is based on the transformer architecture and uses standard improvements such as pre-normalization with RMSNorm, rotary position embeddings, and feed-forward layers based on gated linear units with the SiLU activation. Overall, our architecture is almost identical to the one introduced by LLaMA 1, allowing straightforward deployment with existing tools such as MLX, vLLM, ollama or llama.cpp.
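To make these components concrete, here is a minimal sketch (in PyTorch, not our actual training code) of the two pieces named above: RMSNorm pre-normalization and a SiLU-gated feed-forward layer. The class names and dimensions are illustrative assumptions, not Helium-1's actual configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class RMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-5):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Scale by the inverse root mean square of the features,
        # then apply a learned per-dimension gain.
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight


class GatedFeedForward(nn.Module):
    """SiLU-gated linear unit: silu(x W_gate) * (x W_up), projected back down."""

    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden_dim, bias=False)
        self.w_up = nn.Linear(dim, hidden_dim, bias=False)
        self.w_down = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))


# Pre-normalization: the norm is applied inside the residual branch,
# before the sub-layer, rather than after the residual sum.
x = torch.randn(2, 16, 512)  # (batch, sequence, model dim) -- illustrative sizes
x = x + GatedFeedForward(512, 1408)(RMSNorm(512)(x))
```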

For our training dataset, we relied only on publicly available sources, so that our dataset is reproducible. We used curated data sources, including Wikipedia, Stack Exchange and scientific articles (peS2o), as well as filtered webpages from Common Crawl. Our pipeline to filter Common Crawl starts from a WARC dump consisting of HTML webpages. We extract the textual content with resiliparse, apply language identification with fastText, and perform deduplication at the paragraph level. Finally, to filter out low-quality webpages, we train a fastText classifier to distinguish high-quality content from random Common Crawl webpages. This classifier is trained at the line level, and an aggregated score for each document is obtained by averaging the line scores, weighted by line length (see the sketch below). Following previous work, we also apply a data curriculum: for the last 100k steps, we increase the threshold of our quality filter and include a subset of the high-quality Dolmino mix.
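As an illustration of the aggregation step, the sketch below computes a document score as the length-weighted average of per-line classifier scores. The function name, label name and threshold are assumptions for illustration; only the weighting scheme comes from the description above.

```python
from typing import Callable


def document_quality(text: str, line_score: Callable[[str], float]) -> float:
    """Aggregate per-line quality scores into a document score,
    weighting each line by its length in characters."""
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return 0.0
    weights = [len(line) for line in lines]
    scores = [line_score(line) for line in lines]
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)


# Usage with a line-level fastText classifier (model path, label name and
# threshold are assumptions, not the ones used for Helium-1):
#
#   import fasttext
#   model = fasttext.load_model("quality_classifier.bin")
#
#   def line_score(line: str) -> float:
#       labels, probs = model.predict(line)
#       return probs[0] if labels[0] == "__label__hq" else 1.0 - probs[0]
#
#   keep = document_quality(doc_text, line_score) >= 0.5
```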

Our model is trained on 2.5T tokens, with a context size of 4096 tokens and a global batch size of 1024 sequences. We use token-level distillation of a 7B-parameter model to train Helium-1 Preview.
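The sketch below shows what token-level distillation typically looks like: the student is trained to match a frozen teacher's next-token distribution at every position with a KL objective. The temperature and any mixing with a standard cross-entropy term are assumptions; the exact distillation loss used for Helium-1 is not specified here.

```python
import torch
import torch.nn.functional as F


def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 1.0) -> torch.Tensor:
    """KL divergence between the teacher's and the student's next-token
    distributions, averaged over all token positions.

    Both tensors have shape (batch, sequence, vocab); the teacher is frozen.
    """
    vocab = student_logits.size(-1)
    teacher_logprobs = F.log_softmax(teacher_logits.detach() / temperature, dim=-1).view(-1, vocab)
    student_logprobs = F.log_softmax(student_logits / temperature, dim=-1).view(-1, vocab)
    # log_target=True: the target is also given as log-probabilities;
    # "batchmean" averages over the flattened token positions.
    kl = F.kl_div(student_logprobs, teacher_logprobs, log_target=True, reduction="batchmean")
    return kl * temperature ** 2
```

In a training loop, this loss would be computed from the student's and the 7B teacher's logits on the same token sequence, possibly combined with the usual next-token cross-entropy; that combination is an assumption, not a detail stated in this post.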

Helium-1 Preview evaluations

We evaluate Helium-1 Preview on standard benchmarks, including closed-book question answering, common sense reasoning, machine translation, and multiple-choice question answering on high-school and college subjects. We include evaluations in English as well as in the five other languages currently supported by Helium-1 Preview: French, German, Italian, Portuguese and Spanish. We compare Helium-1 to other existing open-weights models with 1.5B to 3B parameters.

English Results

| Bench | Helium-1 Preview (2.2B) | HF SmolLM2 (1.7B) | Gemma-2 (2.6B) | Llama-3.2 (3B) | Qwen2.5 (1.5B) |
|---|---|---|---|---|---|
| MMLU | 51.2 | 50.4 | 53.1 | 56.6 | 61.0 |
| NQ | 17.3 | 15.1 | 17.7 | 22.0 | 13.1 |
| TQA | 47.9 | 45.4 | 49.9 | 53.6 | 35.9 |
| ARC E | 80.9 | 81.8 | 81.1 | 84.6 | 89.7 |
| ARC C | 62.7 | 64.7 | 66.0 | 69.0 | 77.2 |
| OBQA | 63.8 | 61.4 | 64.6 | 68.4 | 73.8 |
| CSQA | 65.6 | 59.0 | 64.4 | 65.4 | 72.4 |
| PIQA | 77.4 | 77.7 | 79.8 | 78.9 | 76.0 |
| SIQA | 64.4 | 57.5 | 61.9 | 63.8 | 68.7 |
| HellaSwag | 69.7 | 73.2 | 74.7 | 76.9 | 67.5 |
| WinoGrande | 66.5 | 65.6 | 71.2 | 72.0 | 64.8 |
| Average | 60.7 | 59.3 | 62.2 | 64.7 | 63.6 |

Multilingual Results

| Language | Benchmark | Helium-1 Preview (2.2B) | HF SmolLM2 (1.7B) | Gemma-2 (2.6B) | Llama-3.2 (3B) | Qwen2.5 (1.5B) |
|---|---|---|---|---|---|---|
| German | MMLU | 45.6 | 35.3 | 45.0 | 47.5 | 49.5 |
| | ARC C | 56.7 | 38.4 | 54.7 | 58.3 | 60.2 |
| | HellaSwag | 53.5 | 33.9 | 53.4 | 53.7 | 42.8 |
| | MKQA | 16.1 | 7.1 | 18.9 | 20.2 | 10.4 |
| | FLORES | 33.9 | 12.2 | 30.7 | 28.2 | 20.8 |
| Spanish | MMLU | 46.5 | 38.9 | 46.2 | 49.6 | 52.8 |
| | ARC C | 58.3 | 43.2 | 58.8 | 60.0 | 68.1 |
| | HellaSwag | 58.6 | 40.8 | 60.5 | 61.1 | 51.4 |
| | MKQA | 16.0 | 7.9 | 18.5 | 20.6 | 10.6 |
| | FLORES | 25.7 | 15.0 | 25.7 | 23.7 | 20.4 |
| French | MMLU | 46.0 | 37.7 | 45.7 | 48.8 | 51.9 |
| | ARC C | 57.9 | 40.6 | 57.5 | 60.1 | 67.4 |
| | HellaSwag | 59.0 | 41.1 | 60.4 | 59.6 | 51.2 |
| | MKQA | 16.8 | 8.4 | 18.4 | 19.6 | 9.7 |
| | FLORES | 44.3 | 20.0 | 43.3 | 39.3 | 31.2 |
| Italian | MMLU | 46.1 | 36.3 | 45.6 | 48.8 | 50.5 |
| | ARC C | 57.4 | 39.1 | 53.9 | 60.1 | 64.6 |
| | HellaSwag | 55.2 | 37.7 | 56.2 | 56.8 | 46.8 |
| | MKQA | 15.3 | 6.3 | 18.0 | 19.0 | 9.9 |
| | FLORES | 25.8 | 10.4 | 25.2 | 23.8 | 16.4 |
| Portuguese | MMLU | 46.2 | 37.7 | 45.6 | 49.2 | 53.0 |
| | ARC C | 56.8 | 40.6 | 57.0 | 62.1 | 66.6 |
| | HellaSwag | 57.3 | 41.0 | 58.7 | 59.1 | 50.9 |
| | MKQA | 14.7 | 6.6 | 16.9 | 19.1 | 9.2 |
| | FLORES | 43.0 | 20.0 | 43.6 | 40.5 | 33.0 |
| | Average | 42.1 | 27.8 | 42.3 | 43.6 | 40.0 |

What to expect in the future

This is a preview version of the 2B-parameter LLM that we are currently developing and plan to release in the coming months. The full version of the model will support more languages and will have better capabilities. We are also planning to release our training codebase, written in JAX, as well as the pre-processing pipeline to reproduce our training dataset. Feel free to reach out if you have any feedback about this model!

Model card

See the HuggingFace repository.