EU's 24-Language AI and the Training Data Bottleneck

Executive Summary

On June 19, 2026, the European Commission selected the EUROPA consortium to build the bloc's own frontier AI model. The Italian company Domyn leads it, with Germany's Fraunhofer on board. The promise is not a simple one: an open-source model of more than 400 billion parameters, trained in all 24 of the EU's official languages, delivered within a year. The ambition is real. Whether it is achievable, though, turns on the subject the announcement said least about: training data.

The number worth watching is not 400 billion. Maltese accounts for 0.03% of Common Crawl, the giant web corpus. The moment 24 languages became a condition, the real test stopped being the size of the model and became the training data for the smallest language. EuroHPC has carved out a year of compute, but who will collect the Maltese and Latvian corpora, and at what quality, has not yet been settled.

This gap is what turns 24 languages into a technical and political condition, not a feature. Weigh the state of low-resource-language data and the risk of betting a year on a single consortium, and the smallest number in the announcement outlasts the largest one.

The tension in the announcement lives in four numbers. The model has more than 400 billion parameters; Maltese, the official language with the smallest web footprint, makes up 0.03% of web data; two American companies captured roughly 80% of the money raised by the Forbes AI 50; and the secured compute runs for one year. The easy numbers and the hard numbers are sitting in the same room.

400B+: EUROPA model size (24-language open-source MoE)
0.03%: Maltese share of web data (Common Crawl; low-resource scarcity)
~80%: Capital held by two US firms (OpenAI + Anthropic, Forbes AI 50)
1 year: EuroHPC compute support (2.5% of total capacity; data pacts TBD)

1. What Just Happened

The European Commission named the EUROPA consortium the winner of its "Frontier AI Grand Challenge." The lead company is Domyn, headquartered in Milan, an outfit that has built AI for regulated industries and was formerly known as iGenius. It is run by CEO Uljan Sharka, born in 1992. Germany's research institute Fraunhofer joined as a core partner.

The terms of the grant are explicit. Build a Mixture-of-Experts model of more than 400 billion parameters, train it in all 24 of the EU's official languages, and release the weights as open source. For compute, the project gets 2.5% of EuroHPC's total supercomputing capacity for one year. Sharka called EuroHPC an "underrated strategic asset," arguing that training a frontier model once takes far less compute than serving it to hundreds of millions of people.

Why now? Behind it sits a concentration of capital. On the Forbes AI 50, two companies — OpenAI and Anthropic — captured roughly 80% of all money raised. In the first quarter of 2026, those two plus xAI and Waymo, just four firms, absorbed 65% of global venture investment. The words of Henna Virkkunen, the Commission's executive vice-president for tech sovereignty, lay the context bare: "Europe cannot remain a passive consumer of technology developed elsewhere."

▲ The Berlaymont in Brussels, Belgium: European Commission headquarters, where the EUROPA consortium selection was announced | Source: Wikimedia Commons (CC BY-SA 2.5)

2. Why 24 Languages Is a Condition

The "all 24 languages" requirement is not a marketing line. In the EU, language is a matter of citizenship. If a model is fluent in English and German but poor in Maltese and Latvian, citizens of those language communities become, in effect, second-class AI users. The risk the Commission's own documents flag is exactly that: the fewer the resources a language has, the worse the performance, and the flimsier the safety evaluation.

▲ Valletta, Malta, from the air. Malta is the EU member state whose official language has the smallest share of web data: Maltese accounts for 0.03% of Common Crawl, a large public web-crawl corpus | Source: Jonathan Mercieca / Wikimedia Commons (CC BY-SA 4.0)

So language equality is both a technical and a political condition. The EU would struggle to wave a model through its regulatory gate if that model prioritized English and bolted on the rest. With the AI Act in force, "fair multilingual performance" is closer to a requirement than a choice. Nailing 24 languages down as a condition from the outset reads as a judgment that the requirement cannot be met by retrofitting languages later.

The crux is not translation ability. To treat 24 languages as equals, the model has to be trained well enough in each one. And training is a matter of data. The moment the condition is set in terms of language, the bottleneck moves automatically to data.

3. The Real Bottleneck Is the Corpus

The numbers sketch the shape of the problem. In Common Crawl, the standard source of web data, Maltese makes up 0.03% of the total, Irish 0.07%, and Latvian 0.09%. Add up the entire lower-resourced half of the EU's languages and the total still falls short of 2.4%. Scraping together a small language from an internet filled with English is a different kind of labor from training a large model.

The data composition of EuroLLM 22B, regarded as the largest open-source European model, makes the reality plainer. English is 50%, the five major Western European languages are 27%, high-resource global languages are 14%, and all the remaining EU languages combined come to just 9%. Even up-sampling low-resource languages by as much as 2.5x did not erase that imbalance. EUROPA's target of 400 billion parameters is about 18 times the size of EuroLLM 22B, but scaling up the model does not conjure Maltese sentences that were never there.

▲ EuroLLM 22B training data composition. All other EU languages combined reach just 9% — scaling the model up does not fix this imbalance
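The arithmetic behind that limit can be sketched in a few lines. This is an illustration under stated assumptions, not EuroLLM's actual sampling recipe: it multiplies one group's weight by a factor and renormalizes the mixture. The 4% starting share and the group names are hypothetical; only the 2.5x factor and the 400B and 22B sizes come from the text above.

```python
def upsample(shares, factors):
    """Scale each group's share by its factor, then renormalize so shares sum to 1."""
    weighted = {name: share * factors.get(name, 1.0) for name, share in shares.items()}
    total = sum(weighted.values())
    return {name: weight / total for name, weight in weighted.items()}

# Hypothetical natural mixture (assumption): low-resource EU languages at 4% of tokens.
natural = {"everything_else": 0.96, "low_resource_eu": 0.04}
mixed = upsample(natural, {"low_resource_eu": 2.5})
low_share = mixed["low_resource_eu"]  # 0.10 / 1.06

# Parameter ratio quoted in the text: 400B target vs. EuroLLM 22B.
size_ratio = 400e9 / 22e9

print(f"low-resource share after 2.5x up-sampling: {low_share:.1%}")
print(f"EUROPA / EuroLLM parameter ratio: {size_ratio:.1f}x")
```

Up-sampling only shows the model the same text more often. It shifts the mixture, but it does not increase the number of distinct sentences available in a language.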

Quantity is not the only problem; quality and breadth of expression are too. Speech is a stark example. In Maltese automatic speech recognition, OpenAI's Whisper posted the lowest performance among its supported languages, and the publicly available supervised speech data amounts to just 16 hours. That is too little for the model even to converge, forcing teams to pull in machine-labeled data as a stopgap. The EU's official target is at least one billion tokens per low-resource language — and where those billion tokens will come from is a question that has to be solved before 400 billion parameters.
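To give the one-billion-token target a sense of scale, here is a back-of-the-envelope sketch. It assumes, purely for illustration, that the Common Crawl percentages quoted in this article can be read as token shares; crawl statistics may be measured on a different basis, so treat the results as order-of-magnitude only. The `keep_rate` parameter, standing for the fraction of text that survives filtering and deduplication, is hypothetical and not a figure from the source.

```python
TARGET_TOKENS = 1_000_000_000  # "at least one billion tokens per low-resource language"

# Common Crawl shares quoted in the article, written as fractions.
CRAWL_SHARE = {"Maltese": 0.0003, "Irish": 0.0007, "Latvian": 0.0009}

def raw_tokens_needed(share, target=TARGET_TOKENS, keep_rate=1.0):
    """Tokens of mixed-language crawl needed to yield `target` tokens of one language.

    keep_rate is a hypothetical fraction of text kept after cleaning (assumption).
    """
    if not 0 < share <= 1 or not 0 < keep_rate <= 1:
        raise ValueError("share and keep_rate must be in (0, 1]")
    return target / (share * keep_rate)

for language, share in CRAWL_SHARE.items():
    needed = raw_tokens_needed(share)
    print(f"{language}: about {needed / 1e12:.2f} trillion crawl tokens")
```

Under these assumptions, Maltese alone would require on the order of 3.3 trillion crawl tokens before any filtering, and any `keep_rate` below 1 pushes that number higher.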

Sharka said the consortium would sign data pacts with European national governments within weeks. The direction is right, but a contract and a quality guarantee are not the same thing. Documents held by a government do not automatically become a clean corpus ready for training. The governance of which data, by what standard, cleaned and verified by whom, is not yet visible.
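One way to make that governance question concrete is a per-batch manifest that records who supplied, cleaned, and verified the data. The sketch below is hypothetical: the `CorpusBatch` type, its field names, and the readiness rule are illustrative assumptions, not anything the consortium or the Commission has published.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class CorpusBatch:
    """Hypothetical manifest entry for one batch of text handed over under a data pact."""
    language: str                     # e.g. "mt" for Maltese
    source: str                       # who supplied the documents
    token_count: int = 0
    license: Optional[str] = None     # terms under which the text may be used
    cleaning_standard: Optional[str] = None  # named standard the batch was cleaned to
    cleaned_by: Optional[str] = None
    verified_by: Optional[str] = None

REQUIRED = ("license", "cleaning_standard", "cleaned_by", "verified_by")

def missing_governance_fields(batch):
    """Return the governance fields that are still empty for this batch."""
    return [name for name in REQUIRED if not getattr(batch, name)]

def is_training_ready(batch):
    """A batch counts as ready only if it has text and every governance field is filled."""
    return batch.token_count > 0 and not missing_governance_fields(batch)

# Placeholder values for illustration only.
batch = CorpusBatch(language="mt", source="example national archive", token_count=1_000)
print(missing_governance_fields(batch))
print(is_training_ready(batch))
```

A signed contract would fill in `source` at most; the remaining fields are the part of the pipeline that the announcement leaves open.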

4. The Structural Risk of a Single Bet

The structure itself invites scrutiny: one consortium, one year of compute, and undisclosed funding. Investors named include Abu Dhabi's G42, Eurizon Capital, Rabobank, and BNY, but the amount raised has not been revealed. There is a wide gap between saying "we announced a 400B model" and "a 400B model researchers can actually download and use now exists."

Precedent counsels caution too. France's Mistral, over more than three years and billions of euros, has not fully closed the gap with American frontier models. This is not the kind of thing that ends with a single commission and a one-year deadline. Real sovereign AI infrastructure comes not from a one-year contract but from a sustained system that keeps collecting, cleaning, and verifying data.

To put it plainly, EUROPA has secured one axis — compute. But the other axis, the data to hold up 24 languages, is still being designed. That the number the announcement played up most is the easy part, and the part it barely touched is the hard part, sums up the tension of this project.

5. Why It Still Matters

Naming the risks does not make the commission meaningless. If anything, its value as a signal is large. Europe has declared "we can build it too," and it has put a public asset — its supercomputers — to work as a real resource. Positioning EuroHPC as an "AI public good" holds up as a long-term infrastructure strategy, quite apart from any short-term result.

That said, it is easy to lose the way if you set the bar for success at the launch of a single model. The real finish line is not the day 400 billion parameters of weights are released, but whether a system is in place to build and maintain data for each of the 24 languages at a known quality. A model is trained once and done; data keeps accumulating, rotting, and being refreshed. Sovereignty arrives only when you can manage that flow with your own hands.

Editor's Note

Pebblous reads this announcement from a data perspective for a clear reason. To train a model properly in any language, that language's data first has to be in a state worth training on. Checking whether there is enough of it, whether the quality is even, whether bias is absent, and whether provenance is traceable comes before scaling the model up. The proposition that the bottleneck for sovereign AI is not compute but "ready national-language data" sits in exactly the same place as the on-the-ground problems Pebblous faces in its work on data quality verification.

References

Academic Papers

1. EuroLLM Consortium. (2025). EuroLLM: Multilingual Language Models for Europe. arxiv.org/abs/2506.04079
2. University of Malta NLP Group. (2022). BERTu: Pre-training BERT for Maltese. arxiv.org/abs/2205.10517

Official Documents

3. European Commission. (19 June 2026). Commission selects EUROPA consortium winner of Frontier AI Grand Challenge project to build a European open AI model. digital-strategy.ec.europa.eu
4. European Commission. Language data and AI: using AI to break down language barriers. translation.ec.europa.eu

Industry & Press

5. European Express. (19 June 2026). Europe chooses its own frontier AI builder. european.express