
Why LLMs Perform Better in Italian, French, and Polish Than in English

New multilingual AI research shows that Romance and Slavic languages outperform English in long-context reasoning. Here's why fusional morphology and information-dense grammar give LLMs clearer signals and higher accuracy.

By Reuben Lopez · November 25, 2025 · 10 min read
[Figure: LLM language performance comparison]

Artificial intelligence researchers have been quietly uncovering a surprising truth about how large language models behave across different languages.

Despite being trained largely on English and Chinese, LLMs do not perform best in those languages.

New research from the ONERULER benchmark — a multilingual long-context evaluation created by the University of Maryland, Microsoft, and UMass — shows that Latin-based and Slavic languages actually outperform English, especially in tasks involving long sequences of text.

This includes:

  • Italian
  • French
  • Spanish
  • Portuguese
  • Polish
  • Russian

According to the ONERULER heatmaps (Figure 3, Page 5) and aggregated language rankings (Figure 4b, Pages 6–7), these languages consistently rank above English in retrieval accuracy, reasoning consistency, and context retention over 64,000–128,000 token windows.
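The long-context tasks behind these numbers are needle-in-a-haystack retrieval variants: a fact is buried deep inside filler text and the model is asked to recover it. As a rough illustration of how such a test works, here is a minimal sketch. The `ask_model` argument and the `fake_model` stub are stand-ins for a real LLM call, not part of the benchmark itself.

```python
def build_haystack(needle: str, filler: str, n_paragraphs: int, position: float) -> str:
    """Bury a 'needle' sentence inside repeated filler paragraphs.

    position is a fraction in [0, 1] controlling how deep the needle sits.
    """
    paragraphs = [filler] * n_paragraphs
    insert_at = int(position * n_paragraphs)
    paragraphs.insert(insert_at, needle)
    return "\n\n".join(paragraphs)

def score_retrieval(ask_model, needle_value: str, haystack: str, question: str) -> bool:
    """Return True if the model's answer contains the buried value."""
    answer = ask_model(f"{haystack}\n\n{question}")
    return needle_value in answer

# Toy run with a fake "model" that just searches the prompt text,
# standing in for a real LLM call.
needle = "The secret number for the report is 7421."
haystack = build_haystack(needle, "The quick brown fox jumps over the lazy dog.", 200, 0.5)

def fake_model(prompt: str) -> str:
    return "7421" if "7421" in prompt else "unknown"

print(score_retrieval(fake_model, "7421", haystack, "What is the secret number?"))  # True
```

Real long-context benchmarks vary the needle's depth and the haystack length, then plot accuracy per language and per context size, which is what the ONERULER heatmaps show.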

Below are the core insights that explain why this happens — and why it matters for anyone using AI tools today.


1. Latin-based languages encode more meaning per word

Romance languages (Italian, French, Spanish, Portuguese) contain more information inside each word than English does. This is known as morphological richness, and it gives LLMs more signals to work with.

In the ONERULER benchmark, Romance languages sit near the top across all tested models, including Gemini 1.5 Flash and Qwen 2.5.

Why does this help?

Because when every word carries grammatical hints — gender, number, verb tense, sentence role — the model doesn't have to guess.

English, by comparison, hides much of that information and relies heavily on word order and context, two things that become harder for an LLM to track in long passages.

A simple example

English:

"I saw my friend."

This sentence gives the model no clues about:

  • the friend's gender
  • the grammatical role of "friend," which is marked only by word order

Spanish:

"Vi a mi amigo."

or

"Vi a mi amiga."

Spanish communicates:

  • masculine or feminine
  • the object of the action (via "a mi…")
  • the subject ("vi" already encodes "I")

All in the same number of words.

The richer the signal, the fewer interpretations the LLM must juggle — which boosts accuracy.
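To make the contrast concrete, here is a toy lookup covering a few Spanish preterite forms of "ver" (to see). The point is that the ending alone pins down person and number, while English "saw" is the same string for every subject. This is a simplified illustration, not a real morphological analyzer.

```python
# Toy illustration: Spanish preterite forms of "ver" (to see),
# where the ending alone encodes person and number.
SPANISH_VER_PRETERITE = {
    "vi":     {"person": 1, "number": "singular"},  # I saw
    "viste":  {"person": 2, "number": "singular"},  # you saw
    "vio":    {"person": 3, "number": "singular"},  # he/she saw
    "vimos":  {"person": 1, "number": "plural"},    # we saw
    "vieron": {"person": 3, "number": "plural"},    # they saw
}

# English "saw" is identical for every subject, so the word form
# itself carries none of this information.
ENGLISH_SAW = {"person": None, "number": None}

def analyze(form: str) -> dict:
    """Return what the word form alone tells us (toy example)."""
    return SPANISH_VER_PRETERITE.get(form, ENGLISH_SAW)

print(analyze("vi"))   # {'person': 1, 'number': 'singular'}
print(analyze("saw"))  # {'person': None, 'number': None}
```

Every Spanish form resolves unambiguously; the English form resolves to nothing, leaving the model to recover subject and number from surrounding context.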


2. Slavic and Romance languages dominate because they're "fusional"

The ONERULER study highlights that the top-performing languages all share a key trait: they are fusional languages.

A fusional language is one where the endings of words change to express grammatical meaning. A single ending can encode multiple pieces of information at once.

What this means in practice

In languages like Polish, Russian, Italian, Spanish, and French, the end of a word can signal:

  • gender
  • case (role in the sentence)
  • number
  • tense
  • mood
  • sometimes even prepositional relationships

This structure does two huge things for LLMs:

1. It reduces ambiguity (less guessing)

English forces the model to infer meaning from context or strict word order.

Fusional languages embed meaning directly into the word.

2. It strengthens long-context memory

When a model reads a 64k–128k token passage, every word-ending acts like a breadcrumb trail that anchors relationships between nouns, verbs, and ideas.

This is why Polish ranks #1 in long-context accuracy across models, with Romance languages close behind.

Morphology becomes a built-in compass that keeps the model from drifting.
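The breadcrumb idea can be sketched with a single Polish noun, "kot" (cat). Real Polish declension has many noun classes; this toy covers just one word, to show that the ending marks the noun's role no matter where the word appears in the sentence.

```python
# Toy sketch: recovering a noun's role from its ending, for the
# single Polish noun "kot" (cat). Real declension is far richer.
KOT_FORMS = {
    "kot":   "subject (nominative)",
    "kota":  "object (accusative/genitive)",
    "kotu":  "recipient (dative)",
    "kotem": "instrument/companion (instrumental)",
}

def role_of(word: str) -> str:
    return KOT_FORMS.get(word.lower(), "unknown - need word order/context")

# The role survives reordering: both sentences mean "the dog sees the cat".
for sentence in ["Pies widzi kota.", "Kota widzi pies."]:
    words = [w.strip(".").lower() for w in sentence.split()]
    roles = {w: role_of(w) for w in words if w.startswith("kot")}
    print(sentence, "->", roles)
```

In English, "The dog sees the cat" and "The cat sees the dog" mean opposite things, because role lives entirely in word order; in the Polish toy above, "kota" stays the object wherever it sits, which is exactly the kind of anchor that helps over very long passages.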


3. The language you prompt in can significantly change model accuracy

Most people think prompt engineering is all about:

  • formatting
  • step-by-step instructions
  • chain-of-thought
  • rules and constraints

But ONERULER's cross-lingual tests reveal that the language you choose can change accuracy by up to 20%, even when the underlying content is identical.

(Figure 6a, Page 8)

Why?

Because the model:

  • assigns different weights to grammatical signals in each language
  • interprets structure differently
  • accesses different learned representations
  • encounters different ambiguity levels

This opens a new—and massively under-discussed—strategy:

Multilingual prompting.

If you want:

  • better reasoning → try Italian
  • better retrieval → try Polish
  • more stable context tracking → try Spanish or French
  • fewer hallucinations → many fusional languages outperform English

Then translate the answer back to English at the end.

It's a tiny shift that can produce disproportionately better results.
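The translate-ask-translate loop described above is simple to wire up. Here is a minimal sketch of the plumbing; `translate` and `ask_model` are deliberately left as stand-in callables for whatever translation and LLM services you actually use, and the fake implementations below exist only to demonstrate the flow.

```python
def multilingual_ask(task_en: str, work_lang: str, translate, ask_model) -> str:
    """Run an English task through another working language.

    translate(text, target_lang) and ask_model(prompt) are stand-ins
    for real translation and LLM services.
    """
    # 1. Translate the task into the working language (e.g. "pl", "it").
    task_local = translate(task_en, work_lang)
    # 2. Ask the model in that language.
    answer_local = ask_model(task_local)
    # 3. Translate the answer back to English.
    return translate(answer_local, "en")

# Minimal demo with fake services, just to show the plumbing.
def fake_translate(text: str, target: str) -> str:
    return f"[{target}] {text}"

def fake_model(prompt: str) -> str:
    return f"answer to: {prompt}"

print(multilingual_ask("Summarize the report.", "pl", fake_translate, fake_model))
```

Because both services are injected as parameters, swapping the fakes for real translation and chat endpoints changes nothing about the loop itself.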


Conclusion: An overlooked edge in AI productivity

The ONERULER benchmark makes one thing clear:

The choice of language is just as important as the structure of the prompt.

English may dominate the internet, business, and training corpora — but linguistically, it gives LLMs the least help.

Fusional languages like Italian, Spanish, French, Russian, and Polish pack more meaning per word, reduce ambiguity, and help models maintain accuracy over long sequences.

This gives multilingual users a new advantage and offers everyone else a new layer of prompt strategy they can apply immediately.

In an AI landscape where everyone has access to the same tools, small edges compound.

Multilingual prompting may become one of the simplest and most effective.

