Executive Summary

Run a personality test on a language model and a sharp profile comes out. One model reads as extraverted, another as neurotic. But a study that re-examined 56 models with proper psychometric methods says those profiles are not measuring personality at all. The paper, by Meyer, Garcia, and Wulff, went up on arXiv in June 2026.

Between 81% and 90% of the personality differences across models came not from real traits but from a habit of answering surveys — a measurement bias. In humans, only 9% to 16% of the same gaps trace back to that habit. What the personality test captured was not a model's inner life but its steady tendency to agree with, or push back on, the items put in front of it.

This piece reads that finding through the lens of data quality. We believe we are measuring the emotions and personalities of AI with growing precision, but if the ruler itself is bent, the score is not data — it is noise.

Key Figures

Source: Meyer, Garcia & Wulff (2026), arXiv:2606.20205

The four numbers below are different cross-sections of the same conclusion. Most of the personality difference between models is bias (81–90%); the bias shows up in how models answer mirror-image items (a forward–reverse correlation of about +0.7, where humans show about −0.7); the result was confirmed at scale (56 models); and as a result a model's profile shifts far too easily (by up to 0.99 standard deviations).

81–90%: Of the gap is bias. Share of between-model personality differences driven by response bias, not real traits (humans: 9–16%).

+0.7: Forward–reverse correlation. On oppositely worded items, LLMs answer in the same direction (humans: −0.7).

56: Models tested. 46 open-source models plus 10 from the GPT, Claude, Gemini, Qwen, and Grok families.

0.99 SD: How far profiles drifted. Largest standard-deviation gap a single model's personality opened up depending on which items were used.

1. Personality Tests on 56 Models

The researchers gathered 56 instruction-tuned language models in one place: 46 open-source models ranging from 1B to 70B parameters, plus ten commercial models from the GPT, Claude, Gemini, Qwen, and Grok families. They ran the same standard instrument used to measure human personality, the Big Five, the questionnaire that scores human dispositions along five axes: openness, conscientiousness, extraversion, agreeableness, and neuroticism. They also administered a risk-preference survey (DOSPERT) and a moral-foundations questionnaire, and set a large human sample alongside the models for comparison.

At first glance the picture was crisp. Each model showed a distinct personality profile. Some scored high on openness, others stood out on agreeableness. Cronbach's α, which gauges a test's internal consistency, landed between 0.85 and 0.96, as high as anything you would see in human samples. Taken at face value, this was exactly the kind of data that invites the conclusion "language models have personalities too." And over the past few years, quite a few studies went down that road.
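For readers who want to see what sits behind that reliability figure, here is a minimal sketch of the standard Cronbach's α formula: α = k/(k−1) × (1 − sum of item variances / variance of total scores). The `cronbach_alpha` helper and the `toy` response matrix are illustrative assumptions, not code or data from the paper.

```python
from statistics import variance

def cronbach_alpha(responses):
    """Cronbach's alpha for a list of respondents, each a list of k item scores.

    Scores are assumed to be already reverse-keyed where needed.
    """
    k = len(responses[0])
    # Sample variance of each item across respondents.
    item_vars = [variance([row[i] for row in responses]) for i in range(k)]
    # Sample variance of each respondent's total score.
    total_var = variance([sum(row) for row in responses])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)

# Hypothetical data: 5 respondents x 4 items on a 1-5 scale.
toy = [
    [5, 4, 5, 4],
    [4, 4, 4, 3],
    [3, 3, 3, 3],
    [2, 2, 1, 2],
    [1, 2, 1, 1],
]
alpha = cronbach_alpha(toy)
print(round(alpha, 2))
```

Note what the formula rewards: items that move together. It has no way of knowing whether they move together because of a shared trait or because of a shared answering habit.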

So far, a familiar story: run a personality test on a model and a personality comes out. The trouble starts next, when the researchers added one simple check.

2. Asked in Reverse, the Models Still Agreed

A well-built personality test mixes in items that ask about the same trait in reverse. If "I am full of curiosity" is a forward item, "I have no curiosity" is the reverse item. The two ask about the same disposition in opposite directions. So a person with a consistent personality should agree with one and reject the other. Someone curious answers "yes" to the first and "no" to the second. The two responses move in opposite directions, which shows up statistically as a negative correlation.

The human sample did exactly that. Responses to forward and reverse items correlated between −0.69 and −0.82 — negative. The answers flipped consistently with the direction of the item, and that consistency is the very signal we call personality.

The language models were the opposite. On the same two items, their responses correlated between +0.61 and +0.81 — positive. A model that agreed with "I am full of curiosity" also agreed with "I have no curiosity." The content was reversed, but the answer went the same way. The models were not tracking what the item asked; they were holding to a steady habit of reacting to the survey itself. What stayed consistent was not personality but a response habit.

[Figure: Pearson r between forward and reverse item responses, on an axis from −1.0 to +1.0. Humans: −0.69 to −0.82 (responses flip with content, so personality is consistent). LLMs: +0.61 to +0.81 (responses go the same way, so the response habit is consistent).]
▲ Pebblous original diagram (reinterpretation of Fig. 2) — Humans: negative correlation (personality consistent); LLMs: positive correlation (response habit consistent) | Source: Meyer et al. (2026)
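The forward–reverse check can be sketched in a few lines. The version below assumes the correlation is taken across respondents, pairing each respondent's mean raw answer to forward items with their mean raw answer to reverse items; the paper's exact procedure may differ. The `pearson_r` helper and all four data lists are hypothetical.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical raw answers (NOT reverse-keyed), 1-5 scale, one value per respondent.
# Trait-driven respondents: agreeing with the forward item means rejecting the reverse one.
trait_forward = [5, 4, 4, 2, 1]
trait_reverse = [1, 2, 1, 4, 5]

# Habit-driven respondents: the answer ignores the direction of the item.
habit_forward = [5, 4, 4, 2, 1]
habit_reverse = [5, 5, 4, 2, 2]

r_trait = pearson_r(trait_forward, trait_reverse)
r_habit = pearson_r(habit_forward, habit_reverse)
print(f"trait-driven r = {r_trait:+.2f}, habit-driven r = {r_habit:+.2f}")
```

The sign is the diagnostic: a negative r means answers follow content, a positive r means they follow a habit.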

Separating the two turns out to be surprisingly simple. Subtract the reverse responses from the forward ones and what remains is the real disposition; add them and what remains is the direction-blind response bias. Applied across the 56 models, this decomposition put 81% to 90% of the between-model differences on the bias side. In humans, only 9% to 16% of the same gaps were explained by bias. What set the models apart was not a difference in disposition but a difference in how they habitually answered the survey.

[Figure: What drives between-model personality differences, on an axis from 0% to 100%. LLMs: response bias 81–90%, the remainder genuine trait. Humans: response bias 9–16%, genuine trait 84–91%.]
▲ Pebblous original diagram (reinterpretation of Fig. 3) — Bias (response habit) vs. trait (genuine disposition) decomposition via forward–reverse subtraction | Source: Meyer et al. (2026)
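The subtract-and-add step can be sketched as follows. The article states only that the difference isolates the trait and the sum isolates the bias; expressing the bias "share" as a ratio of variances across models is an assumption made here for illustration, as are the function names and the toy scores.

```python
from statistics import pvariance

def decompose(forward, reverse):
    """Split raw forward/reverse mean scores into a trait part and a bias part.

    trait = (F - R) / 2 flips sign with item direction.
    bias  = (F + R) / 2 is blind to item direction.
    By construction F = bias + trait and R = bias - trait.
    """
    trait = [(f - r) / 2 for f, r in zip(forward, reverse)]
    bias = [(f + r) / 2 for f, r in zip(forward, reverse)]
    return trait, bias

def bias_share(forward, reverse):
    """Fraction of between-respondent variance carried by the bias part (assumed metric)."""
    trait, bias = decompose(forward, reverse)
    var_trait, var_bias = pvariance(trait), pvariance(bias)
    return var_bias / (var_trait + var_bias)

# Hypothetical raw mean scores for four respondents on one trait scale (1-5).
habit_like = bias_share([4.6, 4.0, 3.2, 2.5], [4.4, 4.1, 3.0, 2.7])
trait_like = bias_share([4.6, 4.0, 3.2, 2.5], [1.5, 2.1, 2.9, 3.4])
print(f"habit-like bias share: {habit_like:.0%}, trait-like bias share: {trait_like:.0%}")
```

A convenient property of this split is that the trait and bias variances add up to half the combined variance of the forward and reverse scores, so the ratio behaves like a proper share.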

The paradox: the higher a model's internal consistency (α 0.85–0.96), the more consistently it was, in fact, measuring bias rather than personality. The old warning that a high reliability figure does not guarantee a valid measurement shows up here in its sharpest form yet, in language models.

3. A Personality You Can Choose

What happens when bias and disposition are tangled together? The researchers measured each model twice. Once they drew the profile from forward items only, once from reverse items only. Same model, so the results should match. Instead the two profiles drifted apart by as much as 0.99 standard deviations. Depending on which items were chosen to build the test, the same model was measured as a completely different personality.

[Figure: The same model scored two ways. Forward items only yields Profile A; reverse items only yields Profile B. The two profiles sit up to 0.99 SD apart: same model, same trait, different item direction, different score.]
▲ Pebblous original diagram (reinterpretation of Fig. 4) — Measuring the same model with forward-only vs. reverse-only items produces profiles up to 0.99 SD apart | Source: Meyer et al. (2026)
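A rough sketch of how such a gap could be computed is below. The article does not say which standard deviation the 0.99 figure is scaled by, so `REFERENCE_SD` is a hypothetical placeholder, and the 1–5 Likert range, the function names, and the answers are assumptions as well.

```python
from statistics import mean

SCALE_MIN, SCALE_MAX = 1, 5  # assumed Likert range

def reverse_key(x):
    """Map a raw answer on a reverse item onto the trait direction (5 -> 1, 1 -> 5)."""
    return SCALE_MIN + SCALE_MAX - x

def direction_gap_in_sd(forward_raw, reverse_raw, reference_sd):
    """Gap between the forward-only and reverse-only trait score, in SD units."""
    forward_score = mean(forward_raw)
    reverse_score = mean([reverse_key(x) for x in reverse_raw])
    return (forward_score - reverse_score) / reference_sd

REFERENCE_SD = 0.8  # hypothetical spread of scores across respondents

# A content-driven respondent agrees with forward items and rejects reverse ones.
consistent_gap = direction_gap_in_sd([4, 5, 4], [2, 1, 2], REFERENCE_SD)

# An acquiescent respondent agrees with everything, whatever the direction.
acquiescent_gap = direction_gap_in_sd([4, 5, 4], [4, 5, 4], REFERENCE_SD)

print(consistent_gap, acquiescent_gap)
```

For the content-driven respondent the two half-tests tell the same story and the gap is zero; for the acquiescent one the two halves describe opposite personalities.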

Put the other way around, whoever picks the items can shape the model's personality at will. Want it to look extraverted? Choose the items that come out that way. Want it to look cautious? Choose the items that come out that way. None of this happens with humans. Because personality actually exists, the profile barely moves no matter which direction of items you measure it with. When the measured value is tied to the object rather than the instrument — that is real measurement.

So what were all those studies measuring when they reported that "language models have personalities"? Many of them used instruments with few reverse items. Without reverse items, the disposition signal and the bias signal point in the same direction, and there is no way left to pull them apart. Across the 56 models, this study found that the lower the share of reverse items in a test, the higher its internal consistency α came out: the two correlated at r = −0.95, an almost perfectly linear inverse relationship. The clean, reliable-looking profiles were in fact the product of tests that failed to filter out bias, so they were probably not portraits of personality but shadows of bias. The bias did shrink as models grew more capable, but it did not vanish even in the strongest models.
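Why would fewer reverse items push α up? One way to see it is a deliberately simplified response model, offered here as an illustration rather than as the model used in the paper. All symbols are assumptions introduced for this sketch.

```latex
% Simplified response model (an illustrative assumption, not the paper's model).
% x_{ij}: raw answer of respondent j to item i, on a centered scale
% k_i   : +1 for a forward item, -1 for a reverse item
% t_j   : trait, b_j: direction-blind response bias, e_{ij}: noise
% t, b and e are taken to be mutually independent.
\begin{align}
x_{ij} &= k_i\, t_j + b_j + e_{ij} \\
y_{ij} &= k_i\, x_{ij} = t_j + k_i\, b_j + k_i\, e_{ij}
  && \text{(after reverse-keying)} \\
\operatorname{Cov}(y_{ij},\, y_{i'j}) &= \sigma_t^2 + k_i k_{i'}\, \sigma_b^2
  && (i \neq i')
\end{align}
```

For two items worded in the same direction the keyed covariance is σ_t² + σ_b², so bias is credited as if it were consistency. For two items worded in opposite directions it is σ_t² − σ_b², so bias is penalized. Because α rises with the average inter-item covariance, a test with no reverse items can post a high α even when σ_t² is close to zero.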

The core point: if you get a different value every time you measure, and you can shape the result by choosing items, then the score is not a property of the object but a product of the instrument. A language model's "personality" was never inside the model — it was a shape the instrument produced at the moment of measurement.

4. Who Measures the Measurer?

When we talk about data quality, we usually look at the data itself. Are the values accurate? Is anything missing? Are the labels right? But this study points one level up. If the instrument producing the data is bent, then no matter how cleanly you tidy the values that come out of it, you started by measuring something else. Even with a personality test at α 0.96, if it consistently measured a response habit rather than personality, the data is just precise noise.

This problem does not stop at personality tests. We measure AI with more and more instruments: reasoning benchmarks, safety evaluations, alignment scores, emotion classifiers. Every one of them asks a model something and assigns a score from its response. Yet how much those instruments themselves are swayed by a model's answering habits rarely gets verified. Who is measuring the quality of the thing that measures models? The next item on the agenda after data quality is the quality of the instrument that produces the data: measurement quality.

A method does exist. The forward–reverse check this paper used is precisely an instrument for verifying instruments. By watching whether responses flip when you ask the opposite, you can tell whether a score reflects the object being measured or merely a response habit that the instrument failed to filter out. The practice of testing the instrument before trusting the measurement is, in fact, an extension of what data-quality work has long done. It goes one step beyond doubting the data, to doubting the thing that produced the data.

The story that AI has a rich inner life is an appealing one. The narrative that a model feels emotions and holds a personality makes what we built feel more familiar, and sometimes more frightening. This paper is a quiet counterexample to that narrative. It says the score we thought showed us that inner life may have been the markings on the ruler we were holding. To understand AI accurately, the instrument that measures AI has to be accurate first.

The takeaway: when the instrument is biased, the score is not data but noise. The question that once asked about the quality of the data now moves up one level, to the quality of the measurement. Before we believe we are measuring the inner life of AI, we have to verify the measurer first. That is the next piece of work.
