Executive Summary
Survey data, gathered by asking people directly, is the bedrock of the social sciences. Yet an analysis published in Nature in June 2026 suggests that up to 45% of those responses may contain text written by AI. Data built to capture human voices is filling, more and more, with answers that are not human. This piece reads that development through the lens of data quality.
The deeper problem is not just respondents quietly copying from chatbots. Researchers have begun standing up AI as synthetic respondents to produce the results they want — so-called "silicon samples" — and one psychologist called the practice "indistinguishable from fraud." The line that once separated a human response from a machine's is blurring.
So data quality has gained a new question. Until now, quality meant accuracy, completeness, and consistency. On top of those now sits authenticity: "was this response really written by a person?" It is also why human-written data is rising in value again.
Key Figures
Source: David Adam, Nature Vol. 654 (2026)
The four numbers below are different cross-sections of one current. They mark the entrance where responses are collected (up to 45% contamination), the exit where knowledge accumulates (roughly a third AI-written abstracts, a 42% surge in submissions), and the speed connecting the two (one paper in an hour). Together they show machines intruding on both ends of a circuit that once began with human opinion and closed with human review.
Up to 45%
Responses mixed with AI
Share of responses suspected of AI-generated text in one survey analysis
~1/3
AI-written abstracts
Submitted abstracts at one journal mostly AI-written, as of Feb. 2026
42%
Surge in submissions
Rise in paper submissions at one journal after ChatGPT's release
1 hour
One full paper
Time to finish a 28-page paper from real survey data
You Ask People, AI Answers
Psychologist Raluca Rilla was reading through survey responses when an awkward sentence stopped her. Asked about their own feelings, one respondent had written, "I don't experience confusion in the same way humans do." It was a tell-tale sign that the person had put the question to a chatbot and pasted the answer back in. The contamination she estimated reaches up to 45% of responses.
Why does this happen? On crowdsourcing platforms like Amazon Mechanical Turk or Prolific, every response carries a small payment, and the faster and more you answer, the more you earn. For a respondent, feeding the question to an AI and copying the answer becomes the rational move. The people collecting the data believe they have bought human thinking, but what arrives is, increasingly, a machine's sentences.
In one line: When you gather data by asking people, more and more of what comes back is not from people. This is where the authenticity of a response starts to waver.
The Deeper Problem: Synthetic Respondents
A respondent secretly using AI is still unintentional contamination. The thornier case is the researcher who deliberately casts AI as the respondent. Psychologist Malte Elson warns that a researcher can assign an AI model parameters like age, education, and political leaning and conjure an entire group of respondents to order. These are "silicon samples" — synthetic, AI-generated respondents. It is generating data rather than collecting it, and because the parameters can be tuned until the desired result appears, Elson described the practice as "indistinguishable from fraud."
The pace of producing knowledge has accelerated too. Political scientist David Lazer demonstrated feeding real survey data into an AI to complete a 28-page paper in a single hour. At one business journal, submissions rose 42% after ChatGPT appeared, and as of February 2026 roughly a third of submitted abstracts were largely AI-written. Into a flow of knowledge that once began with human responses and accrued through human review, machines are now breaking in at both ends at once.
Psychologist Björn Hommel said of this moment that we are "approaching a point where trust in the behavioral and social sciences collapses." When the data is contaminated, the conclusions built on it — and the willingness to believe those conclusions — waver along with it.
A New Axis of Quality: Authenticity
For a long time, the criteria for talking about data quality were clear. Is the value accurate? Is anything missing? Is it internally consistent? Catching label errors, bias, and missing values sat at the center of quality control. Every one of those questions asked the same thing: is this value correct?
This episode adds a question of a different grain: who, or what, produced this value? Even for the same response, one written by a person and one written by a machine carry different worth. When machine sentences mix into data meant to measure human opinion, the value can look perfectly accurate and consistent and still no longer be what you set out to measure. Authenticity is a new axis of quality, standing alongside accuracy, completeness, and consistency.
And this is not the social sciences' problem alone. Consumer reviews, user feedback, healthcare surveys, market research — any data gathered on the premise of a human voice faces the same risk. So responses actually written by real people, that is, human-origin data, grow scarcer, and as they grow scarcer their value climbs. In an age when anyone can churn out plausible sentences endlessly, the traces a person leaves by hand become a rare resource.
A shift in perspective: The question of quality has widened by one step, from "is the value correct?" to "was this data made by a person?" The moment authenticity enters as a measurable quality metric, the value of human-origin data rises again.
Telling People from Machines
Fortunately, no one is standing still. Researchers are already building a few lines of defense. The most direct is the honeypot question — an item that a person naturally ignores but an AI follows literally, planted in the survey to filter out the responses that take the bait. The aim is to verify authenticity at the moment of collection, not after the responses are in.
There is also a way to look not only at what a response says but at how it was made. Behavioral metadata — how long an answer took, the pattern of typing versus pasting — offers clues that separate a human response from a copied machine one. Go a step further, and provenance itself, the record of where the data came from and through which channels and methods it was collected, rises to become a quality metric. Data whose origin is unknown is now hard to trust.
In short, the checklist for inspecting data has gained one item. Beside accuracy and completeness, "was this data made by a person?" has begun to stand. Authenticity is no longer something to guess at by feel; it is a quality metric to be measured through honeypot questions, behavioral signals, and provenance records. Confirming that data gathered by asking people really belongs to people is becoming a new fundamental for everyone who works with data.
Closing: What AI's contamination of surveys revealed was an empty seat in data quality. Authenticity moves into the place that accuracy and consistency alone cannot fill. The first step in protecting the value of human data is to measure, first of all, whether it is human.
References
- 1.Adam, D. (2026). "Will AI ruin the social sciences — or revolutionize them?." Nature, Vol. 654, pp. 22–24. — An article on the analysis that up to 45% of survey responses may contain AI-generated text, the risk of researchers creating AI synthetic respondents (silicon samples), and countermeasures such as honeypots. The figures and quotations in this piece (Rilla, Elson, Lazer, Hommel) are drawn from this article.