Executive Summary
From March through April, Claude users reported widespread performance degradation. On April 23, Anthropic published a postmortem. The model weights had not changed. Three modifications to the harness — the system prompt, caching logic, and reasoning effort setting — had been deployed at different times, to different traffic slices. Individually, each was minor. Together, they produced what looked like "broad, inconsistent degradation."
This incident demonstrates that an AI model's performance is not determined by its weights alone. A single line in a system prompt, a caching policy, a default reasoning depth — any one of these can fundamentally alter the user experience.
A harness is not a model's clothing. It is the model's sensory apparatus. When the senses break, a perfectly healthy brain cannot perceive the world. And it was users, not internal testing, who noticed.
Here are the key numbers from this incident at a glance.

- Duration of issues: ~7 weeks (Mar 4 to Apr 20)
- Simultaneous harness changes: 3
- Coding quality drop from the verbosity limit: 3%
- API users affected (problems were confined to the product layer): 0
They Turned Down the Thinking
On March 4, Claude Code's default reasoning effort was lowered from high to medium. The goal was to reduce UI latency. Users noticed almost immediately.
"Claude got dumber."
Medium means thinking less. Responses come faster, but accuracy drops on complex problems. In the tradeoff between speed and intelligence, Anthropic chose speed. Users wanted intelligence.
It was reverted on April 7. Currently, Opus 4.7 defaults to xhigh, and other models default to high.
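The tradeoff can be sketched in a few lines. This is an illustrative sketch only: the parameter names, budget numbers, and `build_request` helper are hypothetical, not Anthropic's actual harness code.

```python
# Hypothetical illustration of a reasoning-effort setting in a model harness.
# Higher effort buys more reasoning tokens (slower, smarter); lower effort
# answers faster but shallower. All names and numbers here are made up.

EFFORT_LEVELS = ["low", "medium", "high", "xhigh"]

APPROX_REASONING_BUDGET = {  # illustrative token budgets per effort level
    "low": 1_000,
    "medium": 8_000,
    "high": 32_000,
    "xhigh": 64_000,
}

def build_request(prompt: str, effort: str = "high") -> dict:
    """Assemble a model request with an explicit reasoning-effort setting."""
    if effort not in EFFORT_LEVELS:
        raise ValueError(f"unknown effort level: {effort!r}")
    return {
        "prompt": prompt,
        "reasoning_effort": effort,
        "reasoning_budget_tokens": APPROX_REASONING_BUDGET[effort],
    }

# The March 4 change amounted to flipping one default:
fast = build_request("refactor this module", effort="medium")  # lower latency
smart = build_request("refactor this module", effort="high")   # deeper reasoning
```

In this sketch, nothing about the model changes between `fast` and `smart`; one default in the harness decides how much the model is allowed to think.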
Anthropic's own assessment: "This was the wrong tradeoff."
The Memory-Erasing Bug
On March 26, a cache-optimization deployment introduced a bug: a cleanup routine meant to run once began running on every conversation turn. The result was amnesia and repetition. Claude would repeat itself and forget context from just moments before.
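The failure mode can be shown in a minimal sketch. The class and method names are hypothetical; this is not Anthropic's actual caching code, just the shape of a run-once routine accidentally running every turn.

```python
# Minimal sketch of a once-only cleanup that fires on every turn,
# wiping context before each new message lands. Names are hypothetical.

class ConversationCache:
    def __init__(self):
        self.turns: list[str] = []
        self._cleaned = False

    def cleanup(self):
        """Intended to run once, e.g. at session start, to drop stale state."""
        self.turns.clear()

    def on_turn_buggy(self, message: str) -> list[str]:
        # BUG: the cleanup runs on *every* turn, erasing all prior context.
        self.cleanup()
        self.turns.append(message)
        return self.turns

    def on_turn_fixed(self, message: str) -> list[str]:
        # FIX: guard so cleanup happens exactly once per session.
        if not self._cleaned:
            self.cleanup()
            self._cleaned = True
        self.turns.append(message)
        return self.turns
```

On the buggy path the cache only ever holds the latest message, which is exactly what users experienced: repetition and forgotten context.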
This bug passed code review, unit tests, end-to-end tests, automated validation, and internal dogfooding — every gate in the pipeline. It was finally caught by an Opus 4.7 code review. The 4.6 model missed it entirely. A smarter model caught a subtler bug.
It was fixed on April 10.
"Keep It Under 25 Words"
On April 16, a system prompt change went live. The new instruction: "Keep responses between tool calls under 25 words, and final responses under 100 words."
The intent was to trim unnecessary chatter between tool calls. It passed weeks of internal testing. But once deployed at scale, coding quality dropped by 3%.
Telling a model to speak less made it think less. Compressing the output compressed the reasoning.
It was reverted on April 20.
Three Shadows, One Darkness
The three changes were deployed on different dates, to different traffic percentages. Each was a small issue on its own. But their combined effect manifested as "inconsistent degradation" — the most confusing kind of outage. Some users noticed problems. Others were fine. Pinpointing the cause was nearly impossible.
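Why staggered partial rollouts produce this pattern can be simulated in a few lines. The rollout fractions below are invented for illustration; the source does not state the actual traffic percentages.

```python
# Hypothetical simulation: three changes, each rolled out to a different
# fraction of traffic. The fractions are illustrative, not Anthropic's.
import random

ROLLOUTS = {  # change name -> fraction of traffic exposed
    "reasoning_effort_medium": 0.5,
    "cache_cleanup_bug": 0.3,
    "25_word_limit": 0.25,
}

def issues_for_user(rng: random.Random) -> set[str]:
    """Which of the overlapping changes a given user happens to land in."""
    return {name for name, frac in ROLLOUTS.items() if rng.random() < frac}

rng = random.Random(0)
exposure_counts = [len(issues_for_user(rng)) for _ in range(10_000)]
unaffected = exposure_counts.count(0) / len(exposure_counts)
all_three = exposure_counts.count(3) / len(exposure_counts)
# Some users hit no change at all, a few hit all three at once --
# exactly the "inconsistent degradation" that makes root-causing so hard.
```

Under these made-up fractions, a large share of users sees no regression while a small minority sees every problem compounded, so user reports look contradictory even though every report is accurate.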
The API was unaffected. The problems lived entirely in the product layer — the harness, the system prompt, the caching logic. The same model that worked perfectly via API was misbehaving in the product.
Below is the full timeline of events.
| Date | Event | Impact |
|---|---|---|
| Mar 4 | Reasoning effort high → medium | Claude Code intelligence drop |
| Mar 26 | Cache cleanup bug deployed | Amnesia, repetition |
| Apr 7 | Reasoning effort reverted | Partial recovery |
| Apr 10 | Cache bug fixed | Memory restored |
| Apr 16 | 25-word limit deployed | 3% coding quality drop |
| Apr 20 | 25-word limit reverted | Full recovery |
| Apr 23 | Postmortem published | Usage limits reset |
What This Incident Reveals
Anthropic announced concrete remediation. They reset usage limits for affected users and launched @ClaudeDevs, a channel for real-time product change announcements. They committed to dogfooding public builds internally and tightening controls around system prompt changes.
But the deeper lesson is not about organizational response. It is about structural truth.
A harness is not a model's clothing. It is the model's sensory apparatus. When the senses break, a perfectly healthy brain cannot perceive the world.
The system prompt determines how a model perceives the world. The cache determines how it retains memory. The reasoning effort setting determines how deeply it thinks. All three sit outside the model weights, yet the user experience is defined by them.
And it was users who discovered the problem. Internal testing, automated validation, code review — none of these caught the combined effect of three simultaneous changes. It was everyday practitioners, running real workflows, who detected the subtle degradation. The people who use the model know the model better than the people who build it. An old truth, repeated once more.
References
- Anthropic Engineering — April 23 Postmortem — Official postmortem on Claude performance degradation causes and response
- VentureBeat — Mystery solved: Anthropic reveals changes to Claude's harnesses — Coverage of harness changes as the root cause of degradation
- The Register — Anthropic says it has fixed the issues — Coverage of fixes and preventive measures