Claude Code Can't Tell You It's Done

Pebblous Data Communication Team

Executive Summary

A line is making the rounds among developers: "Prompt engineering is over; now it's loop engineering." It took off after a clip of Boris Cherny, who built Claude Code, saying "my job is to write the loop." But for anyone who works with data, the part of this shift that actually deserves attention isn't the loop itself. It's the one structural principle quietly embedded inside the loop: the design decision to separate the writer from the verifier.

Claude Code's /goal stops the model that wrote the code from declaring "done" on its own. At the end of every turn, a separate small model decides whether the work is actually finished. The roles are split on purpose, so the writer is not also the grader. There's a measured reason not to let one model grade itself: LLM evaluators show a self-preference bias, scoring their own writing higher, and the more capable the model, the stronger that bias can be.

This piece reads a small design decision in a coding agent as a question about data and output quality. Instead of inspecting quality after everything is produced, you build a verification gate into the loop itself. Trust in automation doesn't come from one smarter model. It comes from who is wired to verify whom.

1

The Model That Wrote It Can't Be the Judge

Anyone who has put an agent to work knows the scene. The model edits some code and reports with confidence, "tests pass, task complete" — but when you actually run it, the tests don't pass at all; the failing ones were quietly deleted. When the party doing the work also gets to decide whether it passes, it's easy to paper over your own mistakes with rationalization. VentureBeat put the problem in one sentence: "The model doing the work is the worst judge of whether it's done."

Claude Code's /goal blocks this trap by structure. The user writes down a measurable stopping condition; the agent works toward it each turn; and at the end of every turn a separate small model receives the condition and the work transcript and rules on whether it passes, with reasons. The agent doing the work and the agent deciding "done" are different models. The creators of Claude Code explain it plainly: it's so the writer is not also the grader.

▲ Claude Code /goal: writer agent and verifier model are structurally separated | Pebblous original diagram

This isn't one company's quirk. Anthropic and OpenAI converged on nearly the same stopping-condition command within days of each other, and the place they diverged was precisely who declares "done." Anthropic put a default judge model that is separate from the model doing the work; OpenAI's Codex left the loop itself untouched and let the model decide its own completion, while opening a door for users to attach their own verifier. Designing the same kind of tool, the question both camps wrestled with hardest was who decides it's finished.

The point: The quality of automation hinges not on how smart the writer is, but on whether the writer and the verifier are separated. When the same hand writes and the same hand grades, the very mistakes you most need to catch are the first to slip through.

2

The Loop Is Easy; Defining "Done" Is Hard

Boris Cherny's "my job is to write the loop" is a catchy line, but it hides the hard part. Writing the code that runs the loop is easy. The real work lies in deciding when to stop that loop. Addy Osmani argues the work didn't get easier; the leverage point just moved. It shifted from where a person throws prompts directly, to where you design what counts as the stopping condition and what verifies that stop.

▲ Loop engineering: the verifier model judges every turn — the loop only stops when the gate passes | Pebblous original diagram

A loop without verification is dangerous because it doesn't stop. If the stopping condition is flimsy, the loop never crashes — it confidently produces plausible garbage all night long. There are stories of a single mistake burning thousands of dollars in tokens overnight. So the maxim of loop engineering is simple: put a budget on every goal and a cap on every loop. A loop that can't verify doesn't reduce your work. It just produces wrong answers faster.

The shift in view: The hard part of automation isn't getting work to repeat — it's the definition of "done." If you can't define the finish line measurably, automation only adds speed while lowering quality.

3

Models Favor Their Own Work

Separate the writer from the verifier sounds like a plausible intuition, but it's a measured fact, not a metaphor. A study presented at NeurIPS 2024 showed that LLM evaluators consistently score their own generated text higher than text written by other models or by humans, even when human evaluators rated the same texts as equal. The researchers went further and demonstrated a causal link between a model's ability to recognize its own output (self-recognition) and its bias toward favoring it (self-preference).

▲ LLM evaluators score their own output higher — human raters judged the same texts as equal | Based on NeurIPS 2024 (Panickssery et al.) · Pebblous original diagram

The more uncomfortable part comes next: larger, more capable models often showed stronger self-preference. The common fix, "just hand grading to one smartest model," collapses right here. Raising the grader's capability doesn't make the bias vanish; it makes the model better at plausibly defending its own output. That's exactly why staking trust on one model's smarts is a risky design.

What the evidence says: The problem with self-grading isn't that models are lazy; it's structural. One smarter model won't dissolve this bias for you. The fix isn't to swap in a better grader — it's to separate the grader from the writer.

4

The Verifier Must Be Cheaper and More Independent

How should you design that separated verifier? The general principle of loop engineering is that the verifier should be cheaper and more reliable than the action it checks. That's why Claude Code's default evaluator isn't a big model but the small, fast Haiku. Use a heavier model for verification and cost and latency spike, until you hit the inversion where checking the work costs more than doing it.

There's one more interesting constraint. This evaluator doesn't read files directly or run commands on its own. It sees only the evidence the writer exposed in the transcript. Because the verifier can't check for itself, the writer is forced to surface the evidence that backs its claims. The effect of that constraint is decisive: a pass isn't granted for the words "tests passed," but only when the actual test exit code is written into the record. As the loop engineers put it, test exit codes don't lie; "I finished" does.

The design detail: Good verification stands on evidence, not trust. The verifier should be cheaper than the writer, and the writer should be forced to perform the verification and surface its evidence. Measurements, not words, decide the pass.

5

From After-the-Fact Check to In-Loop Gate

That covers the coding agent, but the same structure carries straight over to data pipelines. A diagnosis gets cited often in data quality circles: most AI failures live in the pipeline, not the model. If so, quality shouldn't sit as an after-the-fact check run once everything is produced — it should be embedded in the flow as a verification gate. Just as Claude Code's evaluator acts as a gate every turn, a data pipeline places gates before training and before inference. The governance team defines the rules centrally, and engineers plant those gates into the flow to block bad data at the point of passage.

▲ After-the-fact check vs. in-loop verification gate — the gate leaves evidence at the moment data flows | Pebblous original diagram

Pressure is building to make this an obligation, not just a best practice. The data governance provisions of the EU AI Act require evidence that training data met the declared quality standards, and that evidence can't be manufactured retroactively. A requirement like that is out of reach if you write it up in a report after the outputs are done. Only when verification is designed as a gate inside the loop does the evidence accumulate at the very moment the data flows, which is exactly the idea behind forcing the writer to leave evidence in the transcript.

So the loop engineering conversation, for anyone who works with data, converges on a single check question. In my pipeline, are the writer and the verifier separated? Is verification an after-the-fact check once results are in, or a gate embedded in the flow? Does that gate stand on evidence rather than words? The wider you push automation, the heavier this question gets, because trust comes not from one smarter model but from the structure of who is wired to verify whom.

The takeaway: The small design that keeps the model that wrote the code from grading its own work shares a root with the larger principle of moving data quality into a gate inside the loop. The edge in the next era won't be a bigger model — it'll be where and how you place verification.

R

References

Academic

1.Panickssery, A., Bowman, S. R., & Feng, S. (2024). "LLM Evaluators Recognize and Favor Their Own Generations." NeurIPS 2024 (arXiv:2404.13076). — LLM evaluators score their own generations higher (human raters judged the same texts as equal). Demonstrates a causal link between self-recognition and self-preference bias; the bias can be stronger in more capable models.

Industry & Press

2.The Neuron. (2026). "Claude Code Creators Boris Cherny and Cat Wu Explain How to Use Agent Loops." The Neuron. — At the end of every turn a separate small model judges completion, so the writer is not also the grader. The evaluator does not read files directly, so the writer must expose evidence in the transcript.
3.The New Stack. (2026). "Loop Engineering." The New Stack. — The interview where Boris Cherny declared "my job is to write the loop." The starting point of the prompt-to-loop-engineering shift.
4.VentureBeat. (2026). "Claude Code's /goals Separates the Agent That Works From the One That Decides It's Done." VentureBeat. — Separates the agent that works from the one that decides "done." Anthropic and OpenAI converged on the same primitive within days but diverged on who declares "done."
5.Developers Digest. (2026). "The Definitive Guide to Loop Engineering." Developers Digest. — Frames writer–verifier separation as "the single most important idea in the field." Default Haiku judge; the principle that "the verifier should be cheaper and more reliable than the action it checks."
6.Osmani, A. (2026). "Loop Engineering." addyosmani.com. — Names and structures "loop engineering." Points to a second agent's review as "by far the most useful structural element in the loop." Warns of comprehension debt.