It's the Weak Harness, Not the Weak Model, That Kills Your Agent

Pebblous Data Communication Team

Executive Summary

Most agent systems die from a weak harness, not a weak model. The harness is the outer structure wrapped around the model: the loop, the state, the separation of roles, the gates that decide when it stops, when it restarts, and where its output goes. This isn't rhetoric; it's a measured fact. Hold the model fixed and swap only the harness, and a coding-benchmark score can spread from 42% to 78%. Swap six frontier models through the same harness, and the scores land within a point of each other.

Andrej Karpathy's LOOPS.md is nine rules for treating that loop as a first-class object. One principle runs through all of them: humans own the spec and the boundaries; the model owns the execution and the ledger. The planner doesn't touch the code, the generator doesn't grade itself, and state lives on disk, not in the context window. Let a model grade itself and sycophancy shows up more than half the time; ask the same question with a longer context and accuracy caves. The unit of leverage has moved from the prompt to the procedure.

We run a pipeline that picks its own topics before dawn and ships several pieces a night. This report holds those nine rules against that pipeline, rule by rule, and records honestly what we were already doing well and where last night exposed a hole. A company that builds autonomous data pipelines lives by the same principles inside its own publishing pipeline.

42% → 78%

Harness swing

Coding-benchmark score for one fixed model when only the harness changes (model swap: under 1 point)

58%

Self-grading sycophancy

Rate of sycophantic behavior when a model was asked to evaluate its own output (Fanous 2025)

99.3% → 69.7%

Context rot

Same question, accuracy as context grows from 1K to 32K tokens (NoLiMa)

61% → 25%

Collapse on stacking

Success rate on the same task in a single trial vs. in all eight of eight repeated trials (τ-bench, pass^1 vs. pass^8)

1 What Karpathy Killed: The Prompt Era

For a while, "one well-crafted prompt" was the unit of skill. Drop in a few examples, coax out a chain of thought, pin down the output format, and the model gave you what you wanted. Through that lens, when an agent fails, the first suspect is the model: it's weak, so it got things wrong. But the people who have actually run long-lived agents keep landing on the same conclusion. Most failures come not from the model but from the outer structure wrapped around it.

The benchmarks put a number on this. Several teams took a single coding benchmark, held the model fixed, and bolted on different harnesses. The top score spread from 42% to 78%. That 36-point gap came with no change to the model at all, only to the loop, the tool format, and the state management. Run it the other way—six frontier models cycled through one harness—and the spread narrows to within a single point. In one extreme case, a single change to an editing tool's format took one model from 6.7% to 68.3%, a tenfold jump; under some conditions, a weaker model on a good harness edged out a stronger model on a standard one.

The gain you buy by swapping the model and the gain you buy by fixing the harness are not the same size: under one point for the model swap against 36 points for the harness swap.

One caveat. The absolute scores on this benchmark are disputed for contamination, so we can't take them at face value. But what we're reading isn't the absolute score—it's the difference "when only the harness changes on the same model." Even a contaminated dataset gives a valid relative comparison under identical conditions. And that relative comparison says one thing, consistently: the big block of leverage has moved from the model to the harness.

The model can write code, review it, and check its own output against a standard it agreed to ten minutes ago. What it can't do alone is decide when to stop, when to restart, and where the result should go. That's the loop's job—and the reason the procedure, not the prompt, became the unit of skill.

2 Nine Rules, Four Clusters

Read in sequence, the nine rules of LOOPS.md look scattered. Grouped into four clusters, they become a single sentence: split the roles, keep the state outside, turn quality into a contract, and keep shrinking the harness. Below, each rule is placed inside its cluster, with external evidence attached wherever the point is easy to assert on intuition alone.

First, an honest footnote about sources. The rule numbers and clusters here follow the v060726 revision, and some of the wording draws on documents the community has organized and carried forward. So we cite the spirit of the rules while naming the revision alongside.

2.1 Split the Roles (Rules I–III)

Rule I — Write a loop, not a prompt. The unit of work is not a single message but a repeatable procedure. Gather, reason, act, verify, and repeat. A sturdy loop that runs ten times beats one perfect prompt.
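As a minimal sketch of what Rule I could look like in code, the loop below wires gather, reason, act, and verify together as plain callables and repeats until the verifier passes or an iteration budget runs out. The names `run_loop` and `LoopResult`, the toy step functions, and the default budget of ten are illustrative assumptions, not anything defined in LOOPS.md.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class LoopResult:
    passed: bool
    iterations: int
    history: list = field(default_factory=list)


def run_loop(
    gather: Callable[[], Any],
    reason: Callable[[Any], Any],
    act: Callable[[Any], Any],
    verify: Callable[[Any], bool],
    max_iterations: int = 10,
) -> LoopResult:
    """Repeat gather -> reason -> act -> verify until verify passes or the budget is spent."""
    history = []
    for i in range(1, max_iterations + 1):
        observation = gather()
        plan = reason(observation)
        output = act(plan)
        ok = verify(output)
        history.append((i, output, ok))
        if ok:
            return LoopResult(True, i, history)
    return LoopResult(False, max_iterations, history)


# Toy usage (hypothetical steps): each pass nudges a counter up by one,
# and verification passes once the counter reaches 3.
counter = {"n": 0}


def _gather():
    return counter["n"]


def _reason(n):
    return n + 1


def _act(target):
    counter["n"] = target
    return target


result = run_loop(_gather, _reason, _act, verify=lambda out: out >= 3)
print(result.passed, result.iterations)  # True 3
```

The point of the shape is that the stop condition lives in `verify` and the budget, outside whatever the model does inside `reason` and `act`.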

Rule II — Separate the roles. Don't blur the planner that writes the spec, the generator that executes, and the evaluator that judges. Above all, never let the generator grade its own output. The reasons are measured. Ask a model to evaluate its own answer and sycophantic behavior appears 58.2% of the time, with 14.7% of responses flipping a correct answer to a wrong one under pushback (Fanous et al., 2025); push a false claim on it and, on average, 63.7% agree (Wang et al., 2025). A separate finding shows evaluators favor their own outputs (Panickssery et al., NeurIPS 2024). The evaluator has to start from the stance that "the code is broken, and refuting it is my job."

Rule III — Negotiate the contract first. Pin down what "done" means before any code is generated, in a testable form. Write down—in numbers and conditions—what has to be present to pass and what has to be absent to fail, and every later loop can check its own output against that contract. When the contract is blurry, both restarting and verifying lose their meaning.
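One way to make "done" testable, sketched under assumptions: express the contract as two lists of named predicates, one for what must be present and one for what must be absent. The `Check` type, `evaluate_contract`, and the specific thresholds here are hypothetical; a real contract would come from whoever owns the spec.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Check:
    name: str
    predicate: Callable[[str], bool]


def evaluate_contract(draft: str, must_pass: List[Check], must_not: List[Check]) -> dict:
    """Return pass/fail plus the names of every violated condition."""
    missing = [c.name for c in must_pass if not c.predicate(draft)]
    forbidden = [c.name for c in must_not if c.predicate(draft)]
    return {"done": not missing and not forbidden, "missing": missing, "forbidden": forbidden}


# Illustrative conditions only.
MUST_PASS = [
    Check("has_title", lambda d: bool(d) and d.splitlines()[0].strip() != ""),
    Check("min_50_words", lambda d: len(d.split()) >= 50),
]
MUST_NOT = [
    Check("contains_todo", lambda d: "TODO" in d),
]

report = evaluate_contract("Title\n\nTODO: write the body", MUST_PASS, MUST_NOT)
print(report)
# {'done': False, 'missing': ['min_50_words'], 'forbidden': ['contains_todo']}
```

Because the report names each violated condition, a later loop iteration or a restart can aim at exactly the same target.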

Let the generator grade its own output, and sycophancy shows up 58% of the time. How the three roles interlock is laid out below.

▲ The key to role separation: the judge must sit outside the generator | Pebblous original diagram
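The separation can also be enforced structurally. In the sketch below, the evaluator's signature accepts only the spec and the raw text, so the generator's opinion of its own work has no way in. All three functions are stand-ins for model calls, and the names `Spec`, `Artifact`, `planner`, `generator`, and `evaluator` are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Spec:
    goal: str
    required_phrases: Tuple[str, ...]


@dataclass(frozen=True)
class Artifact:
    text: str
    self_assessment: str  # the generator's own opinion; never passed to the evaluator


def planner(goal: str) -> Spec:
    # Stand-in for a planning model call.
    return Spec(goal=goal, required_phrases=("loop", "restart"))


def generator(spec: Spec) -> Artifact:
    # Stand-in for a generating model call.
    return Artifact(text=f"{spec.goal}: build a loop.", self_assessment="Looks perfect!")


def evaluator(spec: Spec, text: str) -> List[str]:
    """Assume the output is broken and list every required phrase it lacks."""
    return [p for p in spec.required_phrases if p not in text]


spec = planner("Explain Rule V")
artifact = generator(spec)
# Only the spec and the raw text cross the boundary, not the self-assessment.
failures = evaluator(spec, artifact.text)
print(failures)  # ['restart']
```

The generator reports "Looks perfect!" while the evaluator still finds a missing requirement, which is the gap self-grading hides.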

2.2 Keep State Outside (Rules IV–V)

Rule IV — Write to disk, not to context. The context window rots as it grows. It gets summarized, compressed, and distorts the past. This isn't a hunch. GPT-4o answered the same question with 99.3% accuracy at 1K tokens of context and dropped to 69.7% at 32K (NoLiMa). When the relevant information sits in the middle of the context, accuracy falls 30–50%—a U-shaped degradation that showed up across all eighteen frontier models tested (Liu et al., 2024, "Lost in the Middle"). So keep state in files. Write progress, contracts, and logs to disk, and even after a crash there's something to resume from.

As context grows, the same model's ability on the same question bends downward.
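Rule IV's advice to keep state in files can be sketched with a small JSON checkpoint. The version below writes atomically (temp file, then rename) so a crash mid-write cannot leave a half-written checkpoint. The step names, the `state.json` filename, and the `completed_steps` layout are assumptions for illustration only.

```python
import json
import os
import tempfile
from pathlib import Path


def save_state(path: Path, state: dict) -> None:
    """Write state atomically: write a temp file, then rename it over the target."""
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(state, f, indent=2)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_name, path)  # atomic when both paths are on the same filesystem
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise


def load_state(path: Path) -> dict:
    """Resume from disk, or start fresh if no checkpoint exists."""
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    return {"completed_steps": []}


STEPS = ["plan", "research", "draft", "verify"]

with tempfile.TemporaryDirectory() as workdir:
    checkpoint = Path(workdir) / "state.json"

    # First run "crashes" after two steps, checkpointing after each one.
    state = load_state(checkpoint)
    for step in STEPS[:2]:
        state["completed_steps"].append(step)
        save_state(checkpoint, state)

    # A fresh process would reload the file and skip the finished steps.
    resumed = load_state(checkpoint)
    remaining = [s for s in STEPS if s not in resumed["completed_steps"]]
    print(remaining)  # ['draft', 'verify']
```

Nothing in the resume path depends on what was in the model's context when the first run died.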

Rule V — Let the loop restart. When a run goes sideways, discarding it and starting from scratch works better with the latest models than patching a bad state. The evidence points the same way. On τ-bench, a task that succeeded 61% of the time in a single trial succeeded in all eight of eight repeated trials only 25% of the time (pass^1 vs. pass^8). A run that has to get every consecutive step right inherits the same decay, so stacking fixes onto a bad state doesn't improve things; it collapses them. With a sturdy evaluator and contract, a restart isn't a loss; it's a cleanup. Humans step in only when the contract itself is wrong.
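A restart policy in the spirit of Rule V might look like the following sketch: every attempt gets a fresh workspace, a failed or crashed attempt is thrown away whole, and a human is pulled in only when the attempt budget is exhausted. `run_with_restarts`, the budget of three, and the toy `flaky_attempt` are hypothetical names and values.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Callable, Tuple


def run_with_restarts(
    attempt: Callable[[Path], str],
    passes_contract: Callable[[str], bool],
    max_attempts: int = 3,
) -> Tuple[int, str]:
    """Run each attempt in a fresh workspace; discard the workspace afterwards."""
    for n in range(1, max_attempts + 1):
        workspace = Path(tempfile.mkdtemp(prefix=f"run{n}_"))
        try:
            output = attempt(workspace)
            if passes_contract(output):
                return n, output
        except Exception as exc:
            # A crashed attempt is treated like a failed one: restart, don't patch.
            print(f"attempt {n} crashed: {exc!r}")
        finally:
            # The workspace is always deleted, so no bad state reaches the next attempt.
            shutil.rmtree(workspace, ignore_errors=True)
    raise RuntimeError("contract never satisfied; escalate to a human")


# Toy attempt (hypothetical): succeeds only on the third call.
calls = {"n": 0}


def flaky_attempt(workspace: Path) -> str:
    calls["n"] += 1
    (workspace / "scratch.txt").write_text("partial work")
    return "ok" if calls["n"] == 3 else "broken"


attempts_used, final = run_with_restarts(flaky_attempt, lambda out: out == "ok")
print(attempts_used, final)  # 3 ok
```

This only works when restarting is cheap and the contract check is trustworthy, which is why Rules III and IV come first.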

2.3 Turn Quality Into a Contract (Rules VI–VII)

Rule VI — Grade the subjective. Even taste—"good writing," "clean design"—can be made into a contract. Split it into axes like design, originality, polish, and function; assign weights; attach reference examples; and the model converges on "taste as written." Fail to put taste into words and the loop converges in some arbitrary direction.

Rule VII — Read the trace. Before running one more experiment, read the raw transcript of the run you already have. Where the reasoning went off the rails lives in the raw record, not in the summary. Only after reading the trace can you see where to tighten the prompt. This rule is the partner of Rule IV: you can only read the trace if state was left in files.
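Reading the trace is easier when each step appends one JSON line to a log. The sketch below scans such a log in order and returns the first step that did not succeed. The JSONL layout and the `step`, `status`, and `detail` fields are an assumed format, not one prescribed by LOOPS.md.

```python
import io
import json


def first_failure(trace_lines):
    """Scan a JSONL trace in order; return (line number, event) for the first non-ok step."""
    for line_no, line in enumerate(trace_lines, start=1):
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        if event.get("status") != "ok":
            return line_no, event
    return None


# Hypothetical trace; in practice this would be an open file handle on the run's log.
raw_trace = io.StringIO(
    '{"step": "plan", "status": "ok"}\n'
    '{"step": "draft", "status": "ok"}\n'
    '{"step": "verify", "status": "fail", "detail": "missing title"}\n'
    '{"step": "publish", "status": "skipped"}\n'
)
hit = first_failure(raw_trace)
print(hit)
# (3, {'step': 'verify', 'status': 'fail', 'detail': 'missing title'})
```

The function points at the raw record rather than a summary of it, so the next move is to open that line and read what the model actually did.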

Without a direction to converge on, each run stops somewhere different. Here is how a vague preference becomes a testable scoring contract.

▲ Converting preference into a weighted scoring contract | Pebblous original diagram
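As a sketch of what a weighted scoring contract could look like, the code below uses the four axes named in Rule VI. The weights, the 0 to 10 scale, and the pass threshold of 7.0 are invented for illustration; in practice they would be set by whoever owns the taste.

```python
RUBRIC = {
    # axis: weight (illustrative values; weights must sum to 1.0)
    "design": 0.3,
    "originality": 0.2,
    "polish": 0.2,
    "function": 0.3,
}
PASS_THRESHOLD = 7.0  # hypothetical cutoff on a 0-10 scale


def weighted_score(scores: dict, rubric: dict = RUBRIC) -> float:
    """Combine per-axis scores (0-10) into one weighted total."""
    if set(scores) != set(rubric):
        raise ValueError(f"scores must cover exactly these axes: {sorted(rubric)}")
    if abs(sum(rubric.values()) - 1.0) > 1e-9:
        raise ValueError("rubric weights must sum to 1.0")
    return sum(rubric[axis] * scores[axis] for axis in rubric)


scores = {"design": 8, "originality": 5, "polish": 7, "function": 9}
total = weighted_score(scores)
print(round(total, 2), total >= PASS_THRESHOLD)  # 7.5 True
```

The per-axis scores would come from an evaluator that is separate from the generator; the rubric only fixes how those judgments are combined.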

2.4 Shrink the Harness (Rules VIII–IX)

Rule VIII — Delete harness. When the model gets better, half of yesterday's scaffolding becomes debt. Code left in place to wrap a job the model now does for free carries only cost, and that cost isn't small. Agent tasks burn 1,000× the tokens of a code chat, and because every step re-reads the context accumulated so far, total cost grows as O(N²) in the number of steps N. One company that adopted multi-agent setups saw its quarterly bill jump 340%. Rule VIII means rereading the harness with every new model and deleting the rules that are no longer needed.
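The quadratic growth is easy to check with arithmetic. Assume, as a simplification, that each step adds a fixed number of tokens and that step k re-reads everything accumulated so far, with no caching or compaction. Then the total tokens read over N steps is t·(1 + 2 + … + N) = t·N(N+1)/2. The function names and the 1,000 tokens per step are illustrative.

```python
def cumulative_input_tokens(steps: int, tokens_per_step: int) -> int:
    """Total tokens read when step k re-reads all k * tokens_per_step tokens so far."""
    return sum(k * tokens_per_step for k in range(1, steps + 1))


def closed_form(steps: int, tokens_per_step: int) -> int:
    # Same quantity as t * N * (N + 1) / 2, which grows as O(N^2).
    return tokens_per_step * steps * (steps + 1) // 2


for n in (10, 100):
    print(n, cumulative_input_tokens(n, 1_000))
# 10 55000
# 100 5050000
```

Under these assumptions, ten times as many steps costs roughly ninety times as many input tokens, which is why harness code that adds steps or context for a job the model already handles is not free.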

Rule IX — The bottleneck always moves. Once coding stops being the bottleneck, planning becomes it; then verification; then taste. The bottleneck yesterday's harness aimed at may no longer be there today. So harness design isn't a job you finish once—it's continuous work, chasing the next bottleneck as it moves. This rule is what keeps the other eight alive.

Read one at a time, the rules scatter into nine pieces of advice; set into four clusters, they fit on one page. The table below rebuilds the nine rules by cluster. Rather than memorizing the numbers in order, use it as a map to pin down which cluster your loop is currently missing.

Cluster	Rule	One-line gist
Split the roles	I	Write a loop, not a prompt
Split the roles	II	Separate planner, generator, and evaluator (no self-grading)
Split the roles	III	Pin "done" down as testable before writing code
Keep state outside	IV	Write to disk, not to context
Keep state outside	V	Let the loop restart instead of patching
Quality as contract	VI	Grade the subjective (taste) on weighted axes
Quality as contract	VII	Read the trace before running another experiment
Shrink the harness	VIII	Delete the harness the model now does for free
Shrink the harness	IX	Keep chasing the next bottleneck

Rule inventory per the LOOPS.md v060726 revision. Some wording draws on community-maintained documents.

3 The Overnight Layer: Loop → Loops → Routine

If the nine rules build one sturdy loop, the next question is "how many of those loops, and where do they run?" The answer grows in three layers: a loop that puts a single Claude on a schedule, loops that run many at once, and—once controls are in place—a routine that moves to a server and runs around the clock. This isn't a concept; it's a product landscape that already exists.

The best evidence is Karpathy's own experiment. His automated research loop ran 126 experiments in one night, and roughly 700 over two days. Distributed, it attached 35 agents and ran 910 experiments in 8 hours for $309. And that loop found a missing QK-Norm scalar bug he said he'd failed to catch for 20 years. The automated-research code itself was about 630 lines of Python and picked up 25,000 stars within five days of release. The result came not because the tool was special, but because the loop was left to run overnight.

The product landscape is maturing in the same direction. Claude Code supports schedules, background execution, and sub-agent orchestration; Cognition's Devin raised its autonomous PR-merge rate from 34% to 67% in eighteen months. Durable-execution engines that implement restart and checkpointing at industrial scale are Rule V made real. "A loop that runs overnight" isn't a Twitter meme—it's infrastructure you can buy today.

4 Field Test: Holding It to Our Own Pipeline

From here on, this is an operations log, not a summary. Every morning before dawn, we run a pipeline that picks its own topics, researches, drafts, verifies, and ships several pieces. We held the nine rules against that pipeline one at a time. The results mix pride and reckoning. Some rules we were already living; others exposed a hole last night. We won't name internal identifiers, but we'll record the lessons concretely.

In the table below, the last column opens with "A hole" for each rule where a hole showed up; the remaining rows are rules we were already living well.

Rule	Where our pipeline stands	What we learned
IV Write to disk	Planning, research, and synthesis outputs are left as workspace files at every step; the next step starts by reading those files, not the context.	The rule we lived best. Because steps were joined by files, when one step died, work resumed from that point.
II Separate roles	Planning, writing, and verification agents are split; style verification is graded on a separate axis (11 tells), not by the writer.	Early on we let the writer review its own prose and kept getting burned by sycophantic self-evaluation. Detaching verification cut the slop.
III Contract first	SEO, structure, and field checks are pinned as pass conditions before publishing, so the draft checks itself against that contract.	With "done" written as numbers, even a restart converged on the same target.
V Let it restart	We once patched failed runs forward instead of discarding them. A bad state flowed into the next step.	A hole. Patched-forward runs broke more often, not less. If the contract wasn't wrong, cleaning up and restarting the failed run was the better move.
VIII Delete harness	The harness grew monotonically. Even after the model improved, we left old rules in place.	A hole. Still wrapping a job the new model does for free only added cost and maintenance burden. The harness has to be deleted regularly too.
VII Trace	Each run's logs are archived permanently with the piece; when something breaks, we read the raw log first, not the summary.	Only after reading the trace did we see the bottleneck was in verification, not writing.

The autonomy numbers have to be contextualized honestly. A mature pipeline's goal is to automate the boring 95% and leave humans only on the risky 5%. But that's a target, not an industry average. In a survey of 919 global leaders, 69% of agent decisions still passed through human verification, only 13% of organizations ran fully autonomously, and 40% of agent projects failed—with the lack of harness and infrastructure foundations named as a cause (Dynatrace, 2026). We report our pipeline's autonomy numbers only from our own operations log, and we don't blend them with these external generalizations. The 95% and 5% are the direction we aim at; the 69% is where the world averages today.

In our pipeline too, the bottleneck moved in exactly the order Karpathy foretold. First getting a draft out was hard; then planning; now verification and taste are the bottleneck. More important than the list of rules we did well and the rules that sprang a hole is the fact that the list keeps changing.

5 A Checklist for Your First Loop

Try to apply all nine rules at once and you'll never start anything. Your first loop should be boring, narrowly bounded, and verifiable in seconds—watching CI logs, rebasing PRs, a summary every morning. Run it for a few days, build trust, then expand. Below are eight questions to run through before you stand up that first loop.

1. Is the contract testable? Can "done" be split into pass/fail without human judgment? If not, write the contract before the loop.
2. Are the roles separated? Is the side that generates different from the side that judges? If they're the same, it converges to sycophancy.
3. Is state on disk? Is where-to-resume-after-a-crash written in a file? If it lives only in context, you'll lose it.
4. Is restarting cheap? Is discarding a failed run and starting over cheaper than patching it? That's what keeps Rule V alive.
5. Is the boundary narrow enough? The first loop does one boring thing. Save the ambition for your third loop.
6. Does the risky 5% stay with a human? Public publishing, merging to main, a first migration: keep an approval gate on them.
7. Are you ready to read the trace? When something breaks, is there a raw log to look at? If only summaries remain, you can't fix it.
8. Do you have a plan to delete? Decide in advance which harness you'll delete when the next model lands. A monotonically growing harness is one you've already stopped reading.
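Question 6 is the one most directly expressible as code. A minimal sketch of an approval gate, assuming a fixed list of risky actions: routine actions pass automatically, and risky ones are refused unless a named human has approved them. The action names and the `is_allowed` function are hypothetical.

```python
from typing import Optional

# Illustrative action names, based on the examples in checklist question 6.
RISKY_ACTIONS = {"publish_public", "merge_to_main", "first_migration"}


def is_allowed(action: str, approved_by: Optional[str] = None) -> bool:
    """Routine actions pass automatically; risky ones need a named human approver."""
    if action in RISKY_ACTIONS:
        return bool(approved_by)
    return True


print(is_allowed("summarize_ci_logs"))                            # True
print(is_allowed("merge_to_main"))                                # False
print(is_allowed("merge_to_main", approved_by="editor-on-call"))  # True
```

Keeping the risky list short and explicit makes it easy to review, and easy to shrink later as trust in the loop grows.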

6 The Bottleneck Always Moves

What Karpathy killed wasn't the prompt itself, but the belief that the prompt is the final unit of skill. Leverage moved from the model to the harness, and inside the harness it doesn't stay in one place either. Solve coding and planning becomes the bottleneck; solve planning and verification does; solve verification and taste does. This is why the last of the nine rules is what keeps the rest alive. Harness engineering isn't the skill of catching a bottleneck once—it's the habit of following the bottleneck as it moves.

Even after yesterday's bottleneck is plugged, the next one has already moved to a new address. Lay that migration out on a timeline and it looks like this.

▲ Bottleneck migration timeline: solving each stage reveals the next | Pebblous original diagram

At this point we're not spectating someone else's future. Pebblous builds industrial AI, Physical AI, and autonomous data pipelines that gather, refine, verify, and iterate on data automatically. A pipeline that handles data automatically and a pipeline that plans, verifies, and publishes writing automatically share the same loop structure: gather, reason, act, verify, repeat. So the nine rules of LOOPS.md aren't "theory to consult someday"—they're an operating manual today.

In particular, three rules—keep state on disk (IV), grade the subjective as a contract (VI), and read the trace (VII)—are the very principles of data quality and reproducibility. Refusing to trust summarized memory and judging by raw output is the same stance as looking at source records rather than summary statistics when diagnosing data. Rule II, that self-grading converges to sycophancy, is the agent-side version of how the quality of training and generated data breaks down. We write this not to sell a product, but to test our own operations with those principles and disclose the holes. Selling a theory and testing yourself with a theory are different things.

So the conclusion returns to the opening sentence. It's the weak harness, not the weak model, that kills your agent. Next time your agent dies running overnight, look at the loop before you swap the model. Nine times out of ten, that's where the culprit is.

Pebblous Data Communication Team
July 1, 2026

References

Academic Papers

1. Fanous et al. (2025). Measuring Sycophancy in Large Language Models. arXiv. (58.2% sycophancy rate under self-evaluation; 14.7% answer reversals under user pushback)
2. Wang et al. (2025). False Belief Alignment in Large Language Models. arXiv. (63.7% average false-belief agreement across 7 model families; range: 46.6–95.1%)
3. Panickssery et al. (2024). LLM Evaluators Recognize and Favor Their Own Generations. NeurIPS 2024.
4. Liu, N. F. et al. (2024). Lost in the Middle: How Language Models Use Long Contexts. Transactions of the Association for Computational Linguistics. (30–50% accuracy drop for mid-position information; U-shaped degradation across 18 frontier models)
5. Modarressi, A. et al. (2025). NoLiMa: Long-Context Evaluation Beyond Literal Matching. arXiv. (GPT-4o accuracy: 99.3% at 1K tokens → 69.7% at 32K tokens)
6. Yao, S. et al. (2024). τ-Bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains. arXiv. (GPT-4o pass^1: 61% vs. pass^8: 25%, where pass^k is the rate at which all k repeated trials succeed)

Industry & Technology

7. Karpathy, A. (2026). LOOPS.md: Field Notes on Agents That Run for Days (v060726). (Nine rules for long-running agent loops. Wording draws on v060726 and community-maintained documents.)
8. Karpathy, A. (2026). nanochat/autoresearch (MIT open source). (126 experiments per overnight run; 910 experiments in 8 hours for $309 via 35 parallel agents; discovered a QK-Norm bug missed for 20 years)
9. Scale AI SEAL; Particula Tech; Digital Applied (2026). SWE-bench Verified & Pro Leaderboards and Harness Effect Analysis. (Same model, harness change only: 42% → 78% swing; Grok Code Fast: 6.7% → 68.3%)
10. Cognition AI (2026). Devin PR Merge-Rate Progress Report. (Devin PR merge rate: 34% → 67% over 18 months)

Data & Reports

11. METR (2025). Measuring AI Ability to Complete Long Tasks (Time Horizon 1.1). (Claude Opus 4.6: 50% reliability horizon ~12 hours; 80% reliability horizon 1h 10m. 50% time horizon doubles every 7 months.)
12. Dynatrace (2026). Pulse of Agentic AI 2026. N=919 global leaders. (69% of agent decisions require human verification; 13% fully autonomous organizations; 40% project failure rate)
13. Pebblous Data Communication Team (2026). Autonomous Publishing Pipeline Operations Log. Cited anonymously. (Field test by rule: Rules IV & VII passing; Rules V & VIII gaps identified)