Executive Summary
The frontier of AI writing is shifting. It is moving away from the skill of crafting good prompts and toward the ability to measure output quality and then use that measurement to fix the prompt automatically. Autoresearch, the pattern Andrej Karpathy proposed for optimizing machine-learning code, is disarmingly simple: inside a loop, keep only the changes that beat the current best. Ole Lehmann took that same pattern and applied it not to code but to Claude's content skills, reporting that he raised a copy-quality pass rate from 56% to 92%. One line he left behind is where this report starts: "Good Autoresearch depends not on a good prompt, but on a good eval."
Look at Pebblous's multi-agent pipeline through that lens and you find that we already run half of the evaluation function. ko-prose-humanizer, which scores Korean prose against eleven stylistic markers, and seo-check, which audits four layers of SEO, are in effect the Judge that Autoresearch describes. That evaluation function is not theoretical: across five articles in our AI-governance series, it cut em-dash overuse by 46.6% — a concrete case of measurement catching a defect. What is missing is the other half. We have no golden test set, no mutation engine to vary a skill's prompt, no score-based automatic rollback, and no meta-loop that improves the skill itself. Our pipeline enriches a single article beautifully, but the skill that produced it never gets better on its own.
The conclusion of this report is that the two systems are not competitors but complements. Autoresearch is narrow, deep automated optimization; Pebblous is broad, rich, one-pass generation with human judgment and multilingual rewriting layered on top. The path forward is not to remove the human checkpoints but to add quantitative scores at the points where humans already look. This report proposes that port, step by step, from an operator's seat.
| Metric | Value | Source or note |
|---|---|---|
| Autoresearch copy pass rate | 56% → 92% | 4 rounds, ~$15; Ole Lehmann self-report (2026) |
| Gain in binary-judgment agreement | about +20pp | LLM–human agreement when scales become pass/fail |
| Em-dash reduction (5 articles) | -46.6% | Measured by Pebblous via ko-prose-humanizer |
| Cost saved by model routing | 40–60% | Top tier only for heavy reasoning; lower tier for the rest |
What Autoresearch Is: Treating the Prompt as a Production Loop
Autoresearch is not a heavy framework. The code Karpathy released on March 7, 2026 is 630 lines of Python, and the core idea fits in a sentence: an agent makes one change to the code, runs a time-boxed experiment, measures a validation metric, keeps the change if things improved and reverts it if they got worse — then repeats. Karpathy ran this loop 700 times over two days on a single GPU, found 20 genuine improvements, and, applied to a larger model, lifted training speed by 11%. The loop caught defects a human had missed by hand for two decades, such as a missing normalization step.
What made the loop catch on so quickly is that it worked beyond Karpathy's own experiments, in someone else's production code. Shopify's Tobi Lütke reported running the same pattern on his company's template engine, Liquid, nearly halving rendering time that humans had already tuned and cutting memory use substantially. The result kept only the changes that passed every unit test across hundreds of automated experiments. That is a second piece of evidence that the loop can find improvements seasoned developers had missed, in real product code rather than a toy model. So the next question follows naturally: what if the artifact were not code, but prose?
The direct source for this report is not the code. It is Ole Lehmann moving the same pattern to a content skill. He started from the fact that his landing-page copy skill failed 44% of its quality checks, then had an agent revise the skill's prompt, compare scores, and repeat keep-or-rollback unattended. After four rounds and roughly $15 in cost, the pass rate reached 92%. Three changes survived: a rule to put a concrete number or outcome in the headline, a banned-buzzword list, and — more powerful than any rule — a worked example dropped directly into the prompt. One change was rolled back: a rule that capped character count too tightly, because the copy thinned out and the call to action weakened.
One caveat to state plainly: the 56%-to-92% figure is Ole Lehmann's own self-report, with no independent reproduction and no disclosed test-set size. The pass rate measures how much copy clears a pre-filter that screens out low quality — not an actual lift in conversion. Cite it as a striking case, but do not treat it as a verified, general effect.
The Five Pieces That Make Up the Loop
Break the pattern into parts and five remain: Mutation, which proposes one candidate change at a time; Execution, which produces the artifact under that change; the Judge, which scores the artifact; the Orchestrator, which keeps the change if the score rises and reverts it if the score falls; and the Changelog, which records what was changed and why. The real product here is not the prompt — it is the Judge. If the Judge wobbles, the whole loop optimizes in the wrong direction.
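The five pieces can be sketched in a few lines of Python. This is a minimal illustration, not Karpathy's or Ole Lehmann's code: the prompt is modeled as a tuple of rules, and `toy_execute`, `toy_judge`, and the `TOY_EFFECT` numbers are invented stand-ins for a real generation call and a real checklist.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterator

Prompt = tuple[str, ...]  # assumption: a skill prompt modeled as an ordered set of rules


@dataclass
class LoopState:
    prompt: Prompt
    score: int
    changelog: list[dict] = field(default_factory=list)


def run_loop(state: LoopState,
             mutations: Iterator[str],
             execute: Callable[[Prompt], str],
             judge: Callable[[str], int]) -> LoopState:
    for rule in mutations:                  # Mutation: one candidate change at a time
        candidate = state.prompt + (rule,)
        artifact = execute(candidate)       # Execution: produce the artifact
        score = judge(artifact)             # Judge: score the artifact
        kept = score > state.score          # Orchestrator: keep only strict improvements
        state.changelog.append(             # Changelog: what changed and how the score moved
            {"rule": rule, "before": state.score, "after": score, "kept": kept})
        if kept:
            state.prompt, state.score = candidate, score
    return state


# Toy stand-ins with made-up effects, only to make the loop runnable.
TOY_EFFECT = {"number in headline": 1, "ban buzzwords": 1,
              "tight character cap": -1, "worked example": 2}


def toy_execute(prompt: Prompt) -> str:
    return "|".join(prompt)


def toy_judge(artifact: str) -> int:
    rules = [r for r in artifact.split("|") if r]
    return 5 + sum(TOY_EFFECT[r] for r in rules)  # checks passed out of 10


start = LoopState(prompt=(), score=toy_judge(toy_execute(())))
state = run_loop(start, iter(TOY_EFFECT), toy_execute, toy_judge)
for entry in state.changelog:
    print(entry)
```

In this toy run the character-cap rule lowers the score and is reverted, while the other three rules are kept. Everything downstream depends on `judge`: swap in a noisy or biased scoring function and the same loop will faithfully keep the wrong changes.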
That is why Ole's checklist design rules matter. Keep items as three to six binary questions: Does the headline contain a concrete number? Is it free of buzzwords? Is the call to action specific? Consistent judgment, not precise scoring, is the design principle. Too few items and you miss quality; more than six and the model games the item scores while real reader quality drops — overfitting. An evaluation function should be a floor that blocks low quality, not a ceiling that suppresses creativity.
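A binary checklist of this kind might look like the sketch below. The three questions come from the text, but the way each one is tested (a digit in the headline, a small banned-word set, a list of generic calls to action) is a crude assumption chosen for illustration; in practice each question would more likely be answered by an LLM judge.

```python
import re

# Hypothetical lists, for illustration only.
BUZZWORDS = {"revolutionary", "seamless", "game-changing"}
GENERIC_CTAS = {"learn more", "click here", "get started"}


def headline_has_number(copy: dict) -> bool:
    return bool(re.search(r"\d", copy["headline"]))


def free_of_buzzwords(copy: dict) -> bool:
    text = (copy["headline"] + " " + copy["body"]).lower()
    words = set(re.findall(r"[a-z\-]+", text))
    return not (words & BUZZWORDS)


def cta_is_specific(copy: dict) -> bool:
    return copy["cta"].strip().lower() not in GENERIC_CTAS


CHECKS = [headline_has_number, free_of_buzzwords, cta_is_specific]


def check(copy: dict) -> dict[str, bool]:
    """Answer every binary question for one piece of copy."""
    return {fn.__name__: fn(copy) for fn in CHECKS}


def passes(copy: dict) -> bool:
    """A piece of copy passes only if every answer is yes."""
    return all(check(copy).values())


samples = [
    {"headline": "Cut onboarding time by 40%",
     "body": "Set up in one afternoon.", "cta": "Start your 14-day trial"},
    {"headline": "A revolutionary platform",
     "body": "Seamless for every team.", "cta": "Learn more"},
    {"headline": "Ship 2x faster",
     "body": "Fewer meetings, more releases.", "cta": "Learn more"},
]
results = [passes(s) for s in samples]
print(results, f"pass rate: {sum(results)}/{len(results)}")
```

Because every item returns only yes or no, two runs over the same copy are easy to compare, which is the consistency the design rule is after.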
The Academic Lineage: Not a New Invention, a Simplification
The discipline of "one change at a time, then keep or roll back by score" is a practical simplification of a line of work already validated in academia. DSPy, which compiles and optimizes prompts as declarative modules, raised accuracy by 32% on multi-hop reasoning and 45% on math reasoning. OPRO, where a meta-LLM iteratively generates new prompts; EvoPrompt, which evolves prompts through crossover and mutation; and TextGrad, which treats natural-language feedback like a gradient, all belong to the same family. Effect sizes range widely with the task — from 7% to 45% — and show up most strongly in structured reasoning. Ole's reported +36pp sits at the high end of that range, but it bears repeating that it remains a single self-report.
Anatomy of the Pebblous Pipeline: Multi-Agent Plus Human Confirmation
The process Pebblous uses to produce a deep-research report is not a single block of prompt but a pipeline of agents working in sequence. When a topic comes in, duplication and value are reviewed in parallel; a human confirms once; then planning begins. From there the work forks three ways to research academic papers, industry developments, and data simultaneously, and a synthesis stage binds those into one document. Writing, a second human review, five stages of quality reinforcement, English rewriting, SEO and social, and finally publishing follow in turn. The very article you are reading went through that process.
Two design choices are worth noting. The first is tiered model assignment. Only the reasoning-heavy nodes — planning, paper research, synthesis, writing — use a top-tier model, while collection, execution, and publishing are handed to a lower tier. The top and lower tiers differ by roughly fivefold in price per output token, yet the gap on coding benchmarks is only a few percentage points. So routing only the heavy-reasoning nodes to the top tier saves 40–60% versus a single model while holding planning quality steady; add caching and batching and the savings grow further. This is Pebblous's implementation of the Autoresearch idea: plan well once, then execute cheaply, many times.
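As a rough check on the savings figure, the arithmetic can be written out. The sketch assumes that cost is driven by output tokens only, that the price ratio is exactly five, and that the only free variable is the share of output tokens still produced by the top tier; caching and batching are ignored. The function name and the shares tried are illustrative, not measured values from the pipeline.

```python
def routing_savings(top_share: float, price_ratio: float = 5.0) -> float:
    """Fraction of output-token cost saved versus sending everything to the top tier.

    top_share   : share of output tokens still produced by the top tier (assumed)
    price_ratio : top-tier price per output token divided by the lower-tier price
    """
    if not 0.0 <= top_share <= 1.0:
        raise ValueError("top_share must be between 0 and 1")
    # Cost in units of the lower-tier price per token.
    routed_cost = top_share * price_ratio + (1.0 - top_share) * 1.0
    all_top_cost = price_ratio
    return 1.0 - routed_cost / all_top_cost


for share in (0.50, 0.25):
    print(f"top-tier share {share:.0%}: savings {routing_savings(share):.0%}")
```

Under these assumptions, keeping half of the output tokens on the top tier saves 40%, and keeping a quarter saves 60%, so the 40–60% band corresponds to a top-tier share of roughly a quarter to a half.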
The second is the human confirmation gate. Right after the initial review and right after the first draft, the pipeline stops and waits for human judgment at two points. Reading that as unfinished automation is a mistake. Pebblous's aim is not maximal automation but measurement-driven iteration that keeps human judgment as a core gate. The gate is not a weakness; it is intended design.
We Already Have a Judge
Inside the pipeline, two tools already run that correspond to Autoresearch's Judge. ko-prose-humanizer catches the tells of AI-written prose across eleven markers and scores them out of 110 — patterns like em-dash restatement, a monotony of nominal sentence endings, meta-announcements, and the contrived pivot to the company at the end of a piece. If the score crosses a threshold, it forces an automatic correction; if it falls in the passing band, it lets the text through. seo-check audits four layers — meta tags, OG and Twitter cards, JSON-LD schema, and Search Console — and everything must pass before the work moves on.
The proof that this evaluation function is more than talk lives in our own writing. When five articles in the AI-governance series were corrected with ko-prose-humanizer, the total em-dash count fell from 545 to 291 — a 46.6% drop. Meta-announcements, which appeared five to ten times per article, disappeared entirely, and the closing pivot to the company was split out into a separate editor's note. Measurement actually caught the defect. It is self-referential, but it is honest evidence.
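The reduction figure is simple to verify. The snippet below is only an arithmetic check plus a naive counter; it is not how ko-prose-humanizer itself works, and both function names are made up for this example.

```python
def count_em_dashes(text: str) -> int:
    """Count em-dash characters (U+2014) in a text."""
    return text.count("\u2014")


def percent_reduction(before: int, after: int) -> float:
    """Percentage drop from `before` to `after`."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before * 100


# Totals reported for the five AI-governance articles.
reduction = percent_reduction(545, 291)
print(f"{reduction:.1f}% fewer em-dashes")
```

A drop of 254 out of 545 rounds to the 46.6% cited above.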
Head-to-Head: A Symmetry of Philosophy and Architecture
Put the two systems on the same axes and what emerges is not a ranking but a difference in character. Autoresearch is a tool for automatically drilling deep into a single prompt; the Pebblous pipeline is a process for completing one article broadly with human judgment woven in. The table below lays out how the two diverge across eight axes.
| Axis | Autoresearch self-improving loop | Pebblous pipeline |
|---|---|---|
| Unit of improvement | The skill prompt (the generator itself) | A single article (the artifact) |
| Evaluation method | Automatic scoring of a binary checklist | 11 stylistic markers + 4-layer SEO |
| Who iterates | An unattended loop (the agent) | Agent generation + human confirmation |
| Rollback | Automatic revert when the score drops | Manual revision based on human review |
| Multilingual | Single output (bound to the target) | Parallel KO/EN rewriting |
| Narrative quality | Strong on short-copy, sentence-level optimization | Strong on multi-stage narrative and contextual coherence |
| Human involvement | Minimal (only at design time) | Constant, via two confirmation gates |
| Reproducibility | Tracked via scores and a changelog | Tracked via run logs (no skill-improvement history) |
The difference between the two systems comes down to a single sentence: Autoresearch drills narrow and deep through automated optimization, while Pebblous completes one article broadly, layering human judgment and multiple languages onto a single pass. One is good at improving the skill that produces the artifact; the other is good at carrying one long article through to the end, fitted to its context. The two are not competing over the same job.
What Each System Misses: The Other's Blind Spot
The real value of the comparison is in exposing each system's blind spot. Each one misses something precise, in the shadow of what it does well.
What Pebblous lacks
- A golden test set of representative inputs
- A mutation engine that varies skill prompts automatically
- Automatic rollback when the score drops
- A meta-loop that improves the skill itself
- A methodology-level changelog (today's changelog is only a content log)
What Autoresearch lacks
- The flow and completeness of a multi-stage narrative
- Factual coherence and contextual judgment
- Multilingual rewriting (localization, not literal translation)
- The editor's final call
- The reader experience a narrow checklist can't see
To put it plainly: Pebblous measures, but it does not use that measurement to fix the skill. We score the prose of a single article, but the step that would take that score, mutate ko-prose-humanizer itself, and validate the result as a better version does not exist yet. Conversely, Autoresearch optimizes a narrow checklist quickly, but the grain of a narrative or an error of fact that the checklist cannot catch will never come into view. Each empty space is precisely the other's strength.
The Path Forward: Porting a Self-Improving Loop into the Pipeline
If we already hold half of the evaluation function, the way to fill the other half is not to build from scratch but to close the loop around the judges we already have. No heavy infrastructure is needed — remember that Ole's case cost about $15. The five steps below add quantitative scores at the points where humans already look, without removing the human confirmation gates.
Step 1. Formalize the judges into binary evaluation functions
Rewrite ko-prose-humanizer, seo-check, and articles.json validation as binary checklists with clear pass/fail. The finding that switching from scales to binary raises LLM–human judgment agreement by about 20pp backs this choice.
Step 2. Build a golden input set
Pick 10–20 inputs from representative topics and existing articles and freeze them as a golden set. Every time a skill changes, re-score against this set as a regression check, so that a change which helped one article doesn't quietly break another.
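A regression check over a golden set can be as small as the sketch below. The input IDs (`g01` and so on) and the pass/fail values are hypothetical; each value stands for whether that golden input passed the whole binary checklist under a given version of the skill.

```python
def compare_runs(baseline: dict[str, bool], candidate: dict[str, bool]) -> dict:
    """Compare per-input pass/fail results before and after a skill change."""
    regressions = sorted(
        k for k, ok in baseline.items() if ok and not candidate.get(k, False))
    fixes = sorted(
        k for k, ok in baseline.items() if not ok and candidate.get(k, False))
    return {
        "baseline_passes": sum(baseline.values()),
        "candidate_passes": sum(candidate.get(k, False) for k in baseline),
        "regressions": regressions,  # passed before, fails now
        "fixes": fixes,              # failed before, passes now
    }


# Hypothetical results for a four-item golden set.
baseline = {"g01": True, "g02": True, "g03": False, "g04": True}
candidate = {"g01": True, "g02": False, "g03": True, "g04": True}

report = compare_runs(baseline, candidate)
print(report)
```

Here the total stays at three passes out of four, yet one input regressed while another was fixed. Looking only at the aggregate score would hide that trade, which is why the per-input view is worth keeping.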
Step 3. Mutate one rule at a time
Change the skill prompt one rule at a time. Touch several rules at once and you can't tell which change moved the score. This single line is Autoresearch's simplest and most important discipline.
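One way to enforce that discipline is to generate candidates that differ from the current prompt by exactly one rule. The sketch below assumes the skill prompt can be represented as a list of distinct rules; the rule texts and the function name are placeholders, not the contents of any real skill.

```python
def single_rule_candidates(rules: list[str],
                           additions: list[str]) -> list[tuple[str, list[str]]]:
    """Return (label, candidate_rules) pairs, each one rule away from `rules`."""
    candidates = []
    for i, rule in enumerate(rules):      # drop exactly one existing rule
        candidates.append((f"drop: {rule}", rules[:i] + rules[i + 1:]))
    for new_rule in additions:            # or add exactly one new rule
        candidates.append((f"add: {new_rule}", rules + [new_rule]))
    return candidates


# Placeholder rules, for illustration only.
current = ["avoid meta-announcements", "vary sentence endings"]
proposed = ["limit restating dashes", "move the company pivot to an editor's note"]

candidates = single_rule_candidates(current, proposed)
for label, _ in candidates:
    print(label)
```

Each candidate is scored on its own, so any movement in the score can be attributed to the single rule named in its label.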
Step 4. Keep or roll back by golden-set score
After a mutation, keep the change if the golden-set score rises and revert it if it falls. At the confirmation gate, the human verifies the result of that automatic decision with scores and evidence, not gut feel.
Step 5. Accumulate a methodology changelog
Record what was changed and why, and how the score moved, separately from the content log. As this history builds, the pipeline becomes a system that can explain how it improved itself.
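A methodology changelog needs very little structure. The sketch below writes one JSON object per line; the field names, the example entries, and the scores are all hypothetical, and an in-memory buffer stands in for whatever file the pipeline would actually append to.

```python
import io
import json
from dataclasses import asdict, dataclass


@dataclass
class MethodologyEntry:
    skill: str          # which skill prompt was changed
    change: str         # what was changed
    reason: str         # why it was tried
    score_before: int   # golden-set passes before the change
    score_after: int    # golden-set passes after the change
    kept: bool          # whether the change survived


def append_entry(stream, entry: MethodologyEntry) -> None:
    """Append one entry as a single JSON line."""
    stream.write(json.dumps(asdict(entry), ensure_ascii=False) + "\n")


log = io.StringIO()  # stand-in for an append-only log file
append_entry(log, MethodologyEntry(
    "ko-prose-humanizer", "add worked example", "rules alone were ignored", 12, 14, True))
append_entry(log, MethodologyEntry(
    "ko-prose-humanizer", "cap paragraph length", "reduce padding", 14, 13, False))

entries = [json.loads(line) for line in log.getvalue().splitlines()]
for e in entries:
    print(e["change"], e["score_after"] - e["score_before"], e["kept"])
```

Keeping rejected changes in the log matters as much as keeping accepted ones, since a reverted rule is evidence about what the judge penalizes.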
Cost and Failure Modes
The scale is small. You start not with a heavy platform but with a handful of binary checklists, 10–20 golden inputs, and the discipline of comparing scores after one change at a time. The failure modes, though, are clear. First, overfitting: only the items the evaluation function catches improve, and the articles all start to look alike. That is why the evaluation function must be a floor, and why the golden set must be refreshed each quarter so the criteria do not go stale. Second, judge reliability: LLM judgment agrees with human judgment a little over 80% of the time on general tasks but only 60–68% of the time in specialist domains, and the verdict can shift when you merely reorder the answers being compared. So the core items keep a human in the loop, and comparative judgments need a correction: score each pair in both orders and average the two results.
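The order-swap correction for judge reliability can be illustrated with a deliberately biased toy judge. The sketch assumes a pairwise judge that returns the probability that the first-shown answer is better; `toy_judge`, its fixed first-position bonus, and the quality values are invented for the demonstration.

```python
from typing import Callable

PairJudge = Callable[[str, str], float]  # P(first-shown answer beats the second)


def debiased_preference(judge: PairJudge, a: str, b: str) -> float:
    """Probability that `a` beats `b`, averaged over both presentation orders."""
    forward = judge(a, b)           # a shown first
    backward = 1.0 - judge(b, a)    # b shown first, converted to "a wins"
    return (forward + backward) / 2


# Toy judge: true preference plus a fixed bonus for whatever is shown first.
QUALITY = {"A": 0.75, "B": 0.5}
FIRST_POSITION_BONUS = 0.125


def toy_judge(first: str, second: str) -> float:
    unbiased = 0.5 + (QUALITY[first] - QUALITY[second])
    return unbiased + FIRST_POSITION_BONUS


biased = toy_judge("A", "B")
corrected = debiased_preference(toy_judge, "A", "B")
print(biased, corrected)
```

With an additive bonus like this one, the bias cancels exactly when the two orders are averaged. Real position bias is unlikely to be that tidy, so the swap reduces the effect rather than guaranteeing its removal.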
The market has agent frameworks that generate well and observability tools that evaluate well, and they sit apart. Evaluation tooling has already been elevated to strategic infrastructure: one such company was valued at $800 million, and another was acquired by a model lab. What makes Pebblous rare is that it runs the generation pipeline, its own evaluation function, and a human confirmation gate as a single body. Add a self-improving loop and it becomes a content process that gets better on its own, by measurement. And the same skeleton transfers directly to a data-quality pipeline: the discipline of measuring output, changing one thing at a time, and keeping or rolling back by score works as-is for synthetic-data generation and label-quality correction. The content evaluation function is simply swapped for a data-quality metric.
Editor's note. This report took the pipeline Pebblous operates as both the object of comparison and the object of improvement. Pebblous's data-quality work — AI-Ready Data, DataClinic, synthetic data — and its content pipeline share the same skeleton: measure the output and fix it inside a loop. We used our own system as the example not to promote it, but to show exactly how an operator diagnoses and fills the empty spaces in their own system.
References
Primary sources: the Autoresearch original and its application
- 1. Lehmann, O. (2026). "Karpathy's Autoresearch method for Claude Skills." Threads @itsolelehmann. (Original 56%→92% self-report.)
- 2. Karpathy, A. (2026). "autoresearch." GitHub (MIT License, released 2026-03-07).
- 3. Agent Cookbook. (2026). "How to 10x your Claude Skills using Karpathy's Autoresearch method." (Reproduction details.)
- 4. VentureBeat. (2026). "Andrej Karpathy's new open-source autoresearch." VentureBeat; cross-checked with Fortune and NextBigFuture (700 experiments, 11% improvement).
Academic: automatic prompt optimization and LLM-as-Judge
- 5. Yang, C. et al. (2023). "Large Language Models as Optimizers (OPRO)." arXiv.
- 6. Khattab, O. et al. (2023). "DSPy: Compiling Declarative Language Model Calls." Stanford NLP. (HotpotQA +32%, GSM-8K +45%.)
- 7. Zheng, L. et al. (2024). "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena." NeurIPS.
- 8. Evidently AI. "LLM-as-a-judge: a complete guide." (Binary vs. scale reliability, +20pp.)
- 9. Galileo AI. "LLM-as-a-Judge vs Human Evaluation." (General 80%+, specialist domains 60–68%.)
Industry, market, and tooling
- 10. Caylent. (2026). "Claude Haiku deep dive: cost, capabilities, and the multi-agent opportunity." (Tiered models, routing savings.)
- 11. Braintrust. (2025). "Best prompt evaluation tools 2025." (Eval-tool ecosystem; Series B $80M, $800M valuation.)
- 12. Precedence Research. "Generative AI in Content Creation Market." ($19.75B → $143.09B, CAGR 21.9%.)
Note: the 56%→92% figure is Ole Lehmann's single self-report, with no independent reproduction; every in-text citation marks it as a self-report. Model prices and performance are described by tier and price ratio to avoid naming a specific version. Some market sizes vary by source depending on the definition (broad vs. narrow scope).