Outcomes — Why Self-Grading Agents Are the Quality Control Layer Agentic AI Was Missing

T. Krause

An agent that does the work and also judges whether the work is good enough sounds redundant. It isn't. Anthropic's Outcomes feature separates doing from grading — and that separation is what turns an agent from something you have to check into something you can trust to check itself.

There's a failure mode that quietly limits how much real work gets handed to AI agents, and it isn't a lack of capability. It's that an agent doing a task is a poor judge of whether it did the task well. The same model, in the same session, that produced the work also evaluates the work — and it tends to evaluate it generously. The agent finishes, declares success, and is often genuinely wrong about that.

This is why so much agentic work still requires a human to check it. Not because the agent can't do the work, but because the agent can't reliably tell when its work is bad. The human isn't there to do the task. The human is there to be the quality gate the agent doesn't have.

Anthropic's Outcomes feature, announced at Code with Claude 2026, attacks exactly this. Outcomes uses a separate grading agent to score completed work and re-run tasks that fall short, and on internal benchmarks it lifted PowerPoint generation quality by 10.1%. The number matters less than the structural idea behind it: separating the agent that does the work from the agent that judges it.

Why Self-Grading Fails When It's the Same Agent

The problem with an agent grading its own work is not laziness or a bug. It is structural, and naming the structure shows why a separate grader helps.

The doer is committed to its own choices. An agent that produced a piece of work made a chain of decisions to get there. Asked to evaluate the result, it evaluates from inside that chain — the same assumptions that produced the work now shape the judgment of it. A flawed assumption doesn't get caught, because the thing checking for flaws shares the assumption. You cannot fully audit a process from inside the process.

Doing and judging are different tasks that interfere. Producing work optimizes for completion. Evaluating work optimizes for finding fault. Asking one agent in one session to do both means neither runs cleanly — the completion drive bleeds into the evaluation, and "good enough to finish" quietly substitutes for "actually good." The mindsets are different, and merging them degrades the second one.

Self-assessment lacks an independent standard. When an agent grades its own output, the standard it grades against is the same standard it used while producing — there is no external reference. A separate grading agent can hold an independent rubric, one defined by what good output requires rather than by what this particular run happened to produce. Independence is the whole point.

What Outcomes Actually Does

Outcomes is best understood not as a smarter agent but as the addition of a missing role.

It introduces a dedicated grader. Outcomes uses a separate agent whose only job is to score completed work. That agent didn't do the task, isn't committed to the choices that produced it, and isn't carrying the completion drive. It approaches the work the way a reviewer does — looking for what's wrong, not for confirmation that it's done.

It closes the loop by re-running. Grading alone would just be measurement. Outcomes acts on the grade: work that falls short is re-run. That makes it a genuine quality-control loop — do, grade, and if the grade is low, do again. The agent system iterates toward an acceptable result instead of stopping at its first attempt.

It externalizes the standard. Because the grader is separate, the definition of "good" lives with the grader, not buried in the doer's session. That makes the quality bar inspectable and adjustable. You can see what the work is being held to, and you can change it — which is the foundation of trusting the loop.

It mirrors how human quality control already works. No serious organization lets the person who did the work be the sole judge of whether it's acceptable. Editors, reviewers, QA, audit — all are the same pattern: separate the doing from the judging. Outcomes is that long-proven pattern, applied to agents. Its plausibility comes from how unremarkable the idea is in every other context.
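The do, grade, re-run loop described above can be sketched in a few lines. This is a minimal illustration of the pattern, not Anthropic's implementation or API: `run_with_grading`, `Grade`, the 0.8 threshold, the attempt cap, and the toy doer and grader are all hypothetical names and values chosen for the example. The one design point it tries to show is that the doer and the grader are separate callables, and the only thing that flows back to the doer is the grader's feedback.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable


@dataclass
class Grade:
    score: float   # 0.0 to 1.0, assigned by the grader (assumed scale)
    feedback: str  # what the grader found lacking


def run_with_grading(
    task: str,
    do_work: Callable[[str, str], str],       # (task, feedback) -> output
    grade_work: Callable[[str, str], Grade],  # (task, output) -> Grade
    threshold: float = 0.8,
    max_attempts: int = 3,
) -> tuple[str, Grade, int]:
    """Do, grade, and re-run until the grade clears the threshold.

    Returns the last output, its grade, and the number of attempts used.
    If no attempt passes, the final (failing) attempt is returned so the
    caller can route it to a human.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    feedback = ""
    for attempt in range(1, max_attempts + 1):
        output = do_work(task, feedback)
        grade = grade_work(task, output)  # grader never sees the doer's reasoning
        if grade.score >= threshold:
            break
        feedback = grade.feedback         # only the critique flows back
    return output, grade, attempt


# Toy stand-ins for real agents, for illustration only.
def toy_doer(task: str, feedback: str) -> str:
    body = f"Report on {task}."
    if "summary" in feedback.lower():
        body += " Summary: key points."
    return body


def toy_grader(task: str, output: str) -> Grade:
    if "Summary:" in output:
        return Grade(1.0, "")
    return Grade(0.4, "missing summary section")
```

In a real system the two callables would wrap separate agent sessions. The attempt cap matters: without it, a grader that can never be satisfied turns the loop into an unbounded cost.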

Where This Changes What You Can Delegate

A reliable grading loop changes the calculus of delegation, and it does so differently across kinds of work.

High-volume repetitive work. This is the clearest win. When an agent produces many similar outputs (reports, summaries, structured documents), checking each by hand defeats the point of delegating. A grading loop checks every item automatically and re-runs the weak ones. The human reviews the grader's exceptions, not the whole output. The PowerPoint benchmark gain is one concrete instance of this kind of structured, repeatable output.

Work with checkable quality criteria. Outcomes is strongest where "good" can be expressed as a standard the grader can apply — completeness, internal consistency, conformance to a format or rubric. The clearer the criteria, the more reliable the loop. Defining those criteria well is now part of the work.

Work where quality is genuinely subjective. Where "good" depends on taste, strategic fit, or context the grader doesn't have, a grading agent helps less. It can catch the objective failures, but the final judgment still needs a human. Honest deployment names this boundary rather than pretending the loop covers everything.
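One way to make the boundary between checkable and subjective quality explicit is to write it into the rubric itself. The sketch below is an assumption-laden illustration, not a format Outcomes is known to use: `Criterion`, `Rubric`, `RubricResult`, the report fields, and the listed human dimensions are all hypothetical. Objective criteria are plain predicates the grader can apply; subjective dimensions are listed by name and always handed to a person.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable


@dataclass(frozen=True)
class Criterion:
    name: str
    check: Callable[[dict], bool]  # True when the output meets the criterion


@dataclass
class RubricResult:
    failed: list[str]       # objective criteria the output did not meet
    needs_human: list[str]  # subjective dimensions the grader does not judge

    @property
    def passed(self) -> bool:
        return not self.failed


@dataclass
class Rubric:
    criteria: list[Criterion]
    human_dimensions: list[str] = field(default_factory=list)

    def grade(self, output: dict) -> RubricResult:
        failed = [c.name for c in self.criteria if not c.check(output)]
        return RubricResult(failed=failed, needs_human=list(self.human_dimensions))


# Hypothetical rubric for a structured report.
report_rubric = Rubric(
    criteria=[
        # Completeness: a non-empty title is present.
        Criterion("has_title", lambda r: bool(r.get("title", "").strip())),
        # Conformance to format: required sections exist.
        Criterion(
            "has_required_sections",
            lambda r: {"summary", "findings"} <= set(r.get("sections", [])),
        ),
        # Internal consistency: the stated total equals the sum of its parts.
        Criterion(
            "total_matches_line_items",
            lambda r: r.get("total") == sum(r.get("line_items", [])),
        ),
    ],
    human_dimensions=["tone", "strategic fit"],
)
```

Because the rubric is ordinary data, it can be read, reviewed, and versioned independently of any single run, which is what makes the quality bar inspectable.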

What to Actually Do About It

Getting value from grading loops takes deliberate setup, not just enabling a feature.

Invest in defining what "good" means. The grading loop is exactly as good as the standard it grades against. A vague rubric produces a grader that passes weak work. Spend real effort writing down what good output requires for each task — that artifact is now a core part of the system, not an afterthought.

Audit the grader, not just the doer. It's tempting to trust the loop and stop looking. Don't, at least not at first. Periodically compare the grader's verdicts with your own on the same sample of outputs. A miscalibrated grader can be more dangerous than no grader, because it manufactures false confidence at scale.
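A grader audit can be as simple as comparing pass/fail verdicts on a sample that a human has also judged. The sketch below assumes each audited item is reduced to a pair of booleans (grader verdict, human verdict); `audit_grader` and `AuditReport` are hypothetical names. The false-pass rate is the number to watch, since it measures how often weak work gets through.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class AuditReport:
    agreement: float        # share of items where grader and human agree
    false_pass_rate: float  # share of human-rejected items the grader passed
    false_fail_rate: float  # share of human-accepted items the grader failed


def audit_grader(pairs: list[tuple[bool, bool]]) -> AuditReport:
    """Compare verdicts given as (grader_passed, human_passed) pairs."""
    if not pairs:
        raise ValueError("need at least one audited item")

    agree = sum(g == h for g, h in pairs)
    # Grader verdicts, split by what the human decided.
    on_human_reject = [g for g, h in pairs if not h]
    on_human_accept = [g for g, h in pairs if h]

    false_pass = (
        sum(on_human_reject) / len(on_human_reject) if on_human_reject else 0.0
    )
    false_fail = (
        sum(not g for g in on_human_accept) / len(on_human_accept)
        if on_human_accept
        else 0.0
    )
    return AuditReport(agree / len(pairs), false_pass, false_fail)
```

What counts as an acceptable false-pass rate depends on the cost of a bad output reaching its audience. The code only measures it; it does not say when the grader is good enough.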

Move human review to the exceptions. Once you trust the loop, restructure the human role. Stop reviewing all output; review what the grader flagged, plus a sample of what it passed. This is the actual productivity gain — human attention spent on the uncertain cases, not the whole pile.
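As a rough sketch of that restructured review, the function below builds a queue from everything the grader flagged plus a random sample of what it passed. The name `build_review_queue`, the 10% default sample rate, and the shape of `results` (item id mapped to the grader's pass/fail verdict) are assumptions for illustration.

```python
from __future__ import annotations

import math
import random
from typing import Optional


def build_review_queue(
    results: dict[str, bool],      # item id -> True if the grader passed it
    sample_rate: float = 0.1,      # share of passed items to spot-check
    seed: Optional[int] = None,    # set for a reproducible sample
) -> list[str]:
    """Return flagged items first, followed by a random sample of passed items."""
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0 and 1")

    flagged = [item for item, passed in results.items() if not passed]
    passed_ids = [item for item, passed in results.items() if passed]

    rng = random.Random(seed)
    # Round up so a non-zero rate still samples something from a small batch.
    k = math.ceil(len(passed_ids) * sample_rate)
    sampled = rng.sample(passed_ids, k)

    return flagged + sampled
```

The sample of passed items is what keeps the grader honest over time: human verdicts on those items are the raw material for checking whether the grader is letting weak work through.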

Keep humans on the subjective calls. Use the grading loop for the objective, checkable dimensions of quality and keep humans firmly on the dimensions that need taste, strategy, or context. Drawing that line explicitly is what makes the system trustworthy rather than quietly overreaching.

The reason agentic AI has needed a human at the end of nearly every workflow was never really that the agents couldn't do the work. It was that they couldn't be trusted to know when the work was bad. Outcomes addresses that with an idea no organization would find strange in any other setting: don't let the doer be the only judge. Separate the roles, give the judge an independent standard, and let the system iterate until the standard is met. It is a modest-sounding feature. But the gap between an agent you must check and an agent that checks itself is most of the distance between an impressive demo and a process you can actually hand over.
