WAVES: workers, aggregate, verify, extend

An agent's answer is a claim until you check it

An AI agent will always tell you the job is done. WAVES takes that report as a claim and checks the evidence behind it before it counts. You split the work across a team of agents, each one owns a slice, and nothing an agent returns is trusted until it holds up.
A worker hands back a claim. The claim goes through a check. What holds up becomes a verified finding, and what does not goes back.
WAVE is the shape: Workers, Aggregate, Verify, Extend. You fan out workers across independent slices, aggregate the structured results they return, verify the evidence behind them, and extend into another wave when the work calls for one.
One bounded round: fan out, aggregate the handoffs, verify the evidence, then decide whether to run another wave or deliver.
It ships as a skill tuned for Cursor and for Codex, and the method carries over to Claude, Droid, and any other agent you run.

Where this comes from

One agent, then a harness with a loop, then parallel orchestration, then waves.
It helps to see the path, because each step solved the problem the last one left behind.
It started with one agent: you ask, it answers, and that is it. Then the agent got a harness, a model in a loop with tools, so it could read, act, look at the result, and go again. State moved out to files and git, which let a run keep going past a single context. That is where the loop techniques came from, a way to run long autonomous tasks while sidestepping context rot, with a test deciding whether to keep going. Loops are a real tool and they are good at what they do.
Then people wanted speed, so they fanned the work out to many agents at once and merged the results. That is parallel orchestration, and it is the part most people picture when they hear "agents in parallel."
A wave keeps that fan-out and changes what happens when the work comes back. The merge stops being plumbing and becomes the main event, because every result gets checked before it counts. That check is the whole difference, and the rest of this is about it.

Verification is the mode

The move that defines WAVES is treating every worker's result as a claim, and then checking it. What checking means is up to the agent driving the wave. It can run a test or a script. It can launch a browser and drive the actual UI. It can recount a number straight from the source. It can cross-check one worker's claim against another's. For the claims that carry the most weight, it can hand the claim and its sources to a separate verifier whose only job is to check.
The agent driving the wave chooses the check that fits the claim.
This is what makes it a mode and not a trick. The agents still explore freely, because language models are good at exploring and they are not deterministic machines. The discipline lives at the boundary, where a claim either becomes a verified finding or goes back.

The handoff is a claim

Every worker returns one structured handoff, and that handoff is the only thing the wave reads back from it. The prompt that goes out is a contract, because a worker cannot ask a follow-up question:
    Goal (context only): the overall goal, so the worker can orient
    Your slice: the one disjoint range, area, or set of paths you own
    Where to look: dirs, files, data ranges, which tools to use
    Return: the handoff below, and nothing else
    Self-verify: cite or drop every claim, tag confidence, say what you could not verify
    Out of scope: what a sibling worker owns, so you do not overlap
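Here is a minimal sketch of how an orchestrator might fill that contract for each worker. The `WorkerBrief` dataclass and `render_brief` helper are hypothetical names for illustration and are not part of the skill. The only rule it enforces is that no slot goes out empty, since the worker cannot ask what you meant.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkerBrief:
    """Hypothetical container for the slots that vary per worker."""
    goal: str
    slice_: str
    where_to_look: str
    out_of_scope: str


def render_brief(brief: WorkerBrief) -> str:
    """Render one worker prompt, refusing to send a contract with an empty slot."""
    for name, value in vars(brief).items():
        if not value.strip():
            raise ValueError(f"contract field {name!r} is empty")
    return "\n".join([
        f"Goal (context only): {brief.goal}",
        f"Your slice: {brief.slice_}",
        f"Where to look: {brief.where_to_look}",
        "Return: the handoff below, and nothing else",
        "Self-verify: cite or drop every claim, tag confidence, "
        "say what you could not verify",
        f"Out of scope: {brief.out_of_scope}",
    ])
```

The `Return` and `Self-verify` lines stay fixed across workers, so only the slice, the pointers, and the neighbor boundaries change from one prompt to the next.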
What comes back has a fixed shape, so many isolated workers can converge without ever talking to each other:
    ## Status
    success | partial | blocked

    ## Coverage
    - Read: 388/388 rows
    - Skipped: none

    ## Key findings
    - [high] sponsor logic lives in webhook.ts:42 — evidence: webhook.ts:42 — sources: 2
    - [med] pricing claim — evidence: <docs URL>

    ## Confidence & verification
    - Verified (re-ran / cross-checked / recounted): which findings, and how
    - Single-sourced / unresolved: which findings I could not confirm

    ## Open questions / Suggested follow-ups
    - candidate tasks for the next wave
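Because the shape is fixed, the orchestrator can read a handoff mechanically. The sketch below assumes the handoff arrives as plain text with the `## ` headings shown above. The names `split_sections`, `parse_findings`, and `uncited` are mine, for illustration. It is one way to do the reading and should not be taken as how the skill does it.

```python
import re
from dataclasses import dataclass


@dataclass
class Finding:
    confidence: str     # the [tag] on the line, e.g. "high" or "med"
    text: str           # the claim together with its inline evidence
    has_evidence: bool  # True when the line carries an "evidence:" marker


def split_sections(handoff: str) -> dict[str, list[str]]:
    """Group the non-empty lines of a handoff under their '## ' heading."""
    sections: dict[str, list[str]] = {}
    current = None
    for raw in handoff.splitlines():
        line = raw.strip()
        if line.startswith("## "):
            current = line[3:].strip()
            sections[current] = []
        elif line and current is not None:
            sections[current].append(line)
    return sections


def parse_findings(sections: dict[str, list[str]]) -> list[Finding]:
    """Pull the tagged bullets out of the 'Key findings' section."""
    findings = []
    for line in sections.get("Key findings", []):
        match = re.match(r"-\s*\[(\w+)\]\s*(.+)", line)
        if match:
            tag, text = match.group(1), match.group(2)
            findings.append(Finding(tag, text, "evidence:" in text))
    return findings


def uncited(findings: list[Finding]) -> list[Finding]:
    """Findings with no inline evidence: candidates to send back, not to trust."""
    return [f for f in findings if not f.has_evidence]
```

Note that nothing here looks at the `Status` line to decide trust. It only checks whether each finding brought its evidence with it.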
A success at the top counts for nothing on its own. Each finding carries its evidence inline, a file and line or a URL, and a confidence tag. For the claims that matter, a verifier gets the claim and its sources, and not the original worker's reasoning, so it cannot inherit the same mistake and wave it through. It returns a verdict per claim:
    ## Verdict per claim
    - <claim> → supported | partly | unsupported | source-not-found
      - evidence: the quote, file:line, or metric that settles it

    ## Overall
    accept | revise | reject
The worker returns one handoff. The orchestrator reads it and routes the high-stakes claims to a verifier.
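The isolation in the verifier step is easy to get wrong, so here is a small sketch of it. `WorkerFinding`, `verifier_packet`, and `overall` are hypothetical names. The roll-up rule in `overall` (the worst verdict wins) is my assumption. The text above only names the possible verdicts and does not say how they combine.

```python
from dataclasses import dataclass


@dataclass
class WorkerFinding:
    claim: str
    sources: list[str]
    reasoning: str = ""  # the worker's own reasoning; never forwarded


def verifier_packet(finding: WorkerFinding) -> dict:
    """Build what the verifier sees: the claim and its sources, nothing else."""
    return {"claim": finding.claim, "sources": list(finding.sources)}


VERDICTS = {"supported", "partly", "unsupported", "source-not-found"}


def overall(verdicts: list[str]) -> str:
    """Assumed roll-up rule (not from the skill): the worst verdict wins."""
    unknown = set(verdicts) - VERDICTS
    if unknown:
        raise ValueError(f"unknown verdicts: {sorted(unknown)}")
    if not verdicts:
        return "revise"  # nothing was checked, so nothing can be accepted
    if any(v in ("unsupported", "source-not-found") for v in verdicts):
        return "reject"
    if "partly" in verdicts:
        return "revise"
    return "accept"
```

Leaving `reasoning` out of the packet is the point: a verifier that never reads the worker's argument has to go back to the sources to reach a verdict.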

What I saw running Fable

I tuned this watching the Fable model run inside the Claude Code agentic harness, and the thing that stuck with me is where it chose to spend its compute. Most of it went to checking, not to producing the answer.
Its strongest habit showed up before any work fanned out. It would check the decomposition first: print the counts, confirm the slices summed to the total, and catch a missing chunk while it was still cheap to fix. A missing chunk is a silent blind spot, and it caught them up front.
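That decomposition check is simple enough to sketch. This version assumes the work is a run of rows numbered from zero and that each slice is a half-open `(start, end)` range. Both are assumptions for illustration, and `check_decomposition` is a hypothetical helper that is not part of the skill.

```python
def check_decomposition(slices: list[tuple[int, int]], total: int) -> list[str]:
    """Return a list of problems; an empty list means the slices tile [0, total)."""
    problems = []
    cursor = 0  # first row that no slice has claimed yet
    for start, end in sorted(slices):
        if end <= start:
            problems.append(f"empty or inverted slice {start}-{end}")
            continue
        if start > cursor:
            problems.append(f"gap: rows {cursor}-{start} are unowned")
        elif start < cursor:
            problems.append(
                f"overlap: rows {start}-{min(cursor, end)} have two owners"
            )
        cursor = max(cursor, end)
    if cursor < total:
        problems.append(f"gap: rows {cursor}-{total} are unowned")
    elif cursor > total:
        problems.append(f"slices run past the total ({cursor} > {total})")
    # Print the counts so the sum is visible before any worker launches.
    assigned = sum(end - start for start, end in slices if end > start)
    print(f"{len(slices)} slices, {assigned} rows assigned, {total} expected")
    return problems
```

Run it before the fan-out. A non-empty list means fixing the split, which costs almost nothing at that point compared with finding the hole after the workers report back.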
It also would not call a deliverable done until it ran it. It would curl the thing locally and in production, check the status and the title, and regression-check the routes around it. And it re-read its own writes instead of assuming they landed, because skipping that is how you get an edit-before-read failure. Those habits are in the skill now.

Spend the check where it counts

You do not check everything the same amount. You check hardest where being wrong would cost the most, and you let the cheap, already-agreed findings through. Checking a claim costs far less than producing it, so this is where the budget should go.
Findings sort into tiers by stakes. Most are cheap, and the budget lands on the few that matter.
So the findings sort into tiers. A low-stakes one that two workers already agree on gets accepted as is. A medium-stakes one gets a single verifier. A high-stakes one gets a panel of models, judged and reconciled into one answer. A contested one with no clear ground truth goes to a short debate, where the facts both sides quietly drop are usually the false ones. Most findings sit at the cheap end, so the real budget lands on the small share the whole deliverable rests on, usually around a fifth of them.
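The tiers above can be sketched as a routing function. The field names and the `route` helper are hypothetical. Two choices in it are my assumptions: a contested finding goes to debate whatever its stakes, and a low-stakes finding that only one worker reported still gets a single check.

```python
from dataclasses import dataclass


@dataclass
class TieredFinding:
    claim: str
    stakes: str              # "low", "medium", or "high"
    agreeing_workers: int    # how many workers reported it independently
    contested: bool = False  # workers disagree and no ground truth settles it


def route(finding: TieredFinding) -> str:
    """Pick the check that matches the cost of being wrong."""
    if finding.stakes not in ("low", "medium", "high"):
        raise ValueError(f"unknown stakes: {finding.stakes!r}")
    if finding.contested:
        return "debate"           # assumption: contested wins over stakes
    if finding.stakes == "high":
        return "panel"
    if finding.stakes == "medium":
        return "single-verifier"
    if finding.agreeing_workers >= 2:
        return "accept"           # low stakes and already agreed
    return "single-verifier"      # assumption: low stakes but single-sourced
```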
The check is also what ends a wave and starts the next. When the verified findings are in and a fresh wave turns up nothing new, the wave is done. When the check exposes a gap or a conflict, that becomes the next wave, and it spends the findings you already trust. The verifier is the stop function, and it is the steering.
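Read as a decision rule, that paragraph looks roughly like the sketch below. `next_step` is a hypothetical helper, findings are represented as plain strings for simplicity, and the branch for "new findings but no gaps" is my reading of the text.

```python
def next_step(
    trusted: set[str],
    wave_verified: set[str],
    gaps: list[str],
) -> tuple[str, set[str]]:
    """Decide what follows a wave. Returns (decision, updated trusted set)."""
    new = wave_verified - trusted
    updated = trusted | wave_verified
    if gaps:
        # A gap or conflict exposed by the check seeds the next wave.
        return "extend", updated
    if not new:
        # A fresh wave turned up nothing new, so the run is done.
        return "deliver", updated
    # Assumption: new findings with no gaps still earn one more wave,
    # since "nothing new" has not been observed yet.
    return "extend", updated
```

The updated trusted set is returned either way, because the next wave starts from what has already been verified.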

Where a wave fits

Two things make the fan-out worth it.
The first is context. Chroma's research on context rot tested eighteen frontier models and found every one grows less reliable as its input gets longer, even on simple tasks, and well before the window is full. A single agent carrying a long job piles up everything it reads, and the quality drifts as the pile grows, so a rough patch early gets carried forward and compounds. A wave keeps each worker in its own small context and brings back a verified finding, so a bad slice gets caught at the boundary.
The second is the shape of the work. A loop fits work that converges on a clear check, like a bug with a failing test or a migration with a known target. A wave fits research, analysis, audits, and multi-stream work, where no single test says "done" and the verifier carries the weight a test would.
A wave is also more than parallel orchestration. Standard fan-out and fan-in is built for speed: split the work, run the pieces at once, and merge the outputs with a reducer. The fan-out is the easy part. A wave makes the merge the main event. You reconcile conflicts rather than averaging them, you carry each claim's confidence forward, and you bound the run so it stays checkable.

Wave shapes

A wave is not one move, and a few shapes cover most of what you will send it.
Pick the shape that matches the problem you are sending in.
When the problem is unmapped, a broad first wave goes wide to find the edges. The next wave narrows, killing the dead ends and going deep where there is signal, spending the last wave’s verified findings. When you need a shared foundation, a small wave writes a spec, you verify it, and a bigger wave builds against it. When you want coverage and a cross-check, you fan workers out in different directions on the same question and verify across what they bring back.
A wave is bounded on purpose. Three to eight workers, sized so you can verify all of them. Two to three waves deep. Around sixty percent of the effort goes to generating and forty to verifying, because picking the right answer is the scarce part.
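Those bounds can be written down as a guard the orchestrator runs before each wave. This is a sketch under stated assumptions: `WaveBounds` and `plan_wave` are hypothetical names, and measuring "effort" as a token budget is my choice for illustration, since the text does not say what unit the sixty-forty split is in.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WaveBounds:
    min_workers: int = 3
    max_workers: int = 8
    max_waves: int = 3
    verify_share: float = 0.40  # the remaining 0.60 goes to generating


def plan_wave(
    bounds: WaveBounds, workers: int, wave_number: int, token_budget: int
) -> dict[str, int]:
    """Refuse a wave that breaks the bounds, then split the budget."""
    if not bounds.min_workers <= workers <= bounds.max_workers:
        raise ValueError(
            f"{workers} workers is outside "
            f"{bounds.min_workers}-{bounds.max_workers}"
        )
    if not 1 <= wave_number <= bounds.max_waves:
        raise ValueError(f"wave {wave_number} exceeds depth {bounds.max_waves}")
    verify = round(token_budget * bounds.verify_share)
    return {
        "generate": token_budget - verify,
        "verify": verify,
        "verify_per_worker": verify // workers,
    }
```

For example, `plan_wave(WaveBounds(), workers=5, wave_number=1, token_budget=100_000)` sets aside 40,000 tokens for verification, or 8,000 per worker's handoff.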

Why this is where things are heading

You cannot make the model itself smarter at inference time. The training is done, the weights are fixed, and you do not get to touch the data that shaped them. Everything you can change is in how you use the model at inference: how you frame and split the work, and how you check what comes back.
That is why verification is where the gains are. A check is something you can add right now, without retraining anything, and it pays off because checking a claim is so much cheaper than producing it. It is also why Anthropic and others lean on verification inside their own systems, with rubric judges, citation agents that re-attribute every claim to a source, and panels of models that check each other. The model generates, a separate pass checks, and the checking is where a lot of the reliability comes from. WAVES is that idea turned into a workflow you can run today.

The takeaway

A wave is one tool among several, and the shape of the work decides when to reach for it. Reach for a wave when a goal splits into independent pieces and a wrong answer would cost you enough that you want to check before you trust.
Everything in a wave comes back to one line: a claim is not evidence. The agents explore, you verify at the boundary, and what survives the check is what you build on. That boundary is the whole method.

Further reading

The thinking here builds on work I kept coming back to:
  • SAFE (fact-by-fact verification against search)

Get started

Cursor:
npx skills add https://github.com/RayFernando1337/rayfernando-skills/tree/main/plugins/waves/skills/waves -a cursor
Codex:
codex plugin marketplace add RayFernando1337/rayfernando-skills
codex plugin add waves-codex@rayfernando-skills
Claude Code: /plugin install waves@rayfernando-skills, or /plugin install waves-codex@rayfernando-skills for the Codex-tuned version.
Run it with /waves. It is opt-in, because a run spawns more agents than a normal task. Point it at anything that splits into independent pieces, like researching several options to build a roadmap, or auditing a repo.
I have shipped the skill and I have not written evals yet. Use it on real work, and tell me where it holds and where it breaks. Evals and token efficiency come next, and the method sharpens the more real work runs through it.
A quick thank you to the Cursor team. A lot of the research credits behind this came from them, and I have been running waves in their agentic harness across my own apps to test the idea against real work. It has held up well, and I am glad to give them the shout-out.