
Building an Autonomous Coding-Agent Fleet

AI agents · Claude Code · Engineering

Most teams using AI to write code are still doing it one prompt at a time: a developer in the loop, an assistant in a side panel, a single thread of work. That’s useful. But it caps the leverage at one human’s attention.

The next step up is a fleet — multiple agents working in parallel, each owning a scoped task, coordinated by an orchestrator, with a human reviewing outcomes instead of keystrokes. I’ve built and run these. Here’s what actually matters when you do.

A fleet is an orchestration problem, not a model problem

The temptation is to treat this as “get a smarter model and let it run longer.” In practice, the model is rarely the bottleneck. The bottleneck is everything around it: how work is decomposed, how agents stay out of each other’s way, and how you know the output is correct without reading every line.

A workable fleet has three layers:

  • An orchestrator that breaks a goal into independent units of work and hands each to a worker. It holds the plan, not the implementation details.
  • Worker agents that each take one scoped task — a file, a feature, a migration target — and run it to completion in isolation.
  • A verification layer that checks each result before it’s allowed to land.

The orchestrator should stay thin. The more logic you push into deterministic code — what to fan out, what to retry, what to merge — the more predictable the whole system becomes. Let the agents handle the open-ended judgment; let the harness handle control flow.
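As a rough illustration of that split, here is a minimal sketch of a thin harness. It assumes the goal has already been decomposed into a list of task strings, and that `worker` and `verify` are callables you supply (for example, wrappers around whatever agent runtime you use). The names `run_fleet`, `Result`, `max_attempts`, and `max_parallel` are hypothetical and not part of any particular tool.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Result:
    task: str
    output: str
    accepted: bool
    attempts: int


def run_fleet(
    tasks: List[str],
    worker: Callable[[str], str],
    verify: Callable[[str, str], bool],
    max_attempts: int = 2,
    max_parallel: int = 4,
) -> List[Result]:
    """Fan tasks out to workers, verify each output, retry on rejection."""

    def run_one(task: str) -> Result:
        output = ""
        for attempt in range(1, max_attempts + 1):
            output = worker(task)        # open-ended judgment lives in the agent
            if verify(task, output):     # deterministic code decides what lands
                return Result(task, output, True, attempt)
        return Result(task, output, False, max_attempts)

    # Control flow (fan-out, retry limit, ordering) is plain code, not a prompt.
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        return list(pool.map(run_one, tasks))
```

Everything the harness decides here (how many workers, how many retries, what counts as accepted) is ordinary code you can read and test, which is the point of keeping the orchestrator thin.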

Isolation is what makes parallelism safe

The first thing that breaks when you run agents in parallel is shared state. Two agents edit the same files, step on each other’s changes, and produce a merge conflict neither of them understands.

The fix is boring and essential: give each agent its own workspace. A separate git worktree, a separate branch, a separate scratch directory. An agent should be able to fail, loop, or produce garbage without affecting any other agent’s work. When it finishes, you integrate deliberately — you don’t let agents integrate themselves into a shared tree.

This costs a little setup time per agent, and it’s worth it every time. Parallelism without isolation isn’t faster; it’s just a harder-to-debug version of serial work.
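One way to set this up, sketched under some assumptions: each agent gets a new branch and a git worktree in a sibling directory of the repository. The branch naming scheme (`agent/<id>`), the `<repo>-worktrees` directory, the default base of `main`, and the function names are all illustrative choices, not a convention from any tool. The only external command used is `git worktree add -b <branch> <path> <commit-ish>`.

```python
import subprocess
from pathlib import Path
from typing import List


def worktree_command(repo: Path, agent_id: str, base: str = "main") -> List[str]:
    """Build the git command that creates an isolated worktree for one agent."""
    repo = repo.resolve()
    branch = f"agent/{agent_id}"  # hypothetical naming scheme
    # Keep worktrees outside the main checkout so agents never share a tree.
    path = repo.parent / f"{repo.name}-worktrees" / agent_id
    return ["git", "-C", str(repo), "worktree", "add", "-b", branch, str(path), base]


def create_workspace(repo: Path, agent_id: str, base: str = "main") -> Path:
    """Create the worktree and return its path; raises if git fails."""
    cmd = worktree_command(repo, agent_id, base)
    subprocess.run(cmd, check=True)
    return Path(cmd[-2])
```

Each agent then runs with its working directory set to the returned path. Integration happens later as a separate, deliberate step (a review and merge of the agent’s branch), not from inside the agent.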

The verification layer is the part everyone underbuilds

Here’s the uncomfortable truth: a confident, well-written, completely wrong answer is the default failure mode of an AI agent. The plausibility of the output scales faster than its correctness.

So the question that decides whether a fleet is useful or dangerous is: how does work get verified before it lands?

A few patterns that hold up:

  • Adversarial verification. Don’t ask “is this right?” — ask a separate agent to refute the change, defaulting to rejection when uncertain. A claim that survives a skeptic is worth more than one that survives its author.
  • Diverse lenses. When something can fail in more than one way, give each verifier a distinct angle — correctness, security, does-it-actually-run — instead of running the same check three times.
  • Deterministic gates. Tests, type checks, and builds are non-negotiable. An agent’s self-report that “everything passes” is a hypothesis; the CI result is the evidence.
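These three patterns can be combined into one landing decision. The sketch below is an assumption-laden illustration, not a reference implementation: `gate_commands` stands in for whatever test, type-check, and build commands your project uses, and each entry in `skeptics` is a hypothetical callable that wraps a separate verifier agent for one lens and returns a `Verdict`.

```python
import subprocess
from enum import Enum
from typing import Callable, Dict, Sequence


class Verdict(Enum):
    ACCEPT = "accept"
    REJECT = "reject"
    UNCERTAIN = "uncertain"


def gates_pass(commands: Sequence[Sequence[str]], cwd: str = ".") -> bool:
    """Run each gate command; any nonzero exit code fails the change."""
    for cmd in commands:
        if subprocess.run(cmd, cwd=cwd, capture_output=True).returncode != 0:
            return False
    return True


def may_land(
    diff: str,
    gate_commands: Sequence[Sequence[str]],
    skeptics: Dict[str, Callable[[str], Verdict]],
    cwd: str = ".",
) -> bool:
    """Decide whether a proposed change is allowed to land."""
    # Deterministic gates first: the exit code is the evidence,
    # not the agent's self-report.
    if not gates_pass(gate_commands, cwd):
        return False
    # One skeptic per lens (e.g. correctness, security). Every lens must
    # explicitly accept; UNCERTAIN is treated as rejection. With no skeptics
    # configured, nothing independent has confirmed the change, so reject.
    return bool(skeptics) and all(
        check(diff) is Verdict.ACCEPT for check in skeptics.values()
    )
```

The design choice worth noting is the default: the function only returns `True` when every independent check says yes, so a verifier that hedges blocks the change instead of waving it through.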

The rule I keep coming back to: an agent’s output is a proposal, never a fact, until something other than the agent that produced it has confirmed it.

Knowing when not to fan out

A fleet is the right tool when the work decomposes cleanly into independent pieces: auditing a large codebase, migrating dozens of call sites, reviewing a change across several dimensions, sweeping for a class of bug. The wins there are real and large.

It’s the wrong tool when the work is one tightly coupled change where every part depends on every other part. Splitting that across agents just adds coordination overhead and integration risk. One capable agent, or one capable human, is faster.

Part of the engineering judgment is recognizing which kind of problem you have before you spin up ten agents to solve a one-agent task.
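One crude way to make that call mechanical is to estimate which files each candidate task will touch and group tasks that overlap. This is only a heuristic sketch: file overlap is an assumed proxy for coupling (it misses dependencies that cross files), and the names `independent_groups`, `should_fan_out`, and `min_groups` are hypothetical.

```python
from typing import Dict, List, Set


def independent_groups(touched: Dict[str, Set[str]]) -> List[List[str]]:
    """Group tasks that share any file; each group belongs to a single agent."""
    parent = {task: task for task in touched}

    def find(task: str) -> str:
        # Union-find lookup with path halving.
        while parent[task] != task:
            parent[task] = parent[parent[task]]
            task = parent[task]
        return task

    owner: Dict[str, str] = {}  # file -> first task seen touching it
    for task, files in touched.items():
        for name in files:
            if name in owner:
                parent[find(task)] = find(owner[name])  # merge coupled tasks
            else:
                owner[name] = task

    groups: Dict[str, List[str]] = {}
    for task in touched:
        groups.setdefault(find(task), []).append(task)
    return sorted(sorted(group) for group in groups.values())


def should_fan_out(touched: Dict[str, Set[str]], min_groups: int = 2) -> bool:
    """Fan out only if the work splits into at least `min_groups` pieces."""
    return len(independent_groups(touched)) >= min_groups
```

If every task collapses into one group, the work is effectively a single coupled change, and a single agent (or a single person) is the better fit.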

Where this is going

The teams getting real leverage out of AI right now aren’t the ones with the best prompts. They’re the ones who have built the system around the model — the decomposition, the isolation, the verification — so that they can trust the output enough to stop reviewing every line and start reviewing outcomes.

That’s the shift: from AI as a faster way to type, to AI as a fleet you supervise. It’s a meaningful amount of infrastructure to get right, and it’s where a lot of the near-term value in engineering organizations is going to come from.

If you’re trying to figure out where agents fit in your team’s workflow — or you’ve tried and hit the correctness wall — that’s exactly the kind of problem I help teams work through.