How to Outship Teams 10x Your Size

The bottleneck in software engineering is no longer the typing. It’s the intent: knowing what to build, specifying it precisely enough that an agent can execute it, and verifying that the result actually works. A small team that understands this shift will outproduce a large team that doesn’t.

Some of the companies we compete with have engineering teams ten or a hundred times the size of ours. That used to be a disadvantage. It isn’t anymore.

Some of what follows will sound uncomfortable. Some of it should.

We run a factory floor now

In 1896, Sakichi Toyoda built a loom that could detect when a thread broke and stop itself automatically. This was not a faster loom. It was a different kind of loom: one that separated the act of production from the act of quality judgment. A single worker could now oversee dozens of machines instead of watching one, because the machines would signal when they needed human attention. Toyoda called the principle jidoka, sometimes translated as “autonomation,” automation with a human touch.

Taiichi Ohno later made jidoka one of the two pillars of the Toyota Production System (alongside just-in-time manufacturing), and the insight underneath it is worth stating precisely: the highest-leverage thing a floor manager does is not pick up a wrench. It is spot the bottleneck, reroute work, and keep the machines running. The wrench work is important. But the moment you confuse the wrench work for the job, you have misidentified where value is created.

I think software engineering is going through this transition right now, and most teams haven’t noticed.

When a coding agent (Claude Code, Cursor, whatever you prefer) can produce a correct, idiomatic, test-passing implementation of a well-specified task in minutes, the act of typing code is no longer the bottleneck. It is production work. Important production work, the way Toyoda’s looms still needed to weave thread, but production work nonetheless. The bottleneck has moved upstream, to the specification of what to build, and downstream, to the verification of whether it actually worked. The person who can run five agent sessions in parallel, specifying tasks clearly, reviewing output critically, and routing work around blockers, will outproduce the person who writes beautiful code one function at a time. Not because they’re a better engineer in any traditional sense, but because they’ve correctly identified where the constraint is.

Running multiple agent sessions at once is a skill. It takes practice and it’s hard. But that is the skill we are building now, and it is what will separate a team of ten from a team of one hundred.

Each of us owns a domain, not a layer

The traditional way to organise an engineering team is by technical specialty. Frontend engineers, backend engineers, infrastructure engineers, each defined by the layer of the stack they inhabit. This made sense when the scarce resource was deep expertise in a particular technology. If your React specialist is the only person who can build the payment form, you need them focused on React.

Agents change the equation. The agent is the specialist who knows TypeScript and Python and SQL and whatever else the task requires. It has read the docs more recently than you have. (It has, in fact, read all the docs, which is more than can be said for most of us.) What the agent cannot do is decide what to build, why it matters, and whether the result actually serves the business need. That requires understanding the domain: the payments flow, the search experience, the integration surface, the user’s actual problem.

So we’ve restructured around domains, not layers. Each engineer owns a business domain end-to-end. They write the spec, run the agents, review the output, and ship it. The same person who decides “we need retry logic on the webhook handler” also verifies that the retry logic behaves correctly in production. The feedback loop is one person wide, and that is the point.

The architecture that makes this work is boundaries. The only thing that matters between my domain and yours is the contract: the API, the types, the interface. Inside my domain, I move however I want. I can refactor freely, change implementations, let agents restructure entire modules. But the boundary between my domain and yours gets reviewed by both of us, because that’s where integration risk lives.

No cross-domain imports. No reaching into another domain’s database. These rules sound restrictive, and they are. They are also what makes it possible for ten people (and their agents) to move fast without constantly breaking each other’s work. Constraints that enable speed are not restrictions. They are infrastructure.
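The no-cross-domain-imports rule only works if it is checked by a machine, not by memory. As a minimal sketch (assuming a hypothetical layout of one top-level package per domain under a source root, plus a shared “contracts” package for the published interfaces), a CI check could look like this:

```python
import ast
from pathlib import Path

# Hypothetical shared package holding the published interfaces (the "contract").
# Everything else under the source root is assumed to be a domain package.
ALLOWED_SHARED = {"contracts"}

def imported_top_levels(tree: ast.Module):
    """Yield the top-level package name of every absolute import in a module."""
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                yield alias.name.split(".")[0]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            yield node.module.split(".")[0]

def find_violations(src_root: str) -> list[str]:
    """Return one 'file: imports domain' entry per cross-domain import."""
    root = Path(src_root)
    domains = {p.name for p in root.iterdir() if p.is_dir()}
    violations = []
    for path in root.rglob("*.py"):
        owner = path.relative_to(root).parts[0]  # the domain this file lives in
        for top in imported_top_levels(ast.parse(path.read_text())):
            # Flag imports that reach into a sibling domain's internals.
            if top in domains and top != owner and top not in ALLOWED_SHARED:
                violations.append(f"{path.relative_to(root)}: imports {top}")
    return violations
```

A non-empty result fails the build. That failure mode is the whole point: the boundary is a rule the pipeline enforces, not a convention a reviewer has to remember.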

Platform engineers as force multipliers

Two or three of us ship zero features. This is deliberate. Their job is to make everyone else dramatically faster.

Isolated environments so agents can boot and validate the app per change; CI that runs fast enough to keep agents unblocked; agent-first repository knowledge and observability tooling. These are the jigs and fixtures of our factory floor. A factory without good tooling is just a room full of expensive machines producing inconsistent output. A factory with great tooling produces consistent output almost regardless of who’s operating the machine, which is precisely the property you want when the “operator” is an LLM.


When an agent produces bad output, the response is never “try harder.” It is “what guardrail is missing, and how do we make it enforceable?” That question is the platform team’s entire job.

This is the poka-yoke principle from Toyota’s system: mistake-proofing not through diligence but through design. If a standard isn’t enforced in CI, it does not exist. Coverage thresholds, import-boundary linters, complexity limits, architecture tests: these are the automated gates that agents cannot bypass. Agents are remarkably compliant. They will follow every guardrail you set with perfect consistency. They will also produce confidently wrong output if you set no guardrails at all. The agents won’t raise the bar for themselves. The platform team raises it for everyone.
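A complexity limit is one example of a gate an agent cannot talk its way past. As a sketch only (the branch-count proxy and the threshold are illustrative assumptions, not our actual configuration), a gate could parse each changed file and reject any function whose branching exceeds a budget:

```python
import ast

MAX_BRANCHES = 8  # hypothetical budget; tune per codebase

def branch_count(func) -> int:
    """Rough cyclomatic-complexity proxy: count branching nodes in a function."""
    branch_types = (ast.If, ast.For, ast.While, ast.Try, ast.BoolOp)
    return sum(isinstance(n, branch_types) for n in ast.walk(func))

def check_complexity(source: str) -> list[str]:
    """Return the names of functions that exceed the branch budget."""
    offenders = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            if branch_count(node) > MAX_BRANCHES:
                offenders.append(node.name)
    return offenders
```

Wired into CI as a required check, this turns “keep functions simple” from advice an agent may ignore into a constraint it must satisfy before the diff can merge.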

This is where the jidoka parallel is sharpest. Toyoda’s loom didn’t just weave faster. It detected its own defects and stopped. Our CI pipeline doesn’t just build faster. It detects violations of our engineering standards and blocks the deploy. The human isn’t watching every thread. The human designed the machine to watch them.

We review intent. Agents review agents.

Here is the part that makes experienced engineers most uncomfortable.

Code review as we’ve practiced it for the past two decades is a particular ritual: one human reads every line of another human’s diff, leaves comments about naming and edge cases and architectural concerns, and eventually approves. It is a ritual built for a world where humans produce the code and the primary quality mechanism is another human’s careful attention. It works. I’ve spent years doing it and believing in it.

It does not scale when agents produce ten times the output.

The instinct is to say “then we need to review ten times as carefully,” and this is the wrong response. Not because careful review is bad, but because it misidentifies where human review adds the most value. An agent reviewer can catch style inconsistencies, simple bugs, and convention violations at least as well as a human reviewer and quite a lot faster. What it cannot catch is the subtle design flaw, the wrong abstraction, the implementation that technically works but solves the wrong problem. Those require understanding the intent behind the change, and that is a human judgment.

So we’re splitting the review into two parts. Before work starts, we review the spec: is this the right thing to build? Is the task well-defined enough for an agent to execute? Are the acceptance criteria clear? After work completes, we verify behaviour: does the change do what the spec said? Does it integrate correctly? Did the production metrics move in the right direction? The diff in between gets agent review and CI. The human attention goes where only human attention helps.

This is not removing humans from review. It is focusing human review on the two things that actually require humans: intent and outcome. The mechanical middle, “did the code correctly implement the spec,” is increasingly automatable. Insisting that a human eyeball every line is not rigour. It is a failure to distinguish between the parts of the process that need judgment and the parts that need consistency, and consistency is what machines are for.

This should feel uncomfortable

This model asks people to give up activities that have been central to their professional identity.

Nobody writing code by hand. Shipping without line-by-line human code review. Some of us not opening an editor for days. These feel dangerous. Our instincts say this is reckless. Our instincts were built for a world where typing speed was a meaningful factor in engineering output, and that world is ending faster than our instincts can update.

The discomfort is real. I feel it too. But notice what we have more of in this model, not less: more automated quality gates, more architectural boundary enforcement, more verification of production behaviour, more explicit specification of intent before work starts. The guardrails haven’t been removed. They’ve been moved from manual processes (which are inconsistent, tiring, and don’t scale) to automated systems (which are consistent, tireless, and scale with the number of agents you can run).

Shigeo Shingo, who formalised much of Toyota’s production theory, described autonomation as “pre-automation”: not full autonomy, but the stage where machines handle production and signal humans when judgment is needed. He identified twenty-three stages between purely manual work and full automation, and argued that ninety percent of the benefits come from autonomation alone, well before you reach the end of the spectrum. I think software engineering is somewhere around stage four or five of that progression. We are not replacing engineers. We are changing what “engineering” means, from production to judgment, from typing to directing, from reviewing lines to reviewing intent.

The goal is not that each of us does less. It is that each of us does more, reaches further, has more impact than we ever could typing alone. Ten people running a well-tooled factory, each overseeing multiple agents executing against clear specs, with automated quality gates that enforce standards the agents alone would never set, shipping verified, production-tested changes across their entire domain.

There is one piece of this model we haven’t built yet. A factory that produces ten times the output also produces ten times the potential for things to break in production. If your engineering model depends on agents producing the code, it cannot also depend on humans manually investigating every incident, correlating every alert, and remembering which fix worked last time. The incident management loop needs to be as automated as the production loop, or the production loop collapses under its own throughput. That’s the problem we’re working on at Phoebe.

That is how you outship a team ten times your size. Not by typing faster. By recognising that the typing was never the bottleneck.