In 'One Developer Is All You Need' the speedup was wait time, not 4x AI coding

At Itaú Unibanco, a major Brazilian bank, one engineer used four AI agents to finish in nine weeks what a four-person team had been scoped to deliver over 18 weeks. The result comes from a case study posted to arXiv (“One Developer Is All You Need”, arXiv:2605.18461v2).

Read as a headline, the numbers look like “use AI and you’re 4x faster.” The paper says something quite different. The main source of the speedup was not the AI’s coding speed but the disappearance of the person-to-person wait time that inevitably builds up in a four-person team. And the setup only holds with a veteran who knows the codebase inside out, so an ordinary team can’t copy it as-is.

What was built

The target was a digital signature platform. The job was to add five new features to an environment with four microservices and one frontend already running in production. Rather than building from scratch, this was work inside an existing system, requiring integration with an external signature provider, the bank’s internal CI/CD, WCAG 2.1 AA accessibility standards, and compliance with financial regulations.

The work was handled by a staff engineer with 8 years of experience, 4 of them at the same bank. Someone who had both the structure of the codebase and the business rules in their head.

The speedup came from wait time, not coding

Five features and 25 user stories were completed in three sprints (nine weeks). The original estimate was six sprints (18 weeks) for a four-person team.
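To keep the headline numbers straight, here is the arithmetic as a small sketch. It uses only the figures above; the variable names are mine. Note that the one-person figure counts only the engineer driving implementation, so the actual effort is a lower bound.

```python
# Planned vs. actual, using the article's figures.
PLANNED_PEOPLE, PLANNED_WEEKS = 4, 18  # original estimate
ACTUAL_PEOPLE, ACTUAL_WEEKS = 1, 9     # what happened (implementation driver only)

# Calendar time: how much sooner the work shipped.
calendar_speedup = PLANNED_WEEKS / ACTUAL_WEEKS

# Effort in person-weeks. The actual figure excludes downstream reviewers
# (pre-release validation, accessibility sign-off), so it is a lower bound.
planned_effort = PLANNED_PEOPLE * PLANNED_WEEKS
actual_effort = ACTUAL_PEOPLE * ACTUAL_WEEKS
effort_ratio = planned_effort / actual_effort

if __name__ == "__main__":
    print(f"calendar speedup: {calendar_speedup:.1f}x")
    print(f"person-weeks: {planned_effort} planned vs {actual_effort} actual")
    print(f"effort ratio: {effort_ratio:.1f}x")
```

Calendar time halved while headcount dropped to a quarter, so neither number on its own is a “4x faster coding” claim.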

That said, the lead time for a single user story to go from request to production stayed at 15–20 days and did not shrink even as throughput rose. The paper attributes this to lead time being bound by the pre-release validation (homologation) window, accessibility sign-offs, and downstream hand-offs rather than by coding speed. What shrank was the hand-offs and wait time between team members, not the time spent coding.
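One way to see how throughput can rise while per-story lead time stays flat is Little's law (average work in progress = throughput × lead time). The sketch below plugs in the article's figures. Treating the 15–20 days as calendar days and assuming stories were delivered at a steady rate over the nine weeks are simplifications of mine, not claims from the paper.

```python
# Little's law: average WIP = throughput * lead time.
# Figures from the article: 25 stories in 9 weeks, lead time 15-20 days.
# Assumptions (not from the paper): days are calendar days, and stories
# are delivered at a steady rate across the whole nine weeks.

STORIES = 25
WEEKS = 9
LEAD_TIME_DAYS = (15, 20)


def average_wip(stories: int, weeks: int, lead_time_days: float) -> float:
    """Average number of stories in flight implied by Little's law."""
    throughput_per_day = stories / (weeks * 7)
    return throughput_per_day * lead_time_days


if __name__ == "__main__":
    low, high = (average_wip(STORIES, WEEKS, d) for d in LEAD_TIME_DAYS)
    print(f"implied stories in flight: {low:.1f} to {high:.1f}")
```

Under these assumptions, roughly six to eight stories are in flight at once. More stories move in parallel, but each one still waits out the same validation window.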

In an ordinary four-person team, requirements, design review, implementation, security checks, and QA are spread across different people. Every time the work changes hands, a meeting gets scheduled, you wait for a free calendar slot, and review queues form. If someone only works two days a week, things stall for several days right there. This wait time is what piles up into an 18-week schedule.

When one engineer holds the spec and hands tasks to AI agents, this inter-stage wait shrinks from days to minutes. If there’s a question, they answer it on the spot; fixes go in on the spot too. Not having to coordinate calendars mattered more than typing on the keyboard faster.
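The effect of hand-off waits can be sketched with a toy model. Every number below is made up for illustration and none comes from the paper. The only point is that when hands-on time is identical, queue time between stages dominates the elapsed time.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    work_hours: float  # hands-on time spent in the stage
    wait_hours: float  # queue time before the stage can start


def elapsed_hours(stages: list[Stage]) -> float:
    """Total elapsed time: hands-on work plus waiting."""
    return sum(s.work_hours + s.wait_hours for s in stages)


def wait_share(stages: list[Stage]) -> float:
    """Fraction of elapsed time spent waiting rather than working."""
    return sum(s.wait_hours for s in stages) / elapsed_hours(stages)


STAGE_NAMES = ["requirements", "design review", "implementation",
               "security check", "QA"]

# Hypothetical: same 4 hours of work per stage in both setups.
# Team: hand-offs wait hours to days. Solo: hand-offs wait minutes.
team = [Stage(n, 4.0, w)
        for n, w in zip(STAGE_NAMES, [16.0, 24.0, 8.0, 24.0, 16.0])]
solo = [Stage(n, 4.0, 0.1) for n in STAGE_NAMES]

if __name__ == "__main__":
    for label, stages in (("team", team), ("solo", solo)):
        print(f"{label}: {elapsed_hours(stages):.1f} h elapsed, "
              f"{wait_share(stages):.0%} waiting")
```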

Even if code gets written fast, downstream stages remain — QA approval, accessibility checks, final validation in a pre-production environment. In fact, the pre-release final validation was done by a human, and accessibility was signed off feature by feature by a dedicated specialist. “One person” means the human driving the implementation is one person; it doesn’t mean one person handled everything through the final pre-release checks.

The pace across sprints also differed. Sprint 1 was mostly repository creation and infrastructure setup, and zero features were completed. Sprint 2 finished two features, Sprint 3 finished three — accelerating in the back half. That’s because the specs and prompts written early could be reused later, and because the split between core and non-core work stabilized.

How the four AI agent roles were split

Instead of throwing everything at one giant agent, the work is split across four roles. There are three actual tools, with Devin covering two roles: specification and non-core implementation.

flowchart TD
    A[Human] --> B[StackSpot<br/>Requirements]
    A --> C[Devin<br/>Specification]
    A --> D[GitHub Copilot<br/>Core impl]
    A --> E[Devin<br/>Non-core impl]
    B --> F[Spec doc]
    C --> F
    F --> D
    F --> E
    D --> G[CI/CD + human review]
    E --> G

StackSpot organizes the user stories, and Devin writes the spec. Business logic and user-facing screens are driven by a human working alongside GitHub Copilot’s agent mode, while routine work like infrastructure setup, API integration, and message queues is handed to Devin for autonomous execution.
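The split described above can be written as a small routing table. This is only a sketch of the division of labor. The task-kind keys and the `route` function are hypothetical names of mine, and nothing here calls a real API of any of the three tools.

```python
from enum import Enum


class Agent(Enum):
    STACKSPOT = "StackSpot"
    DEVIN = "Devin"
    COPILOT = "GitHub Copilot (agent mode)"


# Hypothetical task kinds, mapped as the article describes the split.
ROUTING = {
    "user_story_organization": Agent.STACKSPOT,
    "specification": Agent.DEVIN,
    "business_logic": Agent.COPILOT,      # core work
    "user_facing_screen": Agent.COPILOT,  # core work
    "infrastructure_setup": Agent.DEVIN,  # non-core, autonomous
    "api_integration": Agent.DEVIN,       # non-core, autonomous
    "message_queue": Agent.DEVIN,         # non-core, autonomous
}

# Core implementation is driven by the human working alongside the agent.
HUMAN_DRIVEN = {Agent.COPILOT}


def route(task_kind: str) -> tuple[Agent, bool]:
    """Return (agent, human_drives) for a task kind."""
    try:
        agent = ROUTING[task_kind]
    except KeyError:
        raise ValueError(f"no routing rule for task kind: {task_kind!r}") from None
    return agent, agent in HUMAN_DRIVEN


if __name__ == "__main__":
    print(route("business_logic"))
    print(route("message_queue"))
```

Raising on an unknown task kind is a deliberate choice in this sketch: work that fits no rule goes back to the human instead of being guessed at.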

Rather than telling the AI “write the code” directly, a spec is written first. Design decisions, API contracts, acceptance criteria, and regulatory constraints are written down as a natural-language spec, and that becomes the agent’s input. The paper calls this approach Spec-Driven Development (SDD). It treats the natural-language spec, not the code, as the primary artifact, so the spec’s quality directly determines the quality of the AI’s output; in effect it pushes test-first TDD further upstream. This resembles the sub-agent split in Claude Code’s multi-agent review, but here it’s the product-development roles themselves that are divided, not PR review.
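A minimal sketch of what “spec as the primary artifact” could look like as a data structure. The four sections come from the description above. The class, the method names, and the rule that an incomplete spec is refused are my assumptions, not the paper's tooling.

```python
from dataclasses import dataclass, field


@dataclass
class Spec:
    """Natural-language spec handed to an agent (hypothetical structure)."""
    title: str
    design_decisions: list[str] = field(default_factory=list)
    api_contracts: list[str] = field(default_factory=list)
    acceptance_criteria: list[str] = field(default_factory=list)
    regulatory_constraints: list[str] = field(default_factory=list)

    def _sections(self) -> dict[str, list[str]]:
        return {
            "Design decisions": self.design_decisions,
            "API contracts": self.api_contracts,
            "Acceptance criteria": self.acceptance_criteria,
            "Regulatory constraints": self.regulatory_constraints,
        }

    def missing_sections(self) -> list[str]:
        """Names of sections that are still empty."""
        return [name for name, items in self._sections().items() if not items]

    def to_prompt(self) -> str:
        """Render the spec as agent input; refuse if any section is empty."""
        missing = self.missing_sections()
        if missing:
            raise ValueError(f"spec incomplete, missing: {', '.join(missing)}")
        lines = [f"# {self.title}"]
        for name, items in self._sections().items():
            lines.append(f"## {name}")
            lines.extend(f"- {item}" for item in items)
        return "\n".join(lines)


if __name__ == "__main__":
    draft = Spec(title="Example feature",
                 acceptance_criteria=["User can sign a document"])
    print(draft.missing_sections())
```

The check only catches empty sections. It cannot tell whether a filled-in section is specific enough, which still takes someone who knows the system.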

When the spec is vague, the AI output breaks too

The paper reports that 90% of AI-generated code was accepted on first review. But that figure holds only when the spec was specific enough.

According to the paper, when the spec was vague, or when implicit rules between existing systems weren’t written into the spec, every tool produced unusable code.

For example, when microservice A sends a request to B, what does A do if B returns an error? Retry, fall back, or just let it drop? Decisions like this are often settled by past incident response and aren’t written in any docs. They live only in the team’s memory or in the code. The AI doesn’t have this kind of tacit knowledge, so it fills in what isn’t in the spec by guessing — and gets it wrong. It’s the same structure as the article on LLM tool-use APIs.
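To make the A-to-B example concrete, here is a sketch of stating the failure decision explicitly instead of leaving it in the team's memory. The policy type, the function name, and the three options as an enum are all hypothetical. The paper does not describe the bank's code.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class OnError(Enum):
    RETRY = "retry"        # try again, then surface the error
    FALLBACK = "fallback"  # use an alternative result
    DROP = "drop"          # give up silently


@dataclass(frozen=True)
class FailurePolicy:
    on_error: OnError
    max_attempts: int = 1  # total attempts; only RETRY uses more than 1
    fallback: Optional[Callable[[], object]] = None


def call_service_b(call: Callable[[], object], policy: FailurePolicy) -> object:
    """Call B from A, handling failure the way the policy says."""
    attempts = max(1, policy.max_attempts) if policy.on_error is OnError.RETRY else 1
    last_error: Optional[Exception] = None
    for _ in range(attempts):
        try:
            return call()
        except Exception as exc:  # broad on purpose: this is only a sketch
            last_error = exc
    if policy.on_error is OnError.FALLBACK and policy.fallback is not None:
        return policy.fallback()
    if policy.on_error is OnError.DROP:
        return None
    assert last_error is not None  # at least one attempt failed to get here
    raise last_error


if __name__ == "__main__":
    def always_fail() -> object:
        raise ConnectionError("B unavailable")

    cached = FailurePolicy(OnError.FALLBACK, fallback=lambda: "cached value")
    print(call_service_b(always_fail, cached))
```

Whichever option is right for a given call, the point is that it appears in the spec as a named choice the agent can follow.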

All 113 integration tests and 65 E2E tests passed, with coverage over 90%. Zero defects were found after the production release; the single defect in the entire project was one accessibility bug found during the pre-release validation stage. The UI judgments that automated tests can’t catch stay with humans.

Why you can’t just copy it

Flip over the conditions that made this setup work, and you can see why it won’t reproduce in many places.

The engineer knew the codebase and the business inside out. 8 years of dev experience, 4 years at the same bank. Looking at code the AI produced, they could instantly notice “this exception handling differs from our internal pattern” or “this way of calling the API will break in production.” Someone with less experience can’t spot the holes in a spec, nor reject the AI’s “looks like it works but breaks in production” code. The reason a four-person team’s review function could be consolidated into one person is that the judgment of four people was in that one person — not that AI took over the review.

Most of the work rode on existing architecture patterns. CI/CD, non-functional requirements, and accessibility standards were already established, so there was little new to decide. Because it was an environment where you could tell the AI “write it following this pattern,” the 90% acceptance rate came out. When the product direction isn’t settled yet, domain knowledge hasn’t accumulated, or high-risk regulatory judgments are involved, you can’t move forward without humans discussing it before handing anything to the AI. In those situations, cutting team headcount itself becomes a risk.

The risk of consolidating everything into one person also remains. Since all the project’s knowledge is in one person’s head, if that person takes time off or quits, everything stops. As a countermeasure, the paper recommends maintaining the spec, decision records, and agent configurations from day one. The authors put forward “a two-person technical pair plus a partially-allocated product owner” as a realistic configuration, but this is a proposal, not an experimental result.

The limits of a single case study

It’s one project at one bank, compared against the same team’s past track record, with no parallel control group. Because the researcher and the practitioner are the same person, confirmation bias can creep in too. The setup cost of the agent configurations, the internal procedural cost, and the long-term maintenance cost weren’t measured. The source code, internal specs, agent configurations, and measurement data are the bank’s confidential assets and aren’t published, so the study can’t be replicated.