
LLMs Can't Be Lazy


Bryan Cantrill (co-founder of Oxide Computer) published “The Peril of Laziness Lost” on April 12.

The thesis: “laziness,” one of Larry Wall’s three programmer virtues, is being lost as LLMs take over code generation. I read it and it hit hard, so here’s a breakdown plus some thought experiments on how to actually operationalize the idea.

Larry Wall’s “Laziness”

Larry Wall, creator of Perl, listed three virtues of a programmer in “Programming Perl” (the Camel Book).

  • Laziness: The motivation to design well now so your future self doesn’t suffer
  • Impatience: The intolerance for slow systems
  • Hubris: The pride that makes you write code you’re not embarrassed to show

Cantrill argues laziness is the deepest of the three.

Laziness isn’t slacking off. You don’t want to do the same thing twice. You hate the thought of future tedium. So you invest effort now to build the right abstraction for yourself and everyone who touches the code after you.

The irony is that being lazy requires enormous effort. Like Rich Hickey’s “hammock-driven development,” finding the optimal abstraction is intellectually brutal.

LLMs Don’t Know “Annoying”

Here’s the core of Cantrill’s argument.

LLMs have no time constraints. Writing code costs them nothing, so the motivation to “just consolidate this already” never arises. The result is endlessly layered code stacked without limit.

Cantrill cites Garry Tan (Y Combinator president) boasting about writing “37,000 lines of code in a day” with LLMs. For reference, the entire DTrace project is about 60,000 lines. Judging software by line count is like judging a novel by page count, Cantrill quips.

When Tan’s project was actually analyzed, the typical symptoms were there:

  • Duplicate test frameworks
  • Unnecessary starter templates
  • A text editor embedded for no apparent reason
  • Eight logo variations

Each problem is fixable, but the root goes deeper. LLMs only add. They don’t delete, consolidate, or minimize on their own.

This shows up in everyday tools too. One GitHub issue, analyzing 17,871 thinking blocks to demonstrate a Claude Code quality regression, found that when thinking depth drops, the Read:Edit ratio collapses and full-file rewrites (Write operations) increase. LLMs prefer “rewrite everything” over “read and patch,” and that’s another driver of code bloat.

The “Writing Engine” and the “Deletion Engine”

From here I’m mixing in my own interpretation.

Organizing Cantrill’s argument, this structure emerges:

| Actor | Characteristics | Strengths |
| --- | --- | --- |
| LLM | Zero time cost, unlimited generation | Writing, expanding, adding volume |
| Human | Finite time, bounded cognitive load | Deleting, consolidating, abstracting |

LLMs are the “writing engine.” Humans are the “deletion engine.”

The key insight is that human time constraints are the source of good design. Because cognitive load has an upper bound, we reject verbose interfaces and pursue simple abstractions. Constraint produces quality. LLMs lack this constraint, so they can’t reach simplicity on their own.

So what does the flow look like when developing with LLM + human?

flowchart TD
    A[Human: Design structure] --> B[LLM: Generate implementation]
    B --> C[Human: Delete and consolidate]
    C --> D{Duplication or<br/>bloat?}
    D -->|Yes| E[Human: Update abstractions]
    E --> B
    D -->|No| F[Done]

    style A fill:#4a9eff,color:#fff
    style C fill:#4a9eff,color:#fff
    style E fill:#4a9eff,color:#fff
    style B fill:#ff6b6b,color:#fff

Step C, “delete and consolidate,” is the most critical. Skip it and stack LLM output directly, and you get the layer cake Cantrill warns about.

At first glance this reads as “use LLMs as tools and protect human laziness.” More precisely: “have LLMs do the tedious work so humans can focus on laziness (= design, deletion, abstraction).” Humans don’t get to be idle; they must do the intellectual work that enables idleness.

A rough division of labor:

| Humans own | LLMs handle |
| --- | --- |
| Boundary definition (responsibility splits, interface design) | Boilerplate generation |
| Detecting and removing duplication | Expanding uniform patterns (CRUD, DTOs, tests) |
| The decision to not write | Prototyping and drafting |
| Simplifying the overall system | Applying known patterns |

Three rules of thumb:

  • When the same logic appears twice, stop and take note
  • When it appears three times, abstract it
  • Read LLM code with the assumption you’ll rewrite it

The third one is quietly the most important. The moment you treat LLM output as a finished product, laziness is lost. Keeping the assumption that output is a draft, not a final form, is what Cantrill means by “remaining lazy.”

Can LLM Self-Review Work?

“Why not have the LLM review its own code?” is the natural next thought. It was a popular workflow for a while, and plenty of people still use it.

Google’s Linux kernel AI review system Sashiko caught 53.6% of known bugs. Meanwhile, CodeRabbit’s research (covered in design principles for production AI coding agents) shows AI-generated code has 1.7x more bugs than human code. The demand for review is real. The question is what kind of review.

Bug detection and type checking work well. “Correctness” checks are LLMs’ strong suit. But “simplicity” checks, i.e., preventing bloat, are weak out of the box. A human feels “ugh, I wrote the same thing twice, this will kill me later if I don’t consolidate.” That “ugh” is the starting point of abstraction. LLMs don’t have it.

Why It Breaks Down

Self-review stops working as the codebase grows.

First, bias toward its own output. An LLM treats its generation flow as a given, so it rarely asks “why this structure?”, “is this file split even correct?”, or “is this abstraction unnecessary?” It checks for bugs but won’t say “maybe none of this is needed.”

Next, context overflow. As code bloats, the model can’t hold the full picture, dependency graph, design intent, and near-identical-but-subtly-different logic all at once. Review degrades to surface-level linting (naming, null checks, thin exception handling), missing the things you actually want caught: design duplication, responsibility leaks, layer violations.

Finally, no “deletion” evaluation axis. Left alone, review becomes “improvement suggestions.” Add patterns, add safety nets, suggest generalization. Review that makes things bigger is self-defeating.

Reframe as “Deletion Audit”

Instead of general code review, restricting the goal to “deletion” dramatically improves accuracy.

| Ineffective prompts | Effective prompts |
| --- | --- |
| Review this code | List duplicate responsibilities |
| Point out problems | Identify unnecessary abstractions |
| Suggest improvements | List classes that can be merged |
|  | Estimate how much code can be deleted while preserving spec |
|  | Suggest refactoring with “no additions allowed” |

The “no additions allowed” constraint is critical. Adding it alone cuts down pointless abstraction proposals significantly.

Another important point: separate the generator and reviewer roles. When the same model in the same conversation self-reviews, it starts defending “past self.” Passing only the diff to a separate thread/instance is more effective at breaking assumption bias.

How to Scope Reviews

Feeding an entire large codebase kills the review. Scope it by concern.

| Scope | What to examine |
| --- | --- |
| Diff review | PR diff only. Most practical |
| Concern review | Auth, DI, routing, DB access, etc. Cut by responsibility |
| Dependency boundary review | Is Controller bypassing Service to touch Repository? Is Domain contaminated by Framework? |
| Metrics review | Top files by line count, method length, import count, duplication rate. Extract with static analysis first |

Don’t review all concerns at once. Splitting into passes (duplication, then responsibility violations, then exception handling, then test gaps, then deletion candidates) improves both accuracy and context management.

Metrics pair especially well with LLMs. Extract top-10 files by line count or import dependency count via static analysis, then ask the LLM “is this file’s responsibility appropriate?” or “should it be split?”
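As a sketch of that static-analysis step (all names here are mine, not from the article, and the regex-based import count is a stand-in for a real parser), a pure function that ranks files by line count and import fan-out, ready to paste into the review prompt:

```typescript
// Sketch: rank files by size and import fan-out before asking an LLM
// "is this file's responsibility appropriate?". Illustrative only.
interface FileMetrics {
  path: string;
  lines: number;
  imports: number;
}

function collectMetrics(files: Record<string, string>): FileMetrics[] {
  return Object.entries(files).map(([path, source]) => ({
    path,
    lines: source.split("\n").length,
    // Counts ES-module import statements; a real tool would use a parser.
    imports: (source.match(/^import\s/gm) ?? []).length,
  }));
}

function topN(
  metrics: FileMetrics[],
  n: number,
  key: "lines" | "imports",
): FileMetrics[] {
  return [...metrics].sort((a, b) => b[key] - a[key]).slice(0, n);
}

// Usage: feed the top-10 list into the metrics-review prompt.
const metrics = collectMetrics({
  "src/auth/TokenResolver.ts": "import x from 'y';\nexport class TokenResolver {}\n",
  "src/controllers/Login.ts": "import a from 'b';\nimport c from 'd';\nconst h = 1;\n",
});
console.log(topN(metrics, 10, "imports")[0].path); // "src/controllers/Login.ts"
```

The point is that the ranking, not the LLM, decides what gets reviewed, which keeps the context small and the questions focused.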

The Commonization Problem When Building From Scratch

Taking Cantrill’s thesis further: when you’re building a project from scratch with LLMs, how do you manage commonization?

Humans naturally think “wait, didn’t I write something like this before?” while coding. They run design awareness, duplication detection, commonization judgment, and deletion decisions in parallel. The “oh, this again” and “this is going to be annoying later” feeling is exactly the “laziness” from earlier.

LLMs are weak at this. They’ll keep writing even when things look similar. They don’t genuinely dread future maintenance.

But aiming for perfect abstraction from the start is also dangerous.

The Premature Commonization Trap

Forcing things into a single abstraction before requirement differences are clear leads to branch-laden messes.

BaseProcessor     <- unclear what it processes
CommonService     <- unclear what's common
AbstractHandler   <- unclear what it handles
UtilManager       <- nothing is clear anymore

If these are the only names you can come up with, commonization is premature.

Conversely, if a natural responsibility name emerges, the abstraction is sound.

TokenResolver          <- resolves auth tokens
RequestSigner          <- signs requests
ErrorResponseFormatter <- formats error responses

If the name doesn’t come naturally, it’s not time yet.

“Allow Two, Commonize at Three”

A well-known heuristic, but with LLMs you need to make it an explicit rule.

Humans unconsciously notice “oh, this again.” LLMs don’t. So you keep a “commonization candidate log” externally.

## Commonization candidates
- Auth token retrieval scattered across Controller A / B (2nd occurrence)
- API error formatting nearly identical in 3 places (-> consider commonizing)
- Pre-save timestamp normalization in multiple locations (2nd occurrence)
- External API retry duplicated per service (2nd occurrence)

You don’t have to fix it on the spot. Just record the detection and decide when the third occurrence hits.

Say you build a login API and a profile API, both reading the token:

// LoginController
const token = req.headers.authorization;

// ProfileController
const token = req.headers.authorization;

Two occurrences. Don’t commonize yet. Just note in COMMON_CANDIDATES.md: “Authorization header retrieval appearing in multiple controllers.”

When a third controller shows up, stop. Now consider commonization.

Good commonization looks like this:

class TokenResolver {
  resolve(req: Request): string | null {
    return req.headers.authorization ?? null;
  }
}

The responsibility name is natural. What it does is obvious from the name alone.
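For contrast, here is how the good version wires in (the handler shapes and the minimal Request type are illustrative, not from the article): both controllers delegate to the one owner of the logic, with no inheritance required.

```typescript
// Minimal illustrative Request type; a real app would use its framework's.
interface Request {
  headers: { authorization?: string };
}

class TokenResolver {
  resolve(req: Request): string | null {
    return req.headers.authorization ?? null;
  }
}

const tokens = new TokenResolver();

// Both controllers now call the single owner of token retrieval.
function loginHandler(req: Request): string | null {
  const token = tokens.resolve(req);
  // ... authenticate with token
  return token;
}

function profileHandler(req: Request): string | null {
  const token = tokens.resolve(req);
  // ... load the profile for token
  return token;
}

console.log(loginHandler({ headers: { authorization: "Bearer abc" } })); // "Bearer abc"
```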

Bad commonization looks like this:

abstract class BaseAuthenticatedController {
  protected getToken(req: Request): string | null {
    return req.headers.authorization ?? null;
  }
  abstract handle(req: Request): Response;
}

LLMs tend to think “common logic means base class,” but this forces every controller into an inheritance hierarchy. Building an inheritance tree just for token retrieval is overkill, and it’s expensive to undo later. A human would feel “is this really worth inheriting over?” and cut a small function instead. LLMs lack that aversion to excess, so they’ll happily grow base classes.

Judgment criteria are simple. Commonize when: 3+ occurrences, you can explain the differences, a natural responsibility name exists, and it won’t create excessive branching. Hold off when: 2 or fewer occurrences, requirement differences aren’t settled, names come out vague, or a base class is the only way to unify.

Track by Responsibility, Not by File

Looking beyond code text to track duplication by processing category improves accuracy.

Input validation, auth/authorization, exception formatting, DTO conversion, pre-persistence processing, external API calls, logging, retry, caching, response formatting.

Keeping this shelf in mind surfaces responsibility duplication that file-name-based tracking misses.
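One way to make this shelf operational is to key the commonization candidate log by responsibility category rather than file name. A sketch (the category names and record shape are my own, not from the article):

```typescript
// Sketch: a machine-checkable view of COMMON_CANDIDATES.md, keyed by
// responsibility category rather than file path. Illustrative only.
type Responsibility =
  | "input-validation" | "auth" | "exception-formatting" | "dto-conversion"
  | "pre-persistence" | "external-api" | "logging" | "retry" | "caching"
  | "response-formatting";

interface Candidate {
  responsibility: Responsibility;
  description: string;
  locations: string[]; // e.g. "LoginController"
}

// "Allow two, commonize at three" becomes a one-line query over the log.
function dueForCommonization(log: Candidate[]): Candidate[] {
  return log.filter((c) => c.locations.length >= 3);
}

const log: Candidate[] = [
  {
    responsibility: "auth",
    description: "Authorization header retrieval",
    locations: ["LoginController", "ProfileController"],
  },
  {
    responsibility: "exception-formatting",
    description: "API error response shape",
    locations: ["LoginController", "ProfileController", "AdminController"],
  },
];
console.log(dueForCommonization(log).map((c) => c.responsibility));
```

Two files that never import each other can still land in the same `auth` bucket, which is exactly the duplication that file-name-based tracking misses.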

Periodic “Deletion Rounds”

Deliberately separate building turns from reducing turns.

  • After building 2 features, do 1 cleanup round
  • After adding 500 lines, do 1 cleanup round
  • After adding 2 new Services, review boundaries

In deletion rounds, add no new features. Only look at duplication, responsibility leaks, similar interfaces, and process the commonization candidate log.

Skip this and you end up with “everything looks similar but everything is different” hell.
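The triggers above are easy to encode so an agent loop can check them mechanically instead of relying on anyone noticing. A sketch with the thresholds from the list (the state shape is mine; tune the numbers per project):

```typescript
// Sketch: mechanical check for whether a cleanup round is due.
// Thresholds mirror the rules of thumb above; adjust to taste.
interface TurnState {
  featuresSinceCleanup: number;
  linesAddedSinceCleanup: number;
  servicesAddedSinceCleanup: number;
}

function cleanupDue(s: TurnState): boolean {
  return (
    s.featuresSinceCleanup >= 2 ||
    s.linesAddedSinceCleanup >= 500 ||
    s.servicesAddedSinceCleanup >= 2
  );
}

console.log(
  cleanupDue({ featuresSinceCleanup: 1, linesAddedSinceCleanup: 480, servicesAddedSinceCleanup: 1 }),
); // false: keep building, no cleanup round yet
```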

Minimum Viable Workflow

Here’s the minimum set to operationalize everything above with LLM agents.

I previously experimented with tmux + Claude Code + Codex to separate Manager/Worker/Reviewer roles, and a later optimization round dug further into context management and meta-layer design. Revisiting that experiment through Cantrill’s “laziness” lens, the missing piece was a mechanism that forces deletion.

Three Files to Start

Create these three files.

| File | Purpose |
| --- | --- |
| ARCHITECTURE.md | Current responsibility inventory. Per layer: what’s allowed, what’s forbidden |
| COMMON_CANDIDATES.md | Commonization candidate log. Locations, occurrence count, whether to commonize now, reasoning |
| RULES.md | Rule set shown to LLMs every time. Layer violations forbidden, Base/Common/Utility forbidden, allow 2 / commonize at 3 |

These three files externalize human design intuition.

  • ARCHITECTURE.md: responsibility memory (“what lives where”)
  • COMMON_CANDIDATES.md: unease memory (“didn’t I see this before?”)
  • RULES.md: judgment guardrails (“what’s not allowed”)

Role Separation

flowchart LR
    M[Manager<br/>Design, accept/reject,<br/>deletion decisions] --> W[Worker<br/>Feature implementation]
    W --> R[Reviewer<br/>Deletion audit]
    R --> M

    style M fill:#4a9eff,color:#fff
    style W fill:#ff6b6b,color:#fff
    style R fill:#ffa94d,color:#fff

Worker implements one use case at a time. Don’t burden it with design decisions. After implementation, have it list “points that could be consolidated with existing responsibilities” and “potential commonization candidates.”

Reviewer is deletion-audit only. No new feature suggestions, no new large abstractions. Only point out duplication, unnecessary structures, and re-homing candidates.

Manager handles accept/reject, updates ARCHITECTURE.md and the commonization candidate log. This can be a human or an LLM, but if it’s an LLM, use a separate thread from Worker/Reviewer. The multi-agent PR review article analyzed the difference between sub-agents and orchestration; the same principle applies. Splitting by role across threads breaks assumption bias better than doing everything in one context.

One Turn Flow

  1. Manager gives Worker: ARCHITECTURE.md + RULES.md + use case requirements
  2. Worker returns implementation proposal with per-file responsibility notes, commonization candidates, open questions
  3. Manager doesn’t adopt directly; passes implementation + all 3 files to Reviewer
  4. Reviewer returns deletion audit: duplication, unnecessary structures, re-homing candidates, candidate log updates
  5. Manager makes final judgment

| Verdict | Criteria |
| --- | --- |
| Accept | No responsibility violations, duplication at 2 or fewer, minimal new structures |
| Revise | Slightly wrong layer placement, processing that could be re-homed to existing responsibilities, unnecessary thin wrappers |
| Reject | Many layer violations, proliferating classes with unclear responsibilities, ignored 3rd-occurrence commonization |

Prompt Examples

When running Codex (Manager) and Claude Code (Worker/Reviewer) via tmux, what you write in the task file makes a significant difference. Here are minimal examples.

Worker implementation task:

## Task: Implement POST /api/auth/login

### Requirements
- Accept email, password
- Return token on auth success, 401 on failure

### References
- ARCHITECTURE.md (responsibility definitions)
- RULES.md (constraints)

### Constraints
- Consolidate with existing responsibilities when possible
- Minimize new classes
- Don't preempt future generalization
- Don't casually create Base/Common/Abstract/Utility

### After implementation, output:
1. Per-file responsibility description (one line each)
2. Points that could consolidate with existing responsibilities but were kept separate
3. Potential commonization candidates

The key is the “after implementation, output” section. Without it, Worker just writes and finishes. This forces verbalization of “wasn’t there something similar before?”

Deletion audit task:

## Task: Deletion audit

### Target
Worker's login API implementation proposal

### References
- ARCHITECTURE.md
- RULES.md
- COMMON_CANDIDATES.md

### Constraints (important)
- Suggesting new features is forbidden
- Suggesting new abstractions is forbidden
- Only output deletion and consolidation proposals
- "No additions allowed. Propose only reduction/consolidation of existing structures."

### Evaluation criteria
1. Are there duplicate responsibilities?
2. Are there unnecessary new classes?
3. Are there processes that should be re-homed to existing layers?
4. Are there items to add to COMMON_CANDIDATES.md?
5. Are there parts that should intentionally be left as-is? (with reasoning)

### Output format
Three-point sets: location, problem, deletion/consolidation proposal.

“No additions allowed” appears twice intentionally. LLMs tend to break constraints stated only once, so it’s reinforced in both the constraints section and the evaluation criteria.

The difference between these two prompts is the line between normal agent workflows and “preserving laziness” workflows. Tell Worker “list commonization candidates” and tell Reviewer “output only deletion proposals, no additions.” Without both, one side stacks things up while the other ignores it.

Common Failure Patterns

Base Classes Sprout

The TokenResolver vs BaseAuthenticatedController example above is exactly this. Explicitly forbid base classes in rules and only allow them with individual approval.

Utility Hell

AuthHelper
CommonUtils
AppHelper

Don’t create components whose responsibility you can’t name. Cut at a granularity where the name alone communicates what it does, like TokenResolver or ApiErrorFormatter.

Worker Pre-generalizes

“To make it reusable in the future,” Worker helpfully adds configurability and plugin mechanisms nobody asked for. LLMs are bad at anticipating future commonization yet love adding “generic-looking” things. Explicitly tell Worker every time: “Don’t preempt future generalization. Limit to the minimum required for this use case.”

Review Adds Instead of Removing

“Should add validation here.” “Exception handling is insufficient.” “Should add log output.”

The deletion audit has become an improvement proposal. Strongly specify in Reviewer’s prompt: “New feature suggestions forbidden. Deletion and consolidation proposals only.”

A close-to-home example: when I cut this site’s CLAUDE.md to less than half, it was the same structure. Adding information is easy, but deleting and organizing requires human judgment. When I told the LLM “improve the CLAUDE.md,” it just added more items. The decision to go from 273 lines to 121 never came from the LLM.


Cantrill’s conclusion distills to one line:

LLMs will certainly prove valuable, but as tools only.

LLMs are valuable as tools. But the moment humans stop being lazy, those tools don’t improve systems, they just inflate them. Trying to guarantee quality through self-review, or trying to feel safe by building frameworks upfront, both share the same root: humans dodging the responsibility to delete. LLMs can write endlessly. That’s exactly why humans must keep making the decisions to not write, to delete, and to consolidate.