Claude Fable 5 vs Opus 4.8, Sonnet 4.6, and Codex: tiny blog benchmark

I checked whether Claude Fable 5 could be called from my local Claude Code CLI, then ran the same small edit task against Opus 4.8, Sonnet 4.6, and Codex CLI.

All four runs made npm test and npm run build pass on the tiny fictional blog fixture.
The fixture is too small to decide model quality from this result alone. This is a record of what changed after each agent received the same instruction and passed the same tests. Fable, Opus, and Sonnet made broadly similar changes. Codex passed the same tests but took a different path.

The timing numbers are local CLI measurements, not raw model inference speed. Network time, prompt caching, harness behavior, tool execution, and initial context size all affect them. I treat them as reference numbers only.

I did not use the real blog

Letting an AI agent edit the real blog directly would expose unpublished drafts, internal operation patterns, SEO choices, and build structure to the model. Writing an article about that work would also reveal more of the site structure than necessary.

So I borrowed only the tone and shape of LiltingChannelWeb tech articles and built a separate fictional static-blog fixture in another directory.

The experiment workspace was:

/Users/hide3tu/projects/agent-blog-benchmark/
  _template/
  _results/
  Fable/
  Opus/
  Sonnet/
  Codex/

_template was the source copy. I copied the same initial state into each model directory before running the agent. Each directory was an independent git repo so I could inspect git diff afterward.
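
A minimal sketch of that copy step, assuming the workspace layout shown above. The function name prepareRuns and the refusal to overwrite an existing directory are choices made for this example, not a record of the commands actually used.

```javascript
const fs = require('node:fs');
const path = require('node:path');

// Copy <root>/_template into one directory per run name.
// prepareRuns is a hypothetical helper; it throws rather than overwrite
// a directory that already exists, so a stale run cannot leak into a new one.
function prepareRuns(root, runNames) {
  const template = path.join(root, '_template');
  const created = [];
  for (const name of runNames) {
    const target = path.join(root, name);
    if (fs.existsSync(target)) {
      throw new Error(`refusing to overwrite ${target}`);
    }
    fs.cpSync(template, target, { recursive: true });
    created.push(target);
  }
  return created;
}

// Example call (left commented out so the sketch has no side effects):
// prepareRuns('/path/to/workspace', ['Fable', 'Opus', 'Sonnet', 'Codex']);
```

Running git init and making an initial commit in each copied directory afterward gives every run the same clean baseline for git diff.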

Test fixture

The fixture was a small dependency-free Node.js project.

package.json
BENCHMARK_TASK.md
src/articles.js
src/blog.js
scripts/build.js
test/blog.test.js

It is a fictional static blog. Articles have tech, lab, and tool categories. The project generates published article lists, related articles, canonical URLs, OGP metadata, RSS, and tag pages. In this fixture, tech means a technical article, lab means an experiment note, and tool means a tool write-up.

I also created a clean public copy and used GitHub Actions to run only npm test and npm run build. The fixture is intentionally broken in its initial state, so CI is red on the base branch. The CI exists to validate branches and PRs after an agent or a human fixes the fixture.

The public repo is hide3tu/agent-blog-benchmark.

The completed diffs are attached as four draft PRs. All four passed npm test and npm run build on GitHub Actions.

PR	Target
#1 Fable solution	Claude Fable 5
#2 Opus solution	Claude Opus 4.8
#3 Codex solution	Codex CLI
#4 Sonnet solution	Claude Sonnet 4.6

The intentionally broken areas were:

Area	Broken behavior
Metadata	Invalid article category, empty description on a tech article
Published list	Drafts were included, and articles were sorted oldest first
Related articles	The target itself and drafts were included, and results were sorted by shared-tag score ascending
SEO	Canonical URL used `/article/`, and the OGP fallback path was wrong
Description	Markdown and code blocks leaked into descriptions
RSS	Drafts were included, and canonical URLs were wrong
Tag pages	Only the first tag was indexed, and pagination was incomplete
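
The SEO and Description rows can be illustrated with a short sketch of the intended behavior. The paths /articles/<slug> and /images/og/<slug>.png come from the fixes described later in this article. The site origin, the ogImage field, the 120-character limit, and the function names are assumptions, and the Markdown stripping is a rough regex pass rather than a real parser.

```javascript
// Hypothetical site origin; the fixture's real base URL is not shown in this article.
const SITE_URL = 'https://example.com';

// Canonical URLs use the plural /articles/ segment.
function canonicalUrl(article) {
  return `${SITE_URL}/articles/${article.slug}`;
}

// Use an explicit image if the article has one (ogImage is an assumed field),
// otherwise fall back to the per-slug OGP image path.
function ogImagePath(article) {
  return article.ogImage || `/images/og/${article.slug}.png`;
}

// Reduce Markdown to plain text so markup does not leak into descriptions.
function plainDescription(markdown, maxLength = 120) {
  const text = markdown
    .replace(/`{3}[\s\S]*?`{3}/g, ' ')        // drop fenced code blocks
    .replace(/`([^`]*)`/g, '$1')              // unwrap inline code
    .replace(/!\[[^\]]*\]\([^)]*\)/g, ' ')    // drop images
    .replace(/\[([^\]]*)\]\([^)]*\)/g, '$1')  // keep link text only
    .replace(/^#{1,6}\s+/gm, '')              // strip heading markers
    .replace(/[*_~>]/g, '')                   // strip emphasis and quote markers (also hits snake_case)
    .replace(/\s+/g, ' ')                     // collapse whitespace
    .trim();
  return text.length > maxLength ? `${text.slice(0, maxLength - 1)}…` : text;
}

console.log(canonicalUrl({ slug: 'rss-canonical-regression' }));
// https://example.com/articles/rss-canonical-regression
```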

In the initial state, all seven tests failed.

tests 7
pass 0
fail 7

The task prompt was identical for all models.

Read BENCHMARK_TASK.md.
Fix this repository so `npm test` and `npm run build` pass.
Keep the project dependency-free, do not remove tests, and keep changes scoped.
Run the tests yourself before finishing.

What I compared

On this machine, Claude Code CLI was 2.1.170, and Codex CLI was 0.139.0.

For Claude Code, I ran:

claude --model fable
claude --model opus
claude --model sonnet

I did not pass --effort. However, my local ~/.claude/settings.json contained "effortLevel": "xhigh".
That means the three Claude runs used the local xhigh setting rather than a plain model default. The Claude Code result JSON did not expose the resolved effort value, so I cannot tell from the log whether any model-side cap was applied. In this article, I treat the Claude Code side as “local xhigh.”
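
Because the effective effort came from a settings file rather than the command line, it helps to record it next to each result. The sketch below reads the effortLevel key mentioned above from a settings file. The helper names, the "default" label, and the rule that an explicit command-line value takes priority are assumptions for labeling purposes, not documented Claude Code behavior.

```javascript
const fs = require('node:fs');

// Return the effortLevel stored in a settings file, or null if the file
// or the key is missing.
function readEffortLevel(settingsPath) {
  if (!fs.existsSync(settingsPath)) return null;
  const settings = JSON.parse(fs.readFileSync(settingsPath, 'utf8'));
  return typeof settings.effortLevel === 'string' ? settings.effortLevel : null;
}

// Build the label used in a results table.
// Assumption: a value passed on the command line is reported as-is,
// otherwise the local setting is reported with a "local" prefix.
function effortLabel(cliEffort, settingsPath) {
  if (cliEffort) return cliEffort;
  const local = readEffortLevel(settingsPath);
  return local ? `local ${local}` : 'default';
}

// Example call (path expansion for ~ is left to the caller):
// effortLabel(undefined, '/Users/<name>/.claude/settings.json');
```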

In this environment, opus resolved to claude-opus-4-8 and sonnet resolved to claude-sonnet-4-6. Fable was processed as claude-fable-5. The Fable log contained no reference to Opus 4.8, so I did not observe an Opus fallback for this prompt. Each Claude run also included a small auxiliary Haiku call made by Claude Code.

For Codex, I used codex exec --json --ephemeral --sandbox workspace-write. I did not specify the model or reasoning effort on the command line, but ~/.codex/config.toml had model = "gpt-5.5" and model_reasoning_effort = "high". Codex and Claude Code use different harnesses, so the cost and timing numbers should not be over-compared.

Results

All models made the tests and build pass.

Target	Effort	Tests	Build	Runtime	Estimated cost	Files changed	Diff (lines)
Claude Fable 5	local xhigh	7/7	OK	139.52s	$1.0522	2	+46 / -23
Claude Opus 4.8	local xhigh	7/7	OK	289.17s	$1.2158	2	+44 / -18
Claude Sonnet 4.6	local xhigh	7/7	OK	274.57s	$0.4745	2	+37 / -22
Codex CLI 0.139.0	gpt-5.5 high	7/7	OK	103.48s	n/a	2	+48 / -19

The local Codex JSONL did not expose a dollar-denominated estimated cost. It did expose token usage.

input_tokens: 184869
cached_input_tokens: 145664
output_tokens: 4381
reasoning_output_tokens: 1566
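
The token counts are easier to read as shares. The sketch below assumes that cached_input_tokens is a subset of input_tokens and that reasoning_output_tokens is a subset of output_tokens. That matches how these fields are usually reported, but I did not verify it against the Codex log format.

```javascript
// Token counts copied from the Codex JSONL summary above.
const usage = {
  input_tokens: 184869,
  cached_input_tokens: 145664,
  output_tokens: 4381,
  reasoning_output_tokens: 1566,
};

// Derive uncached input and the cached / reasoning shares.
// Assumes the cached and reasoning counts are subsets of their totals.
function summarizeUsage(u) {
  return {
    uncachedInput: u.input_tokens - u.cached_input_tokens,
    cachedShare: u.cached_input_tokens / u.input_tokens,
    reasoningShareOfOutput: u.reasoning_output_tokens / u.output_tokens,
  };
}

const summary = summarizeUsage(usage);
console.log(summary.uncachedInput);                          // 39205
console.log(`${(summary.cachedShare * 100).toFixed(1)}%`);   // 78.8%
```

Under that assumption, roughly four fifths of the input was served from cache, which is one reason the runtime should not be read as raw inference speed.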

For Claude, the estimated cost uses total_cost_usd from the Claude Code JSON output as-is. These runs were made through the claude command, not by calling the API directly. They are not actual API invoices. I treat them as dollar-denominated estimates reported in the Claude Code execution log.

After the runs, the Claude usage display showed about 3% for time-based usage. I had not run any other Claude work, so that figure is a rough indicator for these three Claude Code runs combined. Still, total_cost_usd and subscription usage are not the same metric.

What changed

Fable, Opus, and Sonnet made very similar fixes.

Changed rss-canonical-regression from category: "tool" to category: "tech"
Added the missing description
Excluded drafts from the published list and sorted newest first
Excluded the target itself and drafts from related articles
Sorted by shared tag count descending, then by publish date descending
Fixed canonical URLs to /articles/<slug>
Fixed the OGP fallback to /images/og/<slug>.png
Built descriptions after stripping Markdown and code fences
Indexed every tag on tag pages and fixed pagination
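
The list-related fixes above can be sketched as follows. This is not any agent's actual diff. The field names (slug, draft, publishedAt, tags), the ISO date strings, the default limit, and the choice to build tag pages from published articles only are assumptions for the example.

```javascript
// Published list: no drafts, newest first.
function publishedArticles(articles) {
  return articles
    .filter((a) => !a.draft)
    // ISO date strings (YYYY-MM-DD) sort correctly as plain strings.
    .sort((a, b) => b.publishedAt.localeCompare(a.publishedAt));
}

// Related articles: skip the target and drafts, then sort by shared tag
// count descending, breaking ties by publish date descending.
function relatedArticles(target, articles, limit = 3) {
  const targetTags = new Set(target.tags);
  return articles
    .filter((a) => !a.draft && a.slug !== target.slug)
    .map((a) => ({
      article: a,
      score: a.tags.filter((tag) => targetTags.has(tag)).length,
    }))
    .sort((x, y) =>
      y.score - x.score ||
      y.article.publishedAt.localeCompare(x.article.publishedAt))
    .slice(0, limit)
    .map((entry) => entry.article);
}

// Tag index: register every tag of every published article,
// not only the first tag.
function tagIndex(articles) {
  const index = new Map();
  for (const article of publishedArticles(articles)) {
    for (const tag of article.tags) {
      if (!index.has(tag)) index.set(tag, []);
      index.get(tag).push(article.slug);
    }
  }
  return index;
}

// Pagination: fixed-size chunks, including a short final page.
function paginate(items, pageSize) {
  const pages = [];
  for (let i = 0; i < items.length; i += pageSize) {
    pages.push(items.slice(i, i + pageSize));
  }
  return pages;
}
```

filter returns a new array before sort runs, so neither function mutates the caller's article list.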

All three Claude-side runs also added the テスト tag (Japanese for "test") to agent-cli-benchmark-setup. That is a content change made to satisfy the test expectation, but the article body talks about npm test, so the tag still fits the article.

Codex also passed the tests, but this part differed.

Changed rss-canonical-regression from tool to lab
Left the rss-canonical-regression description empty
Added Codex tags to multiple articles to satisfy the expected related-article order

This passes because lab is also a valid category in the test. But classifying an RSS and canonical URL regression article as lab misses the intent of this fixture.
The automated tests did not capture that semantic mismatch.
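
One way to catch that mismatch is a test that pins the intended category per slug instead of only checking that the category is valid. The sketch below uses the built-in node:test runner to stay dependency-free. The pin table, the helper name, and the inline article data are hypothetical stand-ins for the fixture's real data.

```javascript
const test = require('node:test');
const { deepStrictEqual } = require('node:assert');

// Hypothetical pin table: slug -> the category a human reviewer expects.
const PINNED_CATEGORIES = { 'rss-canonical-regression': 'tech' };

// Return one entry per pinned article whose category differs from its pin.
function findCategoryMismatches(articles, pinned) {
  return articles
    .filter((a) => Object.hasOwn(pinned, a.slug) && a.category !== pinned[a.slug])
    .map((a) => ({ slug: a.slug, expected: pinned[a.slug], actual: a.category }));
}

// Inline stand-in for the fixture's article list (shape assumed).
const pinnedSample = [{ slug: 'rss-canonical-regression', category: 'tech' }];

test('pinned articles keep their intended category', () => {
  deepStrictEqual(findCategoryMismatches(pinnedSample, PINNED_CATEGORIES), []);
});
```

With a pin like this, changing the article to lab would fail the test even though lab is a valid category.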

Timing is only a reference

In this local run, Codex finished in 103.48 seconds. On the Claude side, Fable finished in 139.52 seconds, shorter than Opus 4.8 and Sonnet 4.6.

These numbers are not model-only tokens/sec, for several reasons.

The CLI harness differs
Cache behavior differs
Tool execution differs
The amount of self-checking during the run differs
Claude Code also includes the auxiliary Haiku calls

So I only read the timing table as “what happened in this environment, for this task, through this CLI.”
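
For reference, this is what a wall-clock measurement of a whole CLI process looks like. It is a minimal sketch, not necessarily how the Runtime column was collected, and timeCommand is a hypothetical helper. The elapsed time covers process startup, network, tool calls, and everything else the command does.

```javascript
const { spawnSync } = require('node:child_process');

// Run a command once and report wall-clock time for the entire process.
function timeCommand(command, args, options = {}) {
  const start = process.hrtime.bigint();
  const result = spawnSync(command, args, { encoding: 'utf8', ...options });
  const elapsedMs = Number(process.hrtime.bigint() - start) / 1e6;
  return { elapsedMs, status: result.status, stdout: result.stdout };
}

// Demo with a command that exists wherever Node does.
const timedRun = timeCommand(process.execPath, ['-e', 'console.log("ok")']);
console.log(`${timedRun.elapsedMs.toFixed(0)} ms, exit ${timedRun.status}`);
```

A number measured this way cannot separate model latency from harness overhead, which is why the table is only a reference.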

What the diffs show

With a fixture this small, Fable 5-specific capability differences did not really surface. Every model passed the tests in one run.
Still, Fable had the shortest runtime of the three Claude runs, and its estimated cost was lower than that of Opus 4.8. At this scale, it compares well enough with Opus 4.8.
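
The comparison with Opus 4.8 can be made concrete using the values from the results table. The helper name relativeTo is only for this example.

```javascript
// Runtime (seconds) and estimated cost (USD) copied from the results table.
const runs = {
  fable: { seconds: 139.52, usd: 1.0522 },
  opus: { seconds: 289.17, usd: 1.2158 },
  sonnet: { seconds: 274.57, usd: 0.4745 },
};

// Express one run as a fraction of a baseline run.
function relativeTo(baseline, run) {
  return {
    runtimeRatio: run.seconds / baseline.seconds,
    costRatio: run.usd / baseline.usd,
  };
}

const fableVsOpus = relativeTo(runs.opus, runs.fable);
console.log(fableVsOpus.runtimeRatio.toFixed(2)); // 0.48
console.log(fableVsOpus.costRatio.toFixed(2));    // 0.87
```

In this single run, Fable took about half the wall-clock time of Opus 4.8 at about 87% of the estimated cost. With one run per model, these ratios describe what happened and nothing more.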

Codex finished in 103.48 seconds, but its diff included changes that drifted away from the article’s intent. That is less a Codex-specific failure than a gap in the test design: I did not encode the constraint that rss-canonical-regression should remain a tech article.

This benchmark is small. It covers static-blog metadata and article-list behavior, not long exploration, design judgment, broad refactoring, or multi-file spec reading.

Claude Fable 5 is described as the kind of model that should show its value on long, ambiguous work. A seven-test fixture like this mostly compares “read the tests and fix the project quickly” behavior, not the upper bound of the model.

Even so, this was easier to compare than asking each model to build a simple bulletin board. RSS, canonical URLs, OGP, draft exclusion, related articles, and tag pages make failure modes easy to imagine in a static-site context.

If I ran this again, I would add:

A test that pins the expected category
Human evaluation for suspicious tag additions made only to pass related-article tests
A larger existing-codebase-style fixture
Three repeated runs instead of one
A count of self-repair attempts after failures
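
The human evaluation of tag additions could start from a small audit that lists which tags each run added relative to the template. This is a sketch under assumptions: the { slug, tags } shape and the original cli tag in the sample data are hypothetical, and only the テスト addition is taken from the runs described above.

```javascript
// List tags that exist in the result but not in the template, per slug.
// Articles that are new in the result are reported with all of their tags.
function findAddedTags(before, after) {
  const beforeTags = new Map(before.map((a) => [a.slug, new Set(a.tags)]));
  const report = [];
  for (const article of after) {
    const original = beforeTags.get(article.slug) ?? new Set();
    const added = article.tags.filter((tag) => !original.has(tag));
    if (added.length > 0) report.push({ slug: article.slug, added });
  }
  return report;
}

// Hypothetical before/after data for one article.
const templateArticles = [{ slug: 'agent-cli-benchmark-setup', tags: ['cli'] }];
const resultArticles = [{ slug: 'agent-cli-benchmark-setup', tags: ['cli', 'テスト'] }];

console.log(findAddedTags(templateArticles, resultArticles));
```

The report only shows what changed. Whether an added tag fits the article body still needs a human decision.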