AI slop detector from Claude Code edit logs: ModernBERT-ja, 0 false positives

The tech posts on this blog get written in all kinds of setups: by me, by Claude, by Codex.
Every post goes through a style check before publishing, but AI-slop phrasing still slips through every time.
The detection rules are written down, yet no matter how hard the LLM is told to watch for them, it keeps producing AI-sounding sentences and keeps skipping the ones that need fixing.

The idea: train a small encoder model to do detection only, and put it in front of the LLM scan.
Sentence-level classification is exactly what BERT-family models are good at, and no generation is needed.

How many sentences does training take

Supervised fine-tuning of an encoder needs on the order of 1,000 slop sentences even for binary classification.
My verified corpus (misses I had personally flagged and confirmed) held 21 entries.

The next candidate was the git commits from publishing posts.
The rewrite diff at publish time keeps pairs of the flagged slop sentence and its fixed version.
Counting them: out of 213 publish commits, only 21 rewrote an existing post by 30+ lines.
And those diffs mix in content additions and link edits, with style fixes making up only about 40%.
Clean pairs would top out at 150–300 sentences, several times short of the 1,000 needed.

Building a corpus from the conversation logs

Claude Code stores conversations as JSONL under ~/.claude/projects/.
Article fixes go through the Edit tool, so old_string (before) and new_string (after) sit in the history inside each tool call.
The “flag → fix” pairs are recorded before they ever reach git.

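# Sketch: extract Edit tool calls from the conversation-history JSONL.
# `record` is one parsed JSONL line; `pairs` is a list created beforehand.
for b in record["message"]["content"]:
    # Content can also be a plain string, so skip anything that is not a block dict.
    if isinstance(b, dict) and b.get("type") == "tool_use" and b.get("name") == "Edit":
        inp = b["input"]
        if "src/content/articles" in inp["file_path"]:
            pairs.append((inp["old_string"], inp["new_string"]))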

The result from two machines.

Machine	Edit pairs	Posts covered
Mac mini	823	50
MacBook Pro	737	17
Total	1,560	67

The conversation logs hold more than five times what the git diffs could provide.
Not all of it is style work; content fixes and link edits are mixed in, and the style ratio gets sorted out later.

One catch: Claude Code deletes history after 30 days by default.
What I pulled covers only the last month; everything older is gone.
I set cleanupPeriodDays to 365 so future history survives.
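
A minimal sketch of making that change from a script. It assumes the setting is a top-level key in a JSON settings file (commonly `~/.claude/settings.json`); the helper name `set_cleanup_period` and the path handling are illustrative, not taken from my setup.

```python
import json
from pathlib import Path


def set_cleanup_period(settings_path: Path, days: int = 365) -> dict:
    """Set cleanupPeriodDays in a JSON settings file, keeping other keys."""
    settings = {}
    if settings_path.exists():
        # Preserve whatever is already configured.
        settings = json.loads(settings_path.read_text(encoding="utf-8"))
    settings["cleanupPeriodDays"] = days
    settings_path.parent.mkdir(parents=True, exist_ok=True)
    settings_path.write_text(json.dumps(settings, indent=2) + "\n", encoding="utf-8")
    return settings
```

Calling `set_cleanup_period(Path.home() / ".claude" / "settings.json")` would apply it to the user-level file, if that is where the setting lives on your machine.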

As a bonus, I extracted 36 moments where my own typed messages called out style problems.
These were comments like “the intransitive verbs and word choices all smell like AI”. Checking them against the rulebook showed every recent one had already been converted into a detection rule.
That confirmed, from the opposite direction, that no feedback had been lost on its way into the rules.

Scanning 622 published posts

The style-check rules grew over months of use, which means early posts only ever faced the weak early versions.
Re-reading them with today’s definitions surfaces undetected slop sentences as candidates.

I wrote a batch that feeds Claude Code headless mode (claude -p --model sonnet) one prompt per post: three detection-definition files plus the article body with line numbers.
About 51 seconds per post, across the 622 tech posts published before June 2026.
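
A rough sketch of what one such call could look like. The `N: text` line-number format, the `ARTICLE:` separator, and the function names are assumptions for illustration; passing the prompt as a positional argument to `claude -p` is one way to invoke headless mode, not necessarily how my batch does it.

```python
import subprocess
from pathlib import Path


def number_lines(body: str) -> str:
    """Prefix each line with its 1-based number so detections can cite lines."""
    return "\n".join(f"{i}: {line}" for i, line in enumerate(body.splitlines(), 1))


def build_prompt(definitions: list, body: str) -> str:
    """Concatenate the detection definitions and the numbered article body."""
    return "\n\n".join([*definitions, "ARTICLE:", number_lines(body)])


def scan_post(definition_paths: list, article_path: Path) -> str:
    """Run one headless detection call and return its raw stdout."""
    definitions = [Path(p).read_text(encoding="utf-8") for p in definition_paths]
    prompt = build_prompt(definitions, article_path.read_text(encoding="utf-8"))
    result = subprocess.run(
        ["claude", "-p", "--model", "sonnet", prompt],
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

Numbering the lines up front lets each detection point back at a specific line without the model having to count.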

Detection only; the batch never edits an article.
The output is unverified Sonnet-labeled data, so it lives in a separate directory from my human-confirmed corpus.

At post 204 the plan quota ran out and the remaining 418 came back as empty errors.
The script also had a bug that saved error results as “done”, so I changed it to skip saving errors, stop after 5 consecutive failures, and retry on a one-hour loop.
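
The fixed behavior can be sketched roughly as follows. The `scan` callable, the one-JSON-file-per-post layout, and the `max_passes` limit are hypothetical; only the three rules (never save errors, stop after 5 consecutive failures, retry hourly) come from the actual script.

```python
import json
import time
from pathlib import Path


def run_batch(post_ids, scan, out_dir: Path, max_consecutive_failures: int = 5) -> bool:
    """Scan every post that has no saved result; return True when all are done."""
    out_dir.mkdir(parents=True, exist_ok=True)
    failures = 0
    for post_id in post_ids:
        target = out_dir / f"{post_id}.json"
        if target.exists():
            continue  # already scanned successfully
        try:
            output = scan(post_id)
            if not output.strip():
                raise RuntimeError("empty response")
        except Exception:
            # Errors are never written to disk, so the post is retried next pass.
            failures += 1
            if failures >= max_consecutive_failures:
                return False  # probably quota exhaustion; give up this pass
            continue
        failures = 0
        target.write_text(json.dumps({"post": post_id, "raw": output}), encoding="utf-8")
    return all((out_dir / f"{p}.json").exists() for p in post_ids)


def run_until_done(post_ids, scan, out_dir, wait_seconds=3600, max_passes=24, sleep=time.sleep):
    """Re-run the batch on a fixed interval until it completes or passes run out."""
    for _ in range(max_passes):
        if run_batch(post_ids, scan, out_dir):
            return True
        sleep(wait_seconds)
    return False
```

Because a result file only exists after a successful scan, "done" and "saved" mean the same thing, which is what the original bug had broken.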

Final numbers after the full run: 7,208 detections, 7,133 unique sentences, 2,493 rated strong.
15 of the 622 posts came back with zero detections, so it does not just flag everything.

The category distribution put “conditional clause landing on a generalized conclusion” alone at 23%, in first place.
That pattern had been added to the definitions only days earlier, after I flagged it to Claude, and the numbers now show it was Claude’s single most common writing habit across the whole archive.
“English jargon left untranslated” follows at 17%, “closing on an evaluative verb” at 12%, and the top five patterns cover 67%.

The human class: my old WordPress diary

The classifier also needs sentences that are certain to be human-written.

A SQL dump of the WordPress diary I kept from 2011 to 2018 was still around.
Parsing the phpMyAdmin-style INSERT statements produced 23 posts, about 20,000 characters, roughly 600 sentences.
All of it predates Claude, so the human class has zero risk of contamination.

flowchart TD
    A[Conversation JSONL<br/>2 machines] -->|Edit pair extraction| P[Before/after pairs 1,560]
    B[622 published posts] -->|Sonnet batch scan| H[Machine-labeled slop candidates]
    C[WordPress diary SQL] -->|INSERT parsing| N[Human sentences ~600]
    D[Verified corpus] --> V[Type-labeled examples 21]
    P --> T[Training dataset]
    H --> T
    N --> T
    V --> T

One caveat.
The WP diary is all diary-register writing, while the slop sentences are all technical-register. Paired as-is, there is no telling whether the classifier learned “slop vs natural” or “diary vs tech article”.
To cancel the genre skew, the natural class also gets post-fix sentences from published tech posts: human-approved technical-register text.

Sorting 1,560 edit pairs into style vs content

The Edit pairs mix style fixes with content fixes (fact corrections, link additions).
Only style fixes are usable for training, so Sonnet sorted them in batches of 40 into three bins.

Verdict	Count	Use
style (wording only)	805	`old` side as slop, `new` side as natural
content	629	not used
mixed	126	not used

Style fixes came out to more than half.
The sorting ran as 39 batches of 40 and finished in under 40 minutes.

Assembling the dataset

All of the material above became a sentence-level binary dataset (slop / natural).

Split	Size	Contents
`train`	3,782 sentences (1,664 slop / 2,118 natural)	style-pair `old` side 899 + scan `strong` 765 as slop; `new` side 797 + clean article sentences 866 + WP diary 455 as natural
`val`	148 sentences	split by article, so no article leaks across train and eval
`gold`	76 sentences	26 human-confirmed misses + 50 held-out WP diary sentences. Never used in training

Two design decisions.
The 1,586 weak scan hits stay out of the first run: the scan agents are instructed to prefer false positives over misses, so training on the weak hits would teach the model those false positives too.
The gold set keeps the verified corpus (misses I flagged and categorized myself) out of training entirely. Evaluation built the same way as the training data cannot show whether the model works on new sentences, so a separate set built purely from human judgment sits beside it.
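
A minimal sketch of an article-level split like the one used for `val`. The `(article_id, sentence, label)` row shape, the hash-based assignment, and the 5% default are assumptions; the point is only that every sentence from one article lands in the same split.

```python
import hashlib


def assign_split(article_id: str, val_fraction: float = 0.05) -> str:
    """Deterministically map an article to 'train' or 'val' by hashing its id."""
    digest = hashlib.sha256(article_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "val" if bucket < val_fraction else "train"


def split_rows(rows, val_fraction: float = 0.05) -> dict:
    """rows: iterable of (article_id, sentence, label) tuples."""
    splits = {"train": [], "val": []}
    for article_id, sentence, label in rows:
        splits[assign_split(article_id, val_fraction)].append((article_id, sentence, label))
    return splits
```

Hashing the article id instead of shuffling sentences keeps the assignment stable when the dataset is rebuilt with more data.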

The model: ModernBERT-ja

The default choice for Japanese BERT is Tohoku BERT, but I picked SB Intuitions’ ModernBERT-ja-130m.
Its training corpus is both recent and large (4T+ tokens), so it has already seen the era of tech prose it will judge.
It also needs no MeCab-family tokenizer install, and published Japanese classification benchmarks put it ahead of the same-size Tohoku BERT.
If it fails, the fallback is Tohoku BERT v3.

Training runs on my main work machine, a Mac mini.

Item	Value
Machine	Mac mini (Apple M4, 16GB unified memory)
OS	macOS 26.3
Python	3.14.4 (venv)
Stack	torch 2.12.1 (MPS) + transformers 5.13.0
Model	sbintuitions/modernbert-ja-130m

Training is fully local, independent of the Sonnet batch scan.
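
For reference, a sketch of loading this model with a two-class head through the standard transformers auto classes. The label order (`natural` = 0, `slop` = 1) and the helper name `load_classifier` are my assumptions here, not a record of the actual training script.

```python
LABELS = ["natural", "slop"]  # assumed label order
ID2LABEL = dict(enumerate(LABELS))
LABEL2ID = {name: i for i, name in ID2LABEL.items()}
MODEL_NAME = "sbintuitions/modernbert-ja-130m"


def load_classifier(model_name: str = MODEL_NAME):
    """Load the tokenizer and a 2-class head on the encoder (downloads weights)."""
    # Imported here so the constants above are usable without torch installed.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    # Use Apple's MPS backend when available, otherwise fall back to CPU.
    device = "mps" if torch.backends.mps.is_available() else "cpu"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name, num_labels=len(LABELS), id2label=ID2LABEL, label2id=LABEL2ID
    )
    return tokenizer, model.to(device), device
```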

First training run: perfect on one side only

Three epochs took 418 seconds.

Evaluation	Result
`val` (148 sentences)	accuracy 76.4% / F1 0.762
`gold` false positives on 50 human sentences	0 (precision 1.0)
`gold` detection of 26 confirmed misses	11 (recall 42%)

The result skewed hard: zero false positives, less than half detected.
Since it never marks human text as slop, it can sit in front of the pipeline without breaking existing sentences.
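
The gold numbers can be reproduced with a few lines. This sketch assumes labels of 1 for slop and 0 for natural; the prediction lists are reconstructed from the counts in the table, not from the real model output.

```python
def gold_metrics(predictions: list, labels: list) -> dict:
    """Compute false positives, precision, and recall; 1 = slop, 0 = natural."""
    tp = sum(p == 1 and y == 1 for p, y in zip(predictions, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(predictions, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(predictions, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {"false_positives": fp, "precision": precision, "recall": recall}


# 50 human sentences never flagged, 11 of 26 confirmed misses caught.
labels = [0] * 50 + [1] * 26
predictions = [0] * 50 + [1] * 11 + [0] * 15
print(gold_metrics(predictions, labels))
```

Precision comes out at 1.0 because all 11 flagged sentences are true slop, and recall is 11/26.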

The 15 misses were almost all “conditional clause landing on a generalized conclusion” and the “also check X.” inspection-imperative type.
In other words, this classifier missed exactly the same types Sonnet’s scans had been missing all along.

Whether these two types read as slop depends heavily on their position at the end of a paragraph.
“Also check the npm dependency tree.” is a normal sentence in the middle of a walkthrough, and an inspection imperative when it stands alone at a paragraph’s end.
Feeding the model isolated sentences drops that positional information. The classifier never received the evidence the judgment requires.
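
One way to hand that evidence to the model is to prepend a position tag and the preceding sentence to each input. The tag strings, the `|||` separator, and the function name below are hypothetical, a sketch of the idea and not the exact format used in training.

```python
def build_inputs(paragraph: list, n_context: int = 1) -> list:
    """Turn one paragraph's sentences into position-tagged inputs with context."""
    inputs = []
    last = len(paragraph) - 1
    for i, sentence in enumerate(paragraph):
        # A single-sentence paragraph counts as a paragraph end.
        if i == last:
            tag = "[PARA_END]"
        elif i == 0:
            tag = "[PARA_START]"
        else:
            tag = "[PARA_MID]"
        context = " ".join(paragraph[max(0, i - n_context):i])
        prefix = f"{tag} {context}" if context else tag
        inputs.append(f"{prefix} ||| {sentence}")
    return inputs
```

With this format the same sentence produces a different input depending on where it sits, so the classifier can tell a mid-walkthrough step from a closing imperative.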

Two smaller lessons.
A regex mistake in the dataset builder let 2 bad rows into gold (a template example line and a multi-word summary entry); with those removed, effective recall is 11/24, 46%.
And the best val score came at epoch 1 of 3; epochs and learning rate were still completely untuned.

v2 hit out-of-memory three times

With all 622 posts scanned, v2 retrained on the full data with position tags and preceding sentences (train 6,530 sentences).
16GB on the M4 Mac mini was not enough.

Attempt	Config	Result
1	batch 16, 256 tokens	OOM (training 7.1GB + other processes 12.9GB)
2	batch 8 × grad accum 2	OOM (training 5.3GB + other 14.9GB)
3	batch 4 × accum 4, 192 tokens	OOM (training 3.5GB + other 16.4GB)
4	+ `PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0`	completed

Each time the training footprint shrank, memory use by the browser and other processes grew, and the margin under torch’s roughly 20GB allocation cap kept shrinking.
The ceiling this time was set not by model size but by whatever else the machine was running.
Removing torch’s watermark cap got it through: actual training demand was under 4GB, so spilling into swap did no harm.
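
A small sketch of the two knobs involved. Setting the variable from Python before torch is imported is an assumption about ordering that I use to be safe; exporting it in the shell works as well. The attempt tuples restate the table above.

```python
import os

# 0.0 disables the MPS allocator's high-watermark limit. With the limit gone,
# a genuine system-wide shortage can destabilize the machine, so use with care.
os.environ["PYTORCH_MPS_HIGH_WATERMARK_RATIO"] = "0.0"

# (per-step batch size, gradient-accumulation steps) for attempts 1 to 3.
ATTEMPTS = [(16, 1), (8, 2), (4, 4)]


def effective_batch(batch: int, accum: int) -> int:
    """Number of examples contributing to each optimizer step."""
    return batch * accum


for batch, accum in ATTEMPTS:
    print(batch, accum, effective_batch(batch, accum))
```

All three attempts keep the effective batch at 16, so shrinking the per-step batch traded memory for time without changing the optimization setup.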

One more snag along the way: exactly one post out of 622 failed the scan on every retry.
It was a write-up of an npm worm, and Sonnet’s cybersecurity safeguards were reacting to the malware-technique descriptions.
It is my own published article, but a headless one-shot call carries no context that the author is style-checking their own text, so to the safeguard it looks like a malware-analysis request. That one post was marked skipped.

v2 results

Metric	v1 (bare sentences)	v2 (position tags + context + full data)
`val` accuracy / F1	76.4% / 0.762	84.7% / 0.841
`gold` FP on 50 human sentences	0	0
`gold` recall on confirmed misses	42%	52%

val improved by 8+ points and kept climbing each epoch (v1 had plateaued at epoch 1).
False positives on human text stayed at zero on the full dataset.

gold recall stopped at 52%, but this evaluation runs under a condition that training never sees.
The gold entries record only the flagged sentence itself, with no surrounding context, so gold is the one evaluation that runs context-free. For a model trained with preceding sentences that is a handicap, and the number likely understates real performance on full articles.
Adding a context field to the corpus entry format removes the distortion from here on.

The remaining 12 misses were again the semantic core of “conditional clause landing on a generalized conclusion” plus evaluative flat assertions.
Catching “The indicators JFrog lists are direct.” without any syntactic cue requires judging that the sentence is an evaluative summary, which is a semantic call. That is the edge of what a 130M binary classifier reaches unaided; the two paths forward are scaling to 310M and adding the 1,600 verified weak detections.

v3: removing label noise

With scores stuck in the 80s, I went looking at label quality rather than model capacity.
One problem surfaced immediately. The old side of each Edit pair had been labeled slop wholesale, but multi-sentence excerpts contain sentences that were never edited. Any sentence present on both sides carried a slop label it did not deserve.
v3 labels only old-side-exclusive sentences as slop and new-side-exclusive ones as natural, drops everything appearing on both sides, and extends training from 3 to 5 epochs.
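
The relabeling rule can be sketched as a set difference over sentences. The splitter here is a naive stand-in (it splits after Japanese terminators, or after ASCII ones followed by whitespace), and the function names are illustrative.

```python
import re

# Split after Japanese terminators, or after ASCII ones followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[。！？])|(?<=[.!?])\s+")


def split_sentences(text: str) -> list:
    """Naive splitter; a stand-in for whatever segmenter the pipeline uses."""
    return [s.strip() for s in _SENTENCE_END.split(text) if s.strip()]


def label_pair(old: str, new: str) -> tuple:
    """Return (slop, natural): sentences exclusive to the old and new side."""
    old_sents, new_sents = split_sentences(old), split_sentences(new)
    old_set, new_set = set(old_sents), set(new_sents)
    # Sentences present on both sides were never edited, so they get no label.
    slop = [s for s in old_sents if s not in new_set]
    natural = [s for s in new_sents if s not in old_set]
    return slop, natural
```

For a pair where only the second sentence was rewritten, the untouched first sentence drops out and only the rewritten one contributes a slop/natural example.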

The first v3 run died somewhere unrelated to training: only 2.6GB of disk was free, so neither per-epoch checkpoints nor Metal shader caches could be written.
After the 16GB memory ceiling came the 228GB disk. Clearing 50+GB got the run through.

Metric	v1	v2	v3
`val` accuracy / F1	76.4% / 0.762	84.7% / 0.841	85.5% / 0.851
`gold` FP on 50 human sentences	0	0	0
`gold` recall on confirmed misses	42%	52%	60%

gold recall stepped up 42% → 52% → 60%, with human-text false positives at zero across all three generations.
v3 now catches evaluative flat assertions that v2 missed, such as “The indicators JFrog lists are direct.”

The 10 remaining misses concentrate where “conditional clause landing on a generalized conclusion” demands real semantic judgment.
Catching “Builds run constantly during development, so using the build environment as a trigger fires in both CI and local setups at high frequency.” requires reading that the sentence ends in general-law register rather than a specific fact. The 130M binary classifier cannot make that call.

The encoder chapter ends with a front-stage filter that catches 60% of confirmed misses with zero false positives, on 7 minutes of training and millisecond inference.
The remaining 40%, the semantic judgments, go to the next experiment: fine-tuning a small LLM. The 805 style pairs double as slop-to-natural rewrite examples, so the next model will propose fixes, not just flag sentences.

The data collected

Data	Volume	Class	Status
Edit pairs from conversation logs	style 805 / content 629 / mixed 126	style `old` = slop, `new` = natural	sorted by Sonnet
Batch scan of published posts	622 posts, 7,208 detections	2,493 `strong` used as slop	machine-labeled
WordPress diary	~600 sentences	natural (human)	confirmed
Verified corpus	28 entries	`gold` test only	confirmed with type labels
Style feedback moments	36	labeling reference	confirmed