AI slop detector from Claude Code edit logs: ModernBERT-ja, 0 false positives
The tech posts on this blog get written in all kinds of setups: by me, by Claude, by Codex.
Every post goes through a style check before publishing, but AI-slop phrasing still slips through every time.
The detection rules are written down, yet no matter how hard the LLM is told to watch for them, it keeps producing AI-sounding sentences and keeps skipping the ones that need fixing.
The idea: train a small encoder model to do detection only, and put it in front of the LLM scan.
Sentence-level classification is exactly what BERT-family models are good at, and no generation is needed.
How many sentences does training take?
Supervised fine-tuning of an encoder needs on the order of 1,000 slop sentences even for binary classification.
My verified corpus (misses I had personally flagged and confirmed) held 21 entries.
The next candidate was the git commits from publishing posts.
The rewrite diff at publish time keeps pairs of the flagged slop sentence and its fixed version.
Counting them: out of 213 publish commits, only 21 rewrote an existing post by 30+ lines.
And those diffs mix in content additions and link edits, with style fixes making up only about 40%.
Clean pairs would top out at 150–300 sentences, an order of magnitude short.
Building a corpus from the conversation logs
Claude Code stores conversations as JSONL under ~/.claude/projects/.
Article fixes go through the Edit tool, so old_string (before) and new_string (after) sit in the history inside each tool call.
The “flag → fix” pairs are recorded before they ever reach git.
# Skeleton: extract Edit tool calls from the conversation-history JSONL.
# `record` is one parsed JSONL line; `pairs` collects (before, after) tuples.
for b in record["message"]["content"]:
    if b.get("type") == "tool_use" and b.get("name") == "Edit":
        inp = b["input"]
        if "src/content/articles" in inp["file_path"]:
            pairs.append((inp["old_string"], inp["new_string"]))
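A fuller sketch of the same extraction, walking every JSONL file under a log directory. The function name `extract_edit_pairs`, the `path_filter` default, and the guards for malformed lines and non-list message content are assumptions of this sketch, not part of the original script.

```python
import json
from pathlib import Path


def extract_edit_pairs(log_root, path_filter="src/content/articles"):
    """Collect (old_string, new_string) pairs from Edit tool calls.

    Assumes each JSONL line is one record whose message content may be
    a list of blocks; anything else is skipped.
    """
    pairs = []
    for jsonl_path in sorted(Path(log_root).rglob("*.jsonl")):
        with jsonl_path.open(encoding="utf-8") as fh:
            for line in fh:
                try:
                    record = json.loads(line)
                except json.JSONDecodeError:
                    continue  # tolerate blank or truncated lines
                if not isinstance(record, dict):
                    continue
                message = record.get("message")
                content = message.get("content") if isinstance(message, dict) else None
                if not isinstance(content, list):
                    continue  # plain-string messages carry no tool calls
                for block in content:
                    if not isinstance(block, dict):
                        continue
                    if block.get("type") == "tool_use" and block.get("name") == "Edit":
                        inp = block.get("input", {})
                        if path_filter in inp.get("file_path", ""):
                            pairs.append((inp.get("old_string", ""), inp.get("new_string", "")))
    return pairs
```

Calling `extract_edit_pairs(Path.home() / ".claude" / "projects")` on each machine and concatenating the lists would give the combined pair set.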
The result from two machines.
| Machine | Edit pairs | Posts covered |
|---|---|---|
| Mac mini | 823 | 50 |
| MacBook Pro | 737 | 17 |
| Total | 1,560 | 67 |
The conversation logs hold more than five times what the git diffs could provide.
Not all of it is style work; content fixes and link edits are mixed in, and the style ratio gets sorted out later.
One catch: Claude Code deletes history after 30 days by default.
What I pulled covers only the last month; everything older is gone.
I set cleanupPeriodDays to 365 so future history survives.
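A minimal sketch of that change, assuming the setting is read from a user-level JSON settings file at `~/.claude/settings.json` (check the location for your install). `set_cleanup_period` is a hypothetical helper that merges the key without touching other settings.

```python
import json
from pathlib import Path


def set_cleanup_period(settings_path, days=365):
    """Merge cleanupPeriodDays into a settings JSON file, keeping other keys."""
    path = Path(settings_path)
    settings = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    settings["cleanupPeriodDays"] = days
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(settings, indent=2) + "\n", encoding="utf-8")
    return settings


# Assumed location of the user-level settings file:
# set_cleanup_period(Path.home() / ".claude" / "settings.json")
```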
As a bonus, I extracted 36 moments where my own typed messages called out style problems.
They were comments like “the intransitive verbs and word choices all smell like AI”. Checking them against the rulebook showed that every recent one had already been converted into a detection rule.
That confirmed, from the opposite direction, that no feedback had been lost on its way into the rules.
Scanning 622 published posts
The style-check rules grew over months of use, which means early posts only ever faced the weak early versions.
Re-reading them with today’s definitions surfaces undetected slop sentences as candidates.
I wrote a batch that feeds Claude Code headless mode (claude -p --model sonnet) one prompt per post: three detection-definition files plus the article body with line numbers.
About 51 seconds per post, across the 622 tech posts published before June 2026.
Detection only; the batch never edits an article.
The output is unverified Sonnet-labeled data, so it lives in a separate directory from my human-confirmed corpus.
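The per-post prompt can be sketched as below. The `N: ` line-number format, the section label, and both function names are assumptions; the original batch only states that the prompt holds three definition files plus the line-numbered body. Passing the prompt to the CLI is left out.

```python
def number_lines(body):
    """Prefix each line with its 1-based number so detections can cite lines."""
    return "\n".join(
        f"{i}: {line}" for i, line in enumerate(body.splitlines(), start=1)
    )


def build_prompt(definition_texts, article_body):
    """Join the detection-definition texts and the numbered article body."""
    parts = list(definition_texts)
    parts.append("Article body (line-numbered):\n" + number_lines(article_body))
    return "\n\n".join(parts)
```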
At post 204 the plan quota ran out and the remaining 418 came back as empty errors.
The script also had a bug that saved error results as “done”, so I changed it to skip saving errors, stop after 5 consecutive failures, and retry on a one-hour loop.
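The fixed control flow might look like the following sketch. `scan` and `save` are injected stand-ins for the real Sonnet call and the result writer, and `max_rounds` is an added assumption so that a post which always fails cannot loop forever.

```python
import time


def run_batch(posts, scan, save, max_consecutive_failures=5):
    """Scan posts in order and return the ones still pending.

    A failure is an exception or an empty result. Failures are never
    passed to `save`, so a failed post is not recorded as done.
    """
    pending = []
    failures = 0
    for index, post in enumerate(posts):
        try:
            result = scan(post)
        except Exception:
            result = None
        if not result:
            pending.append(post)
            failures += 1
            if failures >= max_consecutive_failures:
                # Likely quota exhaustion: stop and keep the rest pending.
                return pending + list(posts[index + 1:])
            continue
        failures = 0
        save(post, result)
    return pending


def run_until_done(posts, scan, save, wait_seconds=3600, max_rounds=48,
                   sleep=time.sleep):
    """Retry pending posts once per wait interval; return what never succeeded."""
    pending = list(posts)
    for _ in range(max_rounds):
        pending = run_batch(pending, scan, save)
        if not pending:
            break
        sleep(wait_seconds)
    return pending
```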
Final numbers after the full run: 7,208 detections, 7,133 unique sentences, 2,493 rated strong.
15 of the 622 posts came back with zero detections, so it does not just flag everything.
The category distribution put “conditional clause landing on a generalized conclusion” alone at 23%, in first place.
That pattern had been added to the definitions only days earlier, after I flagged it to Claude — and the numbers now show it was Claude’s single most common writing habit across the whole archive.
“English jargon left untranslated” follows at 17%, “closing on an evaluative verb” at 12%, and the top five patterns cover 67%.
The human class: my old WordPress diary
The classifier also needs sentences that are certain to be human-written.
A SQL dump of the WordPress diary I kept from 2011 to 2018 was still around.
Parsing the phpMyAdmin-style INSERT statements produced 23 posts, about 20,000 characters, roughly 600 sentences.
All of it predates Claude, so the human class has zero risk of contamination.
flowchart TD
A[Conversation JSONL<br/>2 machines] -->|Edit pair extraction| P[Before/after pairs 1,560]
B[622 published posts] -->|Sonnet batch scan| H[Machine-labeled slop candidates]
C[WordPress diary SQL] -->|INSERT parsing| N[Human sentences ~600]
D[Verified corpus] --> V[Type-labeled examples 21]
P --> T[Training dataset]
H --> T
N --> T
V --> T
One caveat.
The WP diary is all diary-register writing, while the slop sentences are all technical-register. Paired as-is, there is no telling whether the classifier learned “slop vs natural” or “diary vs tech article”.
To cancel the genre skew, the natural class also gets post-fix sentences from published tech posts: human-approved technical-register text.
Sorting 1,560 edit pairs into style vs content
The Edit pairs mix style fixes with content fixes (fact corrections, link additions).
Only style fixes are usable for training, so Sonnet sorted them in batches of 40 into three bins.
| Verdict | Count | Use |
|---|---|---|
| style (wording only) | 805 | old side as slop, new side as natural |
| content | 629 | not used |
| mixed | 126 | not used |
Style fixes came out to more than half.
The sorting ran as 39 batches of 40 and finished in under 40 minutes.
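The batch count follows directly from the pair count. A small sketch, where `batches` is a hypothetical helper and the integer list stands in for the real Edit pairs:

```python
def batches(items, size=40):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


pairs = list(range(1560))  # stand-in for the 1,560 Edit pairs
batch_count = sum(1 for _ in batches(pairs))
print(batch_count)  # 39
```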
Assembling the dataset
All of the material above became a sentence-level binary dataset (slop / natural).
| Split | Size | Contents |
|---|---|---|
| train | 3,782 sentences (1,664 slop / 2,118 natural) | style-pair old side 899 + scan strong 765 as slop; new side 797 + clean article sentences 866 + WP diary 455 as natural |
| val | 148 sentences | split by article, so no article leaks across train and eval |
| gold | 76 sentences | 26 human-confirmed misses + 50 held-out WP diary sentences. Never used in training |
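An article-level split like the one used for val can be sketched as follows. The row shape `(article_id, sentence, label)`, the `val_fraction` default, and the seed are assumptions for illustration, not the original builder's values.

```python
import random


def split_by_article(rows, val_fraction=0.05, seed=0):
    """Split (article_id, sentence, label) rows so no article spans both sets."""
    articles = sorted({article_id for article_id, _, _ in rows})
    random.Random(seed).shuffle(articles)
    n_val = max(1, round(len(articles) * val_fraction))
    val_articles = set(articles[:n_val])
    train = [row for row in rows if row[0] not in val_articles]
    val = [row for row in rows if row[0] in val_articles]
    return train, val
```

Splitting on article IDs rather than on sentences keeps sentences from one post, which share vocabulary and topic, from sitting on both sides of the evaluation.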
Two design decisions.
The 1,586 weak scan hits stay out of the first run: the scan agents are instructed to prefer false positives over misses, so training on the weak hits would teach the model those false positives too.
The gold set keeps the verified corpus (misses I flagged and categorized myself) out of training entirely. Evaluation built the same way as the training data cannot show whether the model works on new sentences, so a separate set built purely from human judgment sits beside it.
The model: ModernBERT-ja
The default choice for Japanese BERT is Tohoku BERT, but I picked SB Intuitions’ ModernBERT-ja-130m.
Its training corpus is both recent and large (4T+ tokens), so it has already seen the era of tech prose it will judge.
It also skips the MeCab-family tokenizer install, and published Japanese classification benchmarks put it ahead of same-size Tohoku BERT.
If it fails, the fallback is Tohoku v3.
Training runs on my main work machine, a Mac mini.
| Item | Value |
|---|---|
| Machine | Mac mini (Apple M4, 16GB unified memory) |
| OS | macOS 26.3 |
| Python | 3.14.4 (venv) |
| Stack | torch 2.12.1 (MPS) + transformers 5.13.0 |
| Model | sbintuitions/modernbert-ja-130m |
Training is fully local, independent of the Sonnet batch scan.
First training run: perfect on one side only
Three epochs took 418 seconds.
| Evaluation | Result |
|---|---|
| val (148 sentences) | accuracy 76.4% / F1 0.762 |
| gold false positives on 50 human sentences | 0 (precision 1.0) |
| gold detection of 26 confirmed misses | 11 (recall 42%) |
The result skewed hard: zero false positives, but fewer than half of the confirmed misses detected.
Since it never marks human text as slop, it can sit in front of the pipeline without breaking existing sentences.
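The gold figures in the table reduce to simple counts. A short sketch of the arithmetic, with `precision_recall` as an illustrative helper:

```python
def precision_recall(true_pos, false_pos, false_neg):
    """Compute precision and recall from raw counts, guarding empty denominators."""
    flagged = true_pos + false_pos
    actual = true_pos + false_neg
    precision = true_pos / flagged if flagged else 0.0
    recall = true_pos / actual if actual else 0.0
    return precision, recall


# Gold counts: 11 of 26 confirmed misses detected, 0 of 50 human sentences flagged.
precision, recall = precision_recall(true_pos=11, false_pos=0, false_neg=15)
print(f"precision={precision:.2f} recall={recall:.2f}")  # precision=1.00 recall=0.42
```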
The 15 misses were almost all “conditional clause landing on a generalized conclusion” and the “also check X.” inspection-imperative type.
In other words, this classifier missed exactly the same types Sonnet’s scans had been missing all along.
Whether a sentence of these two types reads as slop depends heavily on whether it sits at the end of a paragraph.
“Also check the npm dependency tree.” is a normal sentence in the middle of a walkthrough, and an inspection imperative when it stands alone at a paragraph’s end.
Feeding the model isolated sentences drops that positional information. The classifier never received the evidence the judgment requires.
Two smaller lessons.
A regex mistake in the dataset builder let 2 bad rows into gold (a template example line and a multi-word summary entry); with those removed, effective recall is 11/24, 46%.
And the best val score came at epoch 1 of 3 — epochs and learning rate were still completely untuned.
v2 hit out-of-memory three times
With all 622 posts scanned, v2 retrained on the full data with position tags and preceding sentences (train 6,530 sentences).
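One way such inputs could be built is sketched below. The tag strings, the `[SEP]` separator, and the function name are all hypothetical; the post does not give the exact format v2 used.

```python
def build_inputs(paragraph_sentences, sep="[SEP]"):
    """Turn one paragraph's sentences into tagged, context-carrying inputs.

    Each input carries a position tag, the preceding sentence (if any),
    and the target sentence. Tags and separator are illustrative only.
    """
    inputs = []
    last = len(paragraph_sentences) - 1
    for i, sentence in enumerate(paragraph_sentences):
        if i == last:
            tag = "[PARA_END]"
        elif i == 0:
            tag = "[PARA_START]"
        else:
            tag = "[PARA_MID]"
        parts = [tag]
        if i > 0:
            parts += [paragraph_sentences[i - 1], sep]
        parts.append(sentence)
        inputs.append(" ".join(parts))
    return inputs
```

With this shape, the same sentence yields a different input string depending on whether it closes the paragraph, which is the signal the bare-sentence v1 inputs lacked.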
16GB on the M4 Mac mini was not enough.
| Attempt | Config | Result |
|---|---|---|
| 1 | batch 16, 256 tokens | OOM (training 7.1GB + other processes 12.9GB) |
| 2 | batch 8 × grad accum 2 | OOM (training 5.3GB + other 14.9GB) |
| 3 | batch 4 × accum 4, 192 tokens | OOM (training 3.5GB + other 16.4GB) |
| 4 | + PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0 | completed |
Each time the training footprint shrank, memory use by the browser and other processes grew and the margin under the 20GB cap kept shrinking.
The ceiling this time was set not by model size but by whatever else the machine was running.
Removing torch’s watermark cap got it through: actual training demand was under 4GB, so spilling into swap did no harm.
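A sketch of the attempt-4 setup, assuming the variable is set from Python rather than the shell. Setting it before `torch` is imported is the safe ordering. The variable names for the batch settings are illustrative.

```python
import os

# 0.0 removes the upper limit on MPS allocations. PyTorch documents that this
# can destabilize the system if memory truly runs out, so use it only when
# the real training footprint is known to be small.
os.environ["PYTORCH_MPS_HIGH_WATERMARK_RATIO"] = "0.0"

# Settings from attempt 3 in the table, carried over to attempt 4.
per_device_batch_size = 4
gradient_accumulation_steps = 4
max_length = 192

# Gradient accumulation keeps the effective batch at the original 16.
effective_batch_size = per_device_batch_size * gradient_accumulation_steps
print(effective_batch_size)  # 16
```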
One more snag along the way: exactly one post out of 622 failed the scan on every retry.
It was a write-up of an npm worm, and Sonnet’s cybersecurity safeguards were reacting to the malware-technique descriptions.
It is my own published article, but a headless one-shot call carries no context that the author is style-checking their own text. To the safeguard it looks like a malware-analysis request. That one post was marked skipped.
v2 results
| Metric | v1 (bare sentences) | v2 (position tags + context + full data) |
|---|---|---|
| val accuracy / F1 | 76.4% / 0.762 | 84.7% / 0.841 |
| gold FP on 50 human sentences | 0 | 0 |
| gold recall on confirmed misses | 42% | 52% |
val improved by 8+ points and kept climbing each epoch (v1 had peaked at epoch 1).
False positives on human text stayed at zero on the full dataset.
gold recall stopped at 52% — but this evaluation runs under one condition that training does not.
The gold entries record only the flagged sentence itself, with no surrounding context, so gold is the one evaluation that runs context-free. For a model trained with preceding sentences that is a handicap, and the number likely understates real performance on full articles.
Adding a context field to the corpus entry format removes the distortion from here on.
The remaining 12 misses were again the semantic core of “conditional clause landing on a generalized conclusion” plus evaluative flat assertions.
Catching “The indicators JFrog lists are direct.” without any syntactic cue requires judging that the sentence is an evaluative summary — a semantic call. That is the edge of what a 130M binary classifier reaches unaided; the two paths forward are scaling to 310M and adding the 1,600 verified weak detections.
v3: removing label noise
With scores stuck in the 80s, I went looking at label quality rather than model capacity.
One problem surfaced immediately. The old side of each Edit pair had been labeled slop wholesale, but multi-sentence excerpts contain sentences that were never edited. Any sentence present on both sides carried a slop label it did not deserve.
v3 labels only old-side-exclusive sentences as slop and new-side-exclusive ones as natural, drops everything appearing on both sides, and extends training from 3 to 5 epochs.
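The relabeling rule can be sketched as a set difference over sentences. The splitter here is a naive assumption (Japanese sentence enders and newlines only), and both function names are hypothetical.

```python
import re


def split_sentences(text):
    """Naive splitter on Japanese sentence enders and newlines (an assumption)."""
    return [s.strip() for s in re.findall(r"[^。！？\n]+[。！？]?", text) if s.strip()]


def relabel_pair(old_text, new_text):
    """Return (slop, natural) sentences for one Edit pair.

    Sentences only on the old side are slop, sentences only on the new side
    are natural, and sentences present on both sides are dropped.
    """
    old_sentences = split_sentences(old_text)
    new_sentences = split_sentences(new_text)
    old_set, new_set = set(old_sentences), set(new_sentences)
    slop = [s for s in old_sentences if s not in new_set]
    natural = [s for s in new_sentences if s not in old_set]
    return slop, natural
```

For a pair where only the second sentence was rewritten, the untouched first sentence lands in neither list, which is exactly the label noise v3 removes.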
The first v3 run died somewhere unrelated to training: 2.6GB of free disk, and neither per-epoch checkpoints nor Metal shader caches could be written.
After the 16GB memory ceiling came the 228GB disk. Clearing 50+GB got the run through.
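A pre-flight check would have caught this before the run started. A minimal sketch using the standard library; the 20 GB threshold and the function name are assumptions, not values from the original run.

```python
import shutil


def has_free_disk(path=".", min_free_gb=20.0):
    """Return True when the volume holding `path` has at least `min_free_gb` free."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= min_free_gb


if not has_free_disk():
    print("Low disk space: checkpoints and shader caches may fail to write.")
```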
| Metric | v1 | v2 | v3 |
|---|---|---|---|
| val accuracy / F1 | 76.4% / 0.762 | 84.7% / 0.841 | 85.5% / 0.851 |
| gold FP on 50 human sentences | 0 | 0 | 0 |
| gold recall on confirmed misses | 42% | 52% | 60% |
gold recall stepped up 42% → 52% → 60%, with human-text false positives at zero across all three generations.
Evaluative flat assertions that v2 missed — like “The indicators JFrog lists are direct.” — v3 now catches.
The 10 remaining misses concentrate where “conditional clause landing on a generalized conclusion” demands real semantic judgment.
Catching “Builds run constantly during development, so using the build environment as a trigger fires in both CI and local setups at high frequency.” requires reading that the sentence ends in general-law register rather than a specific fact. The 130M binary classifier cannot make that call.
The encoder chapter ends with a front-stage filter that catches 60% of confirmed misses with zero false positives, on 7 minutes of training and millisecond inference.
The remaining 40% — the semantic judgments — go to the next experiment: fine-tuning a small LLM. The 805 style pairs double as slop-to-natural rewrite examples, so the next model will propose fixes, not just flag sentences.
The data collected
| Data | Volume | Class | Status |
|---|---|---|---|
| Edit pairs from conversation logs | style 805 / content 629 / mixed 126 | style old = slop, new = natural | sorted by Sonnet |
| Batch scan of published posts | 622 posts, 7,208 detections | 2,493 strong used as slop | machine-labeled |
| WordPress diary | ~600 sentences | natural (human) | confirmed |
| Verified corpus | 28 entries | gold test only | confirmed with type labels |
| Style feedback moments | 36 | labeling reference | confirmed |