Qwen3.7 Max vs Plus on a Japanese novel: few fixes, a name misread as a typo
Contents
I tested whether Qwen3.7 Max and Plus can proofread a Japanese novel. What I wanted to check isn’t the ability to fix typos, but whether they avoid flattening dialect and voice into “errors.” Kyoto dialect, a character’s way of speaking, proper nouns, older spellings, spots where a ruby reading leaked into the body text — does the model smooth all of that into standard Japanese?
The text I used is a novel I borrowed, after some pestering, from its author @vanmadoy, so I can’t post the full text. What I show here is only the short detected spans and the judgments. If you’re curious, you can read the whole thing on Kakuyomu: 京都市民限定で求人が出ているとあるバイトについて. Thanks for lending it.
A public API, but the weights are closed
Qwen3.7 Max and Plus are a commercial API with closed weights, served from Alibaba Cloud’s Model Studio (formerly DashScope). You call the model names qwen3.7-max / qwen3.7-plus from an OpenAI-compatible endpoint, billed per token.
Pricing, per million tokens:
| Model | Input | Output |
|---|---|---|
| Qwen3.7 Max | $2.50 | $7.50 |
| Qwen3.7 Plus | $0.32 | $1.28 |
One proofreading request measured 158 input tokens and 67 output tokens (thinking off). Running all 139 candidates from the ~2,900-character text through both models costs about $0.13 on Max and $0.02 on Plus — roughly 20 cents total. Turning thinking mode on multiplies the output, so it goes up.
Measuring the ability NOT to fix
When you use a model as a proofreader, the ability not to fix comes before the ability to fix correctly. In a novel, expressions that fall outside standard Japanese carry information about the character and the scene. Flatten Kyoto dialect into standard Japanese, swap a name for a common word, or treat leaked ruby as a typo, and the text breaks.
A binary FIX/KEEP forces dialect, proper nouns, and ruby all into the “don’t fix” bucket together. So I split the judgment into four classes.
| Label | Meaning |
|---|---|
| FIX | A clear typo, fine to correct |
| KEEP | Keep as dialect, voice, proper noun, or author’s spelling |
| RUBY | Possible leaked ruby or ruby-style notation; send to a human |
| ESCALATE | Undecidable; send to a human |
The candidate spans come from a BERT scan that picks up low-probability tokens. I applied the same idea as my earlier BERT+Qwen OCR correction pipeline, but to novel text instead of OCR. Back then a small local model went past typos and “corrected” things it shouldn’t, so this time I had the flagship models judge with the four classes.
What BERT picks up is “rare words”
The text produced 139 candidate spans. They read less like typo candidates and more like a list of words that are rare to a BERT trained on general writing.
| span | BERT’s top candidate | What it is |
|---|---|---|
とる in やっとる | てる | Reading Kyoto dialect toward standard Japanese |
あん (あんねん) | ある | Reacting to a Kyoto-dialect ending |
カタログ in webカタログ | で | Loanword-laced shop vocabulary |
悪 in 悪筆 | 鉛 | A fun association, but a risky candidate |
情 in 情なさけ | あん | A spot where ruby leaked into the body text |
崇 in 崇しゅう | あい | Ruby-style notation |
As a detector it works. But routing the candidates straight to correction mixes dialect, proper nouns, ruby, and old spellings all into “suspicious.” So a stage that labels each one is needed.
Max and Plus barely fix anything
I had Max and Plus judge the same 139 candidates with the four classes. Conditions: temperature=0, thinking off.
| Label | Max | Plus |
|---|---|---|
| KEEP | 132 | 127 |
| FIX | 3 | 6 |
| RUBY | 3 | 3 |
| ESCALATE | 1 | 3 |
Both barely fix anything. Of the 139, Max kept 132 and Plus kept 127. Kyoto dialect, names, old spellings — all left as is. That’s the clear opposite of the small model that corrected past typos in my earlier post.
RUBY also matched at three each for both models. Spots like 情なさけ and 崇しゅう, where ruby leaked into the body text, get sorted into RUBY instead of being knocked into “typo.” The distinction that a binary FIX/KEEP crushed survives with four classes.
The 10 cases where Max and Plus split
More than the totals, the contents of the 10 cases where the judgments split show the difference.
Closing quote marks
The most common was the closing 」 at the end of dialogue. BERT predicts a period 。 at the 」 position with high confidence. Plus marked all four of these FIX, judging they should become 。」. Max kept all four, reasoning that the closing quote of dialogue is correct and BERT’s prediction is the error.
| End of dialogue | Max | Plus |
|---|---|---|
| …発送するだけ」 | KEEP | change to 。」 |
| …更新しとくもんやね」 | KEEP | change to 。」 |
| …勉強してたよ」 | KEEP | change to 。」 |
| …負けといたる」 | KEEP | change to 。」 |
Japanese novels generally don’t put a period before the closing quote, so Max held the format as is. Plus mechanically applied the period-then-bracket rule.
Where Max goes too far
Conversely, there are cases where Max over-corrects. For 報告だと, Max called it contextually unnatural and FIXed it to 報告だが. Plus kept 〜の報告だと as a valid reportive expression, and here Plus is right. For a sentence-final ?, in a spot with almost no surrounding text, Max FIXed it to 。 on BERT’s confidence. Plus couldn’t get the context and ESCALATEd as undecidable. When context is thin, Plus errs on the safe side.
Proper nouns and dialect
For 五回生 (Kansai universities use it for a fifth-year student), Max kept it as correct, consistent with the surrounding Kansai dialect, while Plus ESCALATEd it as unclear. For 顔を顰める, Max kept it as an idiom, while Plus FIXed it as a misuse of 眉を顰める. Neither is clearly right or wrong, but Max commits while Plus sends ambiguous cases to a human. ESCALATE was 1 for Max versus 3 for Plus — the same tendency, less a matter of which model is better than a difference in editorial temperament.
Turning on thinking mode changes the behavior
Up to here I judged with thinking off (enable_thinking: false). I ran the same 139 candidates again with thinking on.
| Setting | FIX | KEEP | RUBY | ESCALATE |
|---|---|---|---|---|
| Max off | 3 | 132 | 3 | 1 |
| Max on | 8 | 125 | 2 | 4 |
| Plus off | 6 | 127 | 3 | 3 |
| Plus on | 4 | 129 | 2 | 4 |
For the same “turn on thinking” operation, Max and Plus moved in opposite directions. Max’s FIX rose from 3 to 8 and it became more interventionist; Plus’s dropped from 6 to 4.
| span | Max off→on | Plus off→on |
|---|---|---|
」 (end of dialogue) | KEEP → KEEP | FIX → KEEP |
報告だと | FIX → KEEP | KEEP → KEEP |
あん (あんねん) | KEEP → FIX | KEEP → KEEP |
顔を顰める | KEEP → FIX | FIX → FIX |
Thinking erased Plus’s closing-quote error. The four 」 it had FIXed with thinking off all returned to KEEP with thinking on, and the reasoning changed to “correct as a dialogue closing quote.” Plus’s one systematic error withdrew itself during the thinking.
Max went the other way. While it corrected the 報告だと and ? it had wrongly FIXed (back to KEEP and ESCALATE), it started FIXing things it had kept with thinking off — あん (Kyoto dialect) as “a typo for あるねん,” 顔を顰める as “眉 is correct.” In exchange for fixing its hasty calls, it reaches into dialect and idioms. Plus, too, FIXed 五回生 as “a typo for 五年生” with thinking on; the longer it thinks, the more it leans toward standard Japanese in places.
The cost of thinking shows up as time.
| Model | Mean | Median | Max |
|---|---|---|---|
| Max on | 28.7s | 20.1s | 165.2s |
| Plus on | 42.2s | 25.3s | 1478.8s |
The medians are Max 20s and Plus 25s, but Plus had one candidate it kept thinking about for roughly 24 minutes (1478s), and the 139 took about two hours. With thinking off it’s around 2 seconds per item, so the added cost is entirely time.
One more thing: with thinking on, text that looks like leftover reasoning leaked after the judgment JSON, and strict JSON parsing failed once for each model. Switching to a lenient parser that grabs only the first JSON object with a regex recovered both as KEEP. If you combine forced JSON with thinking mode, the parser side has to be loose or you drop results.
Throwing the text in without BERT
Up to here I extracted 139 candidates with BERT and judged them one at a time. For comparison, I passed the body text straight to Max and Plus without BERT and asked only for typos.
Direct injection flags almost nothing. Max came up with 2, Plus with 1. That’s a different scale of inspection from BERT picking up 139 rare words and judging each.
| Model | Typos flagged by direct injection |
|---|---|
| Max | 情なさけ→情け, 崇しゅう→祟しゅう |
| Plus | 崇しゅう→祟しゅう |
The striking one is 崇しゅう → 祟しゅう, which both models flagged. In the text the protagonist is called 祟 six times in dialogue, but the body text has 崇しゅう just once. Both models picked up the difference in the character and judged that, since everywhere else is 祟, the 崇 must be a typo. The span-level judgment only takes the surrounding sentence, so there it stopped at RUBY.
But this is likely a misjudgment. Reading the surrounding text, 崇しゅう reads as a different person from the protagonist 祟 — same reading, different kanji. The AI wasn’t given this novel’s setting, so it jumped to “typo” for an intentional distinction. Noticing the difference in the character at all is a level of detail a human would skim past; it missed because it had no material to separate typo from intent.
Direct injection also loses nuance. 情なさけ (leaked ruby), which the span judgment sorted into RUBY, Max in direct injection simply FIXed to 情け. There’s no four-class scheme; it comes down to a binary fix-or-not.
Narrow the range you let it read and a different kind of breakage appears. Given only the first half of the text, Max treated the protagonist’s name 祟 as a typo and tried to “fix” it to 僕. Given the full text it self-corrects. The proper-noun calls waver with how much context you hand over.
In the end, having an AI proofread a novel takes more than the body text. Without the characters’ names and readings and the world’s setting handed over together, you can’t separate whether 崇 and 祟 are different people or one typo. The AI even catches a character difference a human would glide past, but whether that’s an error or deliberate craft can only be decided by someone who knows the setting.