Automated OCR Error Detection and Correction with Encoder Models + Local LLM
Background
I’d been using local LLMs like Qwen and Swallow to correct OCR text. They work to a degree, but each has its own failure mode: Qwen fills in gaps from surface context without really understanding it, and Swallow freely rewrites the original text.
So I tried an encoder-based approach using small models (LUKE/BERT) instead of generative AI (decoder-based). Encoder models look at context from both sides of a passage when filling in blanks — well-suited for “inferring missing characters without breaking the original text.”
Kyoto University’s WikipediaAnnotatedCorpus is a ~9,000-article corpus annotated with morphological analysis, syntactic parsing, case analysis, and coreference analysis. Fine-tuning LUKE on this could improve accuracy for Japanese particle consistency and implicit subject completion.
Experiment Environment
| Item | Spec |
|---|---|
| GPU | NVIDIA GeForce RTX 4060 Laptop (VRAM 8GB) |
| Main memory | 32GB |
| OS | Windows 11 (working in WSL2 Ubuntu 22.04) |
| Python | 3.10.12 (WSL2 side) |
| CUDA | 12.9 |
Checking the Environment
First, check the Windows side:
$ nvidia-smi
NVIDIA-SMI 576.02 Driver Version: 576.02 CUDA Version: 12.9
RTX 4060 Laptop GPU | 0MiB / 8188MiB | 0%
$ python --version
Python 3.14.0
$ wsl --list
Ubuntu-22.04 (Default)
Windows Python 3.14 isn’t supported by PyTorch yet, so I work on the WSL2 Ubuntu 22.04 side.
Checking inside WSL2:
$ python3 --version
Python 3.10.12
$ nvidia-smi
NVIDIA-SMI 575.51.02 Driver Version: 576.02 CUDA Version: 12.9
GPU visible from WSL2. Python 3.10 works with PyTorch. pip3 and venv weren’t installed, so add them.
WSL2 Setup
# Install pip3, venv
sudo apt update && sudo apt install -y python3-pip python3-venv
# Create virtual environment
python3 -m venv ~/luke-ocr
source ~/luke-ocr/bin/activate
# PyTorch (CUDA 12.4 compatible)
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu124
# transformers 4.x (5.x has a bug with MLukeTokenizer)
pip install 'transformers>=4.40,<5' sentencepiece protobuf tiktoken
I initially installed transformers 5.2.0, but MLukeTokenizer initialization threw TypeError: argument 'vocab': 'dict' object cannot be converted to 'Sequence'. Downgrading to 4.57.6 resolved it.
Post-install verification:
import torch
print(torch.__version__) # 2.6.0+cu124
print(torch.cuda.is_available()) # True
print(torch.cuda.get_device_name(0)) # NVIDIA GeForce RTX 4060 Laptop GPU
import transformers
print(transformers.__version__) # 4.57.6
LUKE Fill-mask Baseline Test
First, check fill-mask accuracy with plain LUKE before any fine-tuning. Using studio-ousia/luke-japanese-base-lite. VRAM usage is just 512MB.
from transformers import MLukeTokenizer, LukeForMaskedLM
import torch

model_name = "studio-ousia/luke-japanese-base-lite"
tokenizer = MLukeTokenizer.from_pretrained(model_name)
model = LukeForMaskedLM.from_pretrained(model_name).to("cuda")

# Score candidates for the <mask> position
inputs = tokenizer("東京は日本の<mask>である。", return_tensors="pt").to("cuda")
with torch.no_grad():
    logits = model(**inputs).logits
mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero()[0]
print(torch.softmax(logits[0, mask_pos], dim=-1).topk(5))
Basic Tests
| Input | Top1 prediction | Confidence | Verdict |
|---|---|---|---|
| 織田信長は京都の本能寺で<mask>した。 | 死去 (died) | 32.0% | Close (“自害/suicide” would be ideal) |
| 東京は日本の<mask>である。 | 首都 (capital) | 60.1% | Correct |
| 太郎はパンを<mask>べた。 | た | 64.7% | Correct (“食”べた / ate) |
| 彼女は大学で<mask>を学んでいる。 | 数学 (mathematics) | 7.6% | Plausible (low confidence) |
OCR Defect Simulation
Testing patterns common in OCR errors — missing particles, inflectional suffixes, and proper nouns:
| Pattern | Input | Answer | Top1 | Confidence | Hit |
|---|---|---|---|---|---|
| Particle | 日本語<mask>文法は複雑である。 | の | : | 16.0% | x (の at rank 2: 10.2%) |
| Particle | 彼<mask>東京に住んでいる。 | は | もまた | 11.4% | x (は at rank 2: 10.0%) |
| Okurigana | 彼は会議に出<mask>した。 | 席 | 発 | 21.6% | x (席 at rank 2: 14.8%) |
| Okurigana | この問題を解<mask>するのは難しい。 | 決 | こうと | 20.0% | x |
| Proper noun | <mask>島県は四国にある。 | 徳 | - | 5.4% | x |
| Proper noun | ノーベル<mask>学賞を受賞した。 | 物理/etc. | 生理 | 34.1% | △ (生理学賞 is also valid) |
| Context | 彼は毎朝6時に<mask>きる。 | 起 | 起きて | 5.6% | △ (token boundary difference) |
| Context | この薬は食後に<mask>んでください。 | 飲 | 飲 | 84.6% | o |
Baseline Observations
- Strong context (medicine → 飲む/drink) achieves high accuracy without fine-tuning
- Particle inference gets the direction right but often fails to rank top
- Proper nouns (place names) are weak — single-character defects at sentence start are especially poor
- Mid-compound defects like “解決する” have poor compatibility with subword tokenization
The interesting question is how fine-tuning on the Kyoto University corpus changes accuracy for particles and case relations.
Preparing the Kyoto University Corpus
Clone WikipediaAnnotatedCorpus and examine the data:
git clone https://github.com/ku-nlp/WikipediaAnnotatedCorpus.git
Directory structure:
- knp/: KNP-format annotated data (3,979 files)
- org/: Raw text
- id/: Train/dev/test splits (train 3,679, dev 100, test 200)
Inside KNP Format
Opening the Ashikaga Takauji Wikipedia article (wiki00010002.knp):
# S-ID:wiki00010002-00-01
* 3D
+ 1D
足利 あしかが 足利 名詞 6 人名 5 * 0 * 0 NIL <NE:PERSON:head>
+ 8D <NE:PERSON:足利 尊氏>
尊氏 たかうじ 尊氏 名詞 6 人名 5 * 0 * 0 NIL <NE:PERSON:tail>
は は は 助詞 9 副助詞 2 * 0 * 0 NIL
...
+ -1D <rel type="ガ" target="尊氏" .../>
武将 ぶしょう 武将 名詞 6 普通名詞 1 * 0 * 0 NIL
Each line contains a morpheme (surface form, reading, base form, part of speech). + lines contain case analysis (nominative, accusative, dative…). <NE:...> marks named entity tags.
Parsing with rhoknp
Using the rhoknp library recommended in the README (pyknp is outdated):
pip install rhoknp
from rhoknp import Document

with open("knp/wiki0001/wiki00010002.knp") as f:
    doc = Document.from_knp(f.read())

for sent in doc.sentences:
    text = "".join(m.text for m in sent.morphemes)
    print(text)
    # => 足利 尊氏は、鎌倉時代末期から室町時代前期の武将。
Case analysis is also accessible:
Sentence: 足利 尊氏は、鎌倉時代末期から室町時代前期の武将。
[末期から] --ノ--> 時代
[武将。] --ガ--> 尊氏
[武将。] --カラ--> 末期
Text Extraction
Extracted plain text from all KNP files, filtering out short sentences under 10 characters (like parenthetical readings):
| Split | Sentence count |
|---|---|
| train | 10,243 |
| dev | 312 |
| test | 545 |
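For reference, the extraction-and-filter step can be sketched like this, reusing the rhoknp access pattern from the parsing section (the directory walk and the 10-character threshold mirror the description above; treat it as a sketch, not the exact script used):

```python
from pathlib import Path

def extract_sentences(knp_dir):
    """Walk a WikipediaAnnotatedCorpus knp/ directory and yield plain-text
    sentences reassembled from morphemes."""
    from rhoknp import Document  # lazy import: only needed for extraction
    for path in sorted(Path(knp_dir).rglob("*.knp")):
        doc = Document.from_knp(path.read_text())
        for sent in doc.sentences:
            yield "".join(m.text for m in sent.morphemes)

def filter_sentences(sentences, min_len=10):
    """Drop fragments under min_len characters (parenthetical readings etc.)."""
    return [s for s in sentences if len(s) >= min_len]
```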
Fine-tuning
Training Configuration
BATCH_SIZE = 8
EPOCHS = 3
LR = 5e-5
MAX_LEN = 128
MASK_PROB = 0.15 # Randomly mask 15% of tokens
Standard MLM (Masked Language Modeling): randomly replace 15% of input tokens with <mask> and predict the originals. Of those, 80% are masked, 10% are random tokens, 10% are kept as-is.
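The 80/10/10 rule can be sketched with the stdlib alone (a minimal version of what transformers’ DataCollatorForLanguageModeling does internally; the token IDs and mask ID below are arbitrary illustration values):

```python
import random

def mlm_mask(token_ids, mask_id, vocab_size, mask_prob=0.15, seed=None):
    """BERT-style MLM corruption: pick ~mask_prob of positions, then
    80% -> mask token, 10% -> random token, 10% -> left unchanged.
    labels is -100 at unselected positions (the transformers convention
    for positions the loss ignores)."""
    rng = random.Random(seed)
    corrupted = list(token_ids)
    labels = [-100] * len(token_ids)
    for i, tid in enumerate(token_ids):
        if rng.random() >= mask_prob:
            continue                      # not selected for prediction
        labels[i] = tid                   # model must recover this token
        r = rng.random()
        if r < 0.8:
            corrupted[i] = mask_id        # 80%: replace with <mask>
        elif r < 0.9:
            corrupted[i] = rng.randrange(vocab_size)  # 10%: random token
        # remaining 10%: keep the original token as-is
    return corrupted, labels

corrupted, labels = mlm_mask([5, 42, 7, 99, 13], mask_id=4, vocab_size=1000, seed=0)
```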
Training Results
| Epoch | Train Loss | Dev Loss | Time | VRAM Peak |
|---|---|---|---|---|
| 1 | 2.050 | 1.943 | 203s | 2,677 MB |
| 2 | 1.919 | 1.933 | 200s | 2,677 MB |
| 3 | 1.692 | 1.853 | 202s | 2,677 MB |
3 epochs total in ~10 minutes. Only 2.7GB of the 8GB VRAM used — room to increase batch size to 16 or 32.
Baseline vs. Fine-tuned Comparison
Same test sentences before and after fine-tuning. “o” if the correct answer is in the Top 3:
| Test | Answer | Baseline Top1 | FT Top1 | Improved |
|---|---|---|---|---|
| 本能寺で<mask>した | 死去 | 死去(32.0%) | 修行(20.7%) | x Worse |
| 日本の<mask>である | 首都 | 首都(60.1%) | 首都(85.6%) | o Large gain |
| パンを<mask>べた | た | た(64.7%) | た(69.9%) | o Slight gain |
| 大学で<mask>を学ぶ | 数学/etc. | 数学(7.6%) | 数学(16.5%) | o Improved |
| 日本語<mask>文法は | の | :(16.0%) | では(13.7%) | x No change |
| 彼<mask>東京に | は | もまた(11.4%) | 自身は(38.2%) | x Worse |
| 出<mask>した | 席 | 発(21.6%) | 席(82.6%) | o Large gain |
| 解<mask>するのは | 決 | こうと(20.0%) | こうと(25.2%) | x No change |
| <mask>島県は四国 | 徳 | -(5.4%) | (24.3%) | x No change |
| ノーベル<mask>学賞 | 物理/etc. | 生理(34.1%) | 生理(75.5%) | △ Confidence up |
| 6時に<mask>きる | 起 | 起きて(5.6%) | 起(27.8%) | o Large gain |
| 食後に<mask>んで | 飲 | 飲(84.6%) | 飲(79.0%) | o Maintained |
Observations
Improved:
- “出席した”: 14.8% → 82.6% (moved to Top1)
- “首都”: 60.1% → 85.6% (confidence up)
- “起きる”: Outside Top3 → 27.8% at Top1 (large gain)
- “数学”: 7.6% → 16.5% (confidence doubled)
Degraded:
- “死去した”: 32.0% → 5.5%, “修行” moved to Top1
- Single-particle inference still weak
No change:
- Single-character defects in proper nouns (“徳島県”) still fail
- Mid-compound defects (“解決する”) also no improvement
Since the Kyoto corpus is Wikipedia encyclopedic text, encyclopedic fill-ins (“首都”, “出席”) improved substantially. Standalone particles and partial proper noun defects need a different approach — character-level models or OCR-specific training data.
The Wikipedia-Feeding-Wikipedia Problem
What I realized at this point: LUKE’s pre-training data is Wikipedia + BookCorpus, and the Kyoto corpus is also extracted from ~4,000 Wikipedia articles. So I was feeding LUKE a domain it had already learned, without using the annotations.
This training only used raw text from the corpus — the case analysis (nominative, accusative, dative relations), coreference analysis (pronoun referents), and named entity tags were all unused. The real value of the Kyoto corpus is those annotations, so getting proper value requires multi-task learning for case label prediction or named entity classification.
Directions for better results:
- Design tasks that actually use the annotations
- Train on non-Wikipedia domains (historical documents, government documents, etc.)
- Switch to a base model trained on non-Wikipedia data
Comparison with Tohoku University BERT v3
Also testing cl-tohoku/bert-base-japanese-v3 — same Wikipedia-based pre-training as LUKE, but different architecture and tokenizer:
- LUKE: SentencePiece tokenizer (subword splitting)
- Tohoku BERT: MeCab + WordPiece (morpheme-based splitting)
BERT v3 has the same Wikipedia overlap problem since it was pre-trained on Wikipedia + CC-100. But the tokenizer difference may affect results.
Additionally need fugashi (MeCab Python binding) and unidic-lite:
pip install fugashi unidic-lite
Training Results
| Epoch | Train Loss | Dev Loss | Time | VRAM Peak |
|---|---|---|---|---|
| 1 | 1.503 | 1.295 | 201s | 2,417 MB |
| 2 | 1.370 | 1.395 | 198s | 2,417 MB |
| 3 | 1.228 | 1.419 | 198s | 2,417 MB |
Overall lower loss than LUKE. However dev loss slightly increased from Epoch 2→3, showing a mild overfitting tendency.
Four-Model Comparison
Comparing all four patterns: LUKE (baseline / FT) and Tohoku BERT (baseline / FT):
| Test | Answer | LUKE raw | LUKE FT | BERT raw | BERT FT |
|---|---|---|---|---|---|
| 本能寺で○した | 死去 | 死去(32%) | 修行(21%) | 自害(46%) | 自害(34%) |
| 日本の○である | 首都 | 首都(60%) | 首都(86%) | 地名(51%) | 首都(67%) |
| パンを○べた | た | た(65%) | た(70%) | お(2%) | 焼く(23%) |
| 大学で○を学ぶ | 数学/etc. | 数学(8%) | 数学(17%) | 哲学(8%) | 哲学(14%) |
| 日本語○文法 | の | :(16%) | では(14%) | の(99%) | の(99%) |
| 彼○東京に | は | もまた(11%) | 自身は(38%) | は(81%) | は(86%) |
| 出○した | 席 | 発(22%) | 席(83%) | ##頭(62%) | ##頭(34%) |
| 解○するのは | 決 | こうと(20%) | こうと(25%) | と(34%) | ##法(48%) |
| ○島県は四国 | 徳 | -(5%) | (24%) | 淡路(4%) | 十(17%) |
| ノーベル○学賞 | 物理/etc. | 生理(34%) | 生理(76%) | 物理(80%) | 物理(51%) |
| 6時に○きる | 起 | 起きて(6%) | 起(28%) | しゃべり(7%) | 始まり(6%) |
| 食後に○んで | 飲 | 飲(85%) | 飲(79%) | 飲ま(44%) | お(34%) |
Model Characteristic Differences
Tohoku BERT uses a MeCab morpheme analysis tokenizer, so it’s overwhelmingly better at particle inference (“の”, “は”). Morpheme-level tokenization means particles exist as independent single tokens.
LUKE uses SentencePiece subword splitting, which makes it more flexible at character-level fill-in (“出○した” → “席”), the one area where LUKE has an edge.
| Strength | LUKE | Tohoku BERT |
|---|---|---|
| Particle inference | Weak | Very strong |
| Character-level fill-in | Flexible | Subword effect (## prefix) |
| Proper nouns | Weak | Weak (both) |
| FT improvement range | Large gain in some cases | Limited (Wikipedia overlap) |
Neither model showed substantial improvement from Kyoto corpus fine-tuning. The Wikipedia overlap problem exists equally for Tohoku BERT.
Testing on Actual OCR Output
The fill-mask tests so far assumed we know where the <mask> goes. Real OCR doesn’t tell you “this is a defect” — a completely different approach is needed.
Setting Up NDLOCR-Lite
Installing the National Diet Library’s lightweight OCR NDLOCR-Lite in WSL2. For setup details see Running NDLOCR-Lite on Windows.
git clone https://github.com/ndl-lab/ndlocr-lite.git
cd ndlocr-lite
pip install -r requirements.txt
Running OCR on a sample image (1963 National Diet Library staff manual). CPU-only, 1.5 seconds:
cd src
python ocr.py --sourceimg ../resource/digidepo_2531162_0024.jpg --output /root/ocr-output
OCR Output and Misreads
Part of the output text:
(z)気送子送付管
気送子送付には、上記気送管にて送付するものと、空
気の圧縮を使用せず,直接落下させる装置の二通りがあ
る。後者の送付雪は出納台左側に設置されており.5
3.1の各層ステーションに直接落下するよう3本の管
が通じ投入ロのフタに層表示が記されている。取扱いに
当っては気送子投入優すみやかにフタを閉め速度を調整
Visually confirmed misreads:
- “送付雪” → correct: “送付管”
- “投入優” → correct: “投入後”
- “投入ロ” → correct: “投入口” (katakana “ロ” → kanji “口”)
- “待成する” → correct: “待機する”
- “(z)” → correct: “(ヱ)”
Perplexity-Based Correction
Fill-mask assumes we know the <mask> position, but OCR correction requires finding “what’s wrong” first.
The approach: mask each token one at a time and calculate “the probability that this character belongs at this position.” Tokens with extremely low probability (threshold: under 1%) are contextually inconsistent — likely misreads.
def check_line(text, tokenizer, model, threshold=0.01):
    encoding = tokenizer(text, return_tensors="pt")
    input_ids = encoding["input_ids"][0]
    suspects = []
    # Skip positions 0 and -1: the special tokens added by the tokenizer
    for i in range(1, len(input_ids) - 1):
        # Replace token i with <mask>
        masked_ids = input_ids.clone().unsqueeze(0).to("cuda")
        original_id = masked_ids[0, i].item()
        masked_ids[0, i] = tokenizer.mask_token_id
        with torch.no_grad():
            outputs = model(input_ids=masked_ids)
        probs = torch.softmax(outputs.logits[0, i], dim=-1)
        original_prob = probs[original_id].item()
        if original_prob < threshold:
            # Suspicious token → record top correction candidates
            top_probs, top_ids = probs.topk(3)
            suspects.append({
                "pos": i,
                "token": tokenizer.decode([original_id]),
                "prob": original_prob,
                "candidates": [(tokenizer.decode([t]), p.item())
                               for t, p in zip(top_ids, top_probs)],
            })
    return suspects
Misreads Detected by All Four Models
Ran the same OCR text through all four models (LUKE raw / LUKE FT / BERT raw / BERT FT) and compared detection on key misread locations:
| OCR text | Correct | LUKE raw | LUKE FT | BERT raw | BERT FT |
|---|---|---|---|---|---|
| 送付雪 | 管 | 箱(38%) | 機(29%) | 機(17%) | ポスト(9%) |
| 投入優 | 後 | 後(38%) | 後(44%) | 後(73%) | 後(84%) |
| 投入ロ | 口 | 口(61%) | 口(67%) | 口(84%) | 口(56%) |
All four models flagged all three locations as “suspicious” (probability under 1%). High reproducibility.
“優→後” was most confident with BERT FT at 84%. “ロ→口” was BERT raw at 84%. Both had the correct candidate as Top1.
“雪→管” didn’t produce “管” as a top candidate — “箱” or “機” instead. Not correct, but detection itself (“雪 is wrong here”) worked. Given “気送子送付管” is specialized terminology, a general language model preferring “機” or “箱” over “管” is reasonable.
Correction Observations
- The technique itself — perplexity-based detection — is more important than fine-tuning
- All four models flag the same locations as suspicious, minimizing model-to-model variation
- Tohoku BERT generally has higher correction candidate accuracy (“後” 73–84%, “口” 84%)
- Specialized term correction candidates go off-target (“送付管” isn’t in general vocabulary)
- False positives (flagging correct text as suspicious) are frequent: specialized terms like “気送子”, “ステーション”, and proper nouns all get flagged
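One way to tame these false positives is to consult a domain-term whitelist before flagging. A minimal sketch; the term list and the suspect tuple format `(char_pos, token, prob, candidates)` are assumptions for illustration, not the pipeline’s actual data structures:

```python
# Hypothetical domain-term list; in practice this would come from a
# dictionary or be harvested from already-proofread pages.
DOMAIN_TERMS = {"気送子", "気送管", "出納台", "ステーション"}

def suppress_known_terms(text, suspects, terms=DOMAIN_TERMS, window=3):
    """Drop suspects whose surrounding characters contain a known term.
    Each suspect is assumed to be (char_pos, token, prob, candidates)."""
    kept = []
    for suspect in suspects:
        pos = suspect[0]
        context = text[max(0, pos - window): pos + window + 1]
        if any(term in context for term in terms):
            continue  # part of a known specialized term: not a misread
        kept.append(suspect)
    return kept
```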
Three-Stage Pipeline: BERT Detection → Small LLM Correction
BERT detection accuracy is sufficient, but having humans manually check every flagged location isn’t realistic. Can we combine a small local LLM to automate the detection→correction pipeline?
Setting Up ollama
Installing ollama in WSL2 and pulling small models:
curl -fsSL https://ollama.com/install.sh | sh
ollama pull qwen2.5:1.5b # 986MB
ollama pull qwen2.5:3b # 1.9GB
BERT itself uses ~2.4GB of VRAM, so the two stages run sequentially: BERT is unloaded after detection, and ollama then takes the GPU for LLM inference.
v1: Request Corrections Per Line
First approach: detect suspicious locations with BERT, then send the entire line to LLM with that info and ask it to correct:
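Roughly, the v1 setup looks like this (the prompt wording, the suspect format, and the `ask_ollama` helper are illustrative, not the exact code used; the HTTP call targets ollama’s standard /api/generate endpoint):

```python
import json
import urllib.request

def build_correction_prompt(line, suspects):
    """Assemble a v1-style per-line prompt: the OCR line plus BERT's
    suspicious positions and candidates."""
    hints = "\n".join(
        f"- position {pos}: '{token}' looks wrong, candidates: {candidates}"
        for pos, token, candidates in suspects
    )
    return (
        "The following line is OCR output and may contain misread characters.\n"
        f"Line: {line}\n"
        f"Suspicious locations:\n{hints}\n"
        "Return the corrected line only. Change nothing else."
    )

def ask_ollama(prompt, model="qwen2.5:3b", host="http://localhost:11434"):
    """One non-streaming call to ollama's /api/generate endpoint."""
    req = urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps({"model": model, "prompt": prompt,
                         "stream": False}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```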
Result: Complete failure.
- 1.5B: Ignores original text and rewrites completely. “気送管” → “真空管”, “気送管による” → “気圧管による” — generates a different document
- 3B: Better than 1.5B, but still rewrites arbitrarily. “目的層の” → “目的地の”, “消える” → “点灯する” (opposite meaning)
Small LLMs asked to “correct the whole line” generate freely within the context window and can’t preserve the original. Same problem experienced with Qwen/Swallow before.
v2: Yes/No Judgment Per Character
For each token BERT detected, ask 3B individually: “Is this correction correct? YES or NO?”
Result: All NO.
Line[5] "ロ"→"口" (84%): LLM=NO -> Rejected
Line[6] "優"→"後" (73%): LLM=NO -> Rejected
All 35 cases NO — including correct corrections like “優→後” and “ロ→口”. Small models have a conservative bias: “when unsure, don’t change.” Binary YES/NO classification is a “high-stakes” task for small models, and they default to NO.
v3: A/B Selection + Enhanced Filtering
Instead of YES/NO, show the original and corrected lines side by side and ask “which is correct, A or B?” Also filter BERT detection results:
- Exclude cases where BERT’s Top1 candidate is punctuation, particle, or common verb (many cross-line false positives)
- Exclude Top1 probability under 30%
- Exclude ##-prefixed subword tokens
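The pre-filter amounts to a simple predicate over each detection. A sketch, where the stop lists are illustrative (the real lists would be tuned against the corpus) and each suspect is assumed to be a `(token, top1_candidate, top1_prob)` tuple:

```python
# Illustrative stop lists; the real ones would be tuned per corpus.
PUNCTUATION = set("、。・「」（）")
PARTICLES = {"は", "が", "の", "を", "に", "で", "と", "も"}
COMMON_VERBS = {"する", "ある", "いる", "なる"}

def filter_suspects(suspects, min_prob=0.30):
    """Pre-filter BERT detections before asking the LLM.
    Each suspect is assumed to be (token, top1_candidate, top1_prob)."""
    kept = []
    for token, candidate, prob in suspects:
        if candidate in PUNCTUATION or candidate in PARTICLES \
                or candidate in COMMON_VERBS:
            continue  # frequent cross-line false positives
        if prob < min_prob:
            continue  # low-confidence candidate
        if token.startswith("##"):
            continue  # subword fragment, not a clean character slot
        kept.append((token, candidate, prob))
    return kept
```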
After filtering: 14 cases → LLM approved 5 corrections:
| Line | Original | Corrected | BERT prob | LLM verdict | Correct? |
|---|---|---|---|---|---|
| 6 | 優 | 後 | 73% | B (accepted) | Correct |
| 17 | 交通 | 通行 | 87% | B (accepted) | Correct |
| 4 | 落下 | 接続 | 32% | B (accepted) | Wrong correction |
| 3 | ##納 | 雪 | 53% | B (accepted) | Meaningless |
| 16 | ##子 | 管 | 47% | B (accepted) | Meaningless |
2 correct corrections, 1 wrong correction, 2 meaningless corrections from subword issues.
What Was Missed
- “投入ロ→口”: BERT detected at 84%, but LLM chose A (keep original). Katakana “ロ” (U+30ED) and kanji “口” (U+53E3) are visually similar — 3B can’t distinguish them
- “二っ→つ”: BERT detected at 100%, but LLM rejected
- “送付雪→管”: BERT’s Top1 wasn’t “管” so the correction candidate was off
Pipeline Assessment
The three-stage pipeline direction is correct, but 3B-class local models have accuracy limits.
Detection (BERT):
- Perplexity-based detection is stable. Four models flag the same locations — high reproducibility
- Threshold and filter tuning can reduce false positives to some extent
- But specialized terms (“気送子”, “ステーション”) will keep getting flagged structurally
Correction (small LLM):
- 1.5B: Destroys original. Unusable for correction
- 3B (whole line): Tends to rewrite original
- 3B (YES/NO): Conservative bias of all-NO
- 3B (A/B selection): Some correct judgments, but ~50% accuracy
Practically, presenting BERT’s detection results (suspicious locations + top candidates) to humans for final judgment is the optimal solution at this point. Full automation requires 7B+ models or OCR-specific fine-tuning.
7B Model as a Comeback
If 3B isn’t enough, what about 7B? Qwen2.5:7b is 4.7GB at 4-bit quantization — fits in 8GB VRAM if BERT is unloaded.
Does It Know “気送子”?
First, a vocabulary test:
Q: What is "気送子"?
3B: "気送子" is Chinese slang for someone who is gentle and considerate of others.
7B: "気送子" refers to a device or system that uses air to transport powders and fine particles.
3B is pure hallucination; 7B is accurate. A textbook case of parameter count correlating directly with vocabulary coverage.
3B vs 7B: Same Test Accuracy
Compared 3B and 7B judgments on the 14 correction candidates BERT detected, with human-assigned ground truth:
| Line | Original | Candidate | BERT prob | Correct | 3B | 7B |
|---|---|---|---|---|---|---|
| 0 | ##子 | 管 | 63% | KEEP | x | o |
| 0 | 送付 | 真空 | 35% | KEEP | x | o |
| 2 | 気 | 空気 | 78% | KEEP | x | o |
| 3 | ##納 | 雪 | 53% | KEEP | x | o |
| 4 | 落下 | 接続 | 32% | KEEP | x | o |
| 5 | ロ | 口 | 84% | FIX | o | x |
| 6 | 優 | 後 | 73% | FIX | o | o |
| 10 | 呼ぶ | なる | 31% | KEEP | x | o |
| 12 | 層 | 地 | 84% | KEEP | x | o |
| 16 | ##子 | 管 | 47% | KEEP | x | o |
| 17 | 落下 | バス | 32% | KEEP | x | o |
| 17 | 交通 | 通行 | 87% | FIX | o | o |
| 21 | 請求 | 館 | 59% | KEEP | x | x |
| 21 | っ | つ | 100% | FIX | o | o |
| **Accuracy** | | | | | 29% | 86% |
7B got 12/14 correct. A jump from 29% to 86%.
What 7B Got Right
For KEEP judgments (where the original should be preserved) that 3B got completely wrong, 7B got 9 out of 10 correct. Knowing “気送子”, “出納台”, “落下” as vocabulary means it can say “no, this is actually correct” even when BERT flags it as suspicious.
For FIX judgments, “優→後”, “交通→通行”, “っ→つ” were all correctly accepted.
Walls 7B Couldn’t Break Through
Katakana “ロ” vs. kanji “口”:
7B also judged KEEP (keep original). For text-based LLMs, distinguishing katakana “ロ” (U+30ED) and kanji “口” (U+53E3) is fundamentally difficult. Even if the tokenizer treats them as separate tokens, semantically both “投入ロ” and “投入口” seem plausible. This is less an LLM limitation and more a structural problem with trying to recover information lost in the OCR “visual → text” conversion using only text. A multimodal model handling both image and text could potentially solve this.
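A practical text-side mitigation is a script-aware lookalike table: scan flagged spans for katakana/kanji confusable pairs and prefer the variant consistent with the surrounding script. A minimal sketch (the pair list is illustrative, not exhaustive):

```python
# Visually confusable katakana/kanji pairs (illustrative, not exhaustive):
# katakana ロ/ニ/カ/エ vs. kanji 口/二/力/工.
CONFUSABLES = {
    "\u30ed": "\u53e3",  # katakana ro -> kanji "mouth"
    "\u30cb": "\u4e8c",  # katakana ni -> kanji "two"
    "\u30ab": "\u529b",  # katakana ka -> kanji "power"
    "\u30a8": "\u5de5",  # katakana e  -> kanji "craft"
}

def flag_confusables(text, table=CONFUSABLES):
    """Return (index, char, lookalike) for every character that has a
    near-identical counterpart in the other script."""
    return [(i, c, table[c]) for i, c in enumerate(text) if c in table]
```

A follow-up pass could then pick whichever variant matches the script of the neighboring characters (kanji context → 口).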
“図書請求票” → “図書館票”:
7B made an incorrect judgment, pulled toward “図書館” (library). The specialized library term “図書請求票” (library request slip) apparently wasn’t sufficiently represented in 7B’s training data.
Model Size and Judgment Capability
| Model | Size | Vocabulary | Conservative bias | Accuracy |
|---|---|---|---|---|
| Qwen2.5 1.5B | 986MB | Low (destroys original) | None (changes anything) | Unmeasurable |
| Qwen2.5 3B | 1.9GB | Low (doesn’t know 気送子) | Prompt-dependent, unstable | 29% |
| Qwen2.5 7B | 4.7GB | High (knows 気送子) | Appropriate (correctly conservative) | 86% |
The large jump from 3B to 7B is because 7B knows Showa-era library terminology. If the model can’t understand the meaning of a correction candidate, there’s no way to judge it correctly.
Pipeline with Escalation
7B accuracy is 86%, but there are cases like “ロ→口” that 7B misses. BERT detected this at 84% confidence.
So: when 7B and BERT disagree, escalate to a human.
Three-Level Decision Rules
- AUTO-FIX: 7B says FIX → apply automatically
- ESCALATE: 7B says KEEP but BERT confidence ≥ 50% → pass to human
- AUTO-KEEP: 7B says KEEP and BERT confidence < 50% → keep automatically
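The three rules above reduce to a few lines of routing logic; a minimal sketch:

```python
def decide(llm_verdict, bert_prob, threshold=0.50):
    """Route a correction candidate: trust the 7B FIX verdicts, escalate
    KEEP verdicts that BERT strongly disputes, auto-keep the rest."""
    if llm_verdict == "FIX":
        return "AUTO-FIX"      # apply automatically
    if bert_prob >= threshold:
        return "ESCALATE"      # 7B says KEEP, but BERT strongly disagrees
    return "AUTO-KEEP"         # both sides agree nothing is wrong
```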
Results
| Line | Original | Candidate | BERT prob | 7B verdict | Action | Correct? |
|---|---|---|---|---|---|---|
| 6 | 優 | 後 | 73% | FIX | AUTO-FIX | Correct |
| 17 | 交通 | 通行 | 87% | FIX | AUTO-FIX | Correct |
| 21 | っ | つ | 100% | FIX | AUTO-FIX | Correct |
| 21 | 請求 | 館 | 59% | FIX | AUTO-FIX | Wrong correction |
| 5 | ロ | 口 | 84% | KEEP | ESCALATE | Human can catch it |
| 0 | ##子 | 管 | 63% | KEEP | ESCALATE | Human → KEEP |
| 2 | 気 | 空気 | 78% | KEEP | ESCALATE | Human → KEEP |
| 3 | ##納 | 雪 | 53% | KEEP | ESCALATE | Human → KEEP |
| 12 | 層 | 地 | 84% | KEEP | ESCALATE | Human → KEEP |
| 0 | 送付 | 真空 | 35% | KEEP | AUTO-KEEP | Correct |
| 4 | 落下 | 接続 | 32% | KEEP | AUTO-KEEP | Correct |
| 10 | 呼ぶ | なる | 31% | KEEP | AUTO-KEEP | Correct |
| 16 | ##子 | 管 | 47% | KEEP | AUTO-KEEP | Correct |
| 17 | 落下 | バス | 32% | KEEP | AUTO-KEEP | Correct |
Summary
- Automated: 9 of 14 cases (64%) processed automatically. Accuracy 89% (8/9 correct)
- Escalated to human: 5 cases. Only 1 actually needed correction (“ロ→口”)
- Missed: 0. The “ロ→口” that 7B missed gets routed to human via ESCALATE
The only automated wrong correction is “図書請求票→図書館票.” At BERT confidence 59%, raising the threshold to 60% would route this to ESCALATE too.
What Gets Escalated
The 5 human-review cases:
- Line[5] “ロ→口”: Needs correction. A human recognizes it immediately
- Line[0] “##子→管”: Cross-line noise → KEEP
- Line[2] “気→空気”: Cross-line break (空\n気 newline) → KEEP
- Line[3] “##納→雪”: “送付雪” is actually an OCR error (correctly: 送付管) but BERT’s candidate is off → human corrects separately
- Line[12] “層→地”: “目的層” is correct in this document (層 = floor) → KEEP
From 131 initial BERT detections, filtered to 14, then 9 processed automatically — humans only need to check 5. Compared to eyeballing all 131 locations, this is a substantial reduction.
The BERT perplexity scan → 7B LLM judgment → human escalation on disagreement pipeline is a realistic configuration that runs on a single RTX 4060 Laptop with 8GB VRAM. It’s not fully automated, but it dramatically reduces human workload.
Two remaining walls. “Visually similar characters” (ロ/口) are structurally limited with a text-only approach — multimodal models or character-level OCR confidence scores are needed. “Domain-specific specialized terms” (図書請求票) are beyond 7B’s coverage — dictionary or terminology list lookups are the practical solution.
For closed-network deployment, NVIDIA Jetson Orin Nano (8GB unified memory, CUDA support, 7–15W power draw) could host this pipeline as-is. BERT + 7B LLM swap inference, NDLOCR-Lite on ONNX Runtime, and ollama’s ARM64+CUDA support all work in Jetson’s Ubuntu-based environment. The Developer Kit is $249, around 50,000–60,000 yen at current exchange rates. A fully equipped Raspberry Pi 5 (8GB) also approaches 40,000 yen, so the price gap isn’t that large; meanwhile the Pi 5 can’t use CUDA, and eGPU + ROCm on ARM is likewise unsupported, ruling it out for GPU compute.