Automated OCR Error Detection and Correction with Encoder Models + Local LLM
Background
I’d been using local LLMs like Qwen and Swallow to correct OCR text. They work to a degree, but each has its own failure mode: Qwen fills in gaps from surface context without really understanding it, and Swallow freely rewrites the original text.
So I tried an encoder-based approach using small models (LUKE/BERT) instead of generative AI (decoder-based). Encoder models look at context from both sides of a passage when filling in blanks — well-suited for “inferring missing characters without breaking the original text.”
Kyoto University’s WikipediaAnnotatedCorpus is a ~9,000-article corpus annotated with morphological analysis, syntactic parsing, case analysis, and coreference analysis. Fine-tuning LUKE on this could improve accuracy for Japanese particle consistency and implicit subject completion.
Experiment Environment
| Item | Spec |
|---|---|
| GPU | NVIDIA GeForce RTX 4060 Laptop (VRAM 8GB) |
| Main memory | 32GB |
| OS | Windows 11 (working in WSL2 Ubuntu 22.04) |
| Python | 3.10.12 (WSL2 side) |
| CUDA | 12.9 |
Checking the Environment
First, check the Windows side:
$ nvidia-smi
NVIDIA-SMI 576.02 Driver Version: 576.02 CUDA Version: 12.9
RTX 4060 Laptop GPU | 0MiB / 8188MiB | 0%
$ python --version
Python 3.14.0
$ wsl --list
Ubuntu-22.04 (Default)
Windows Python 3.14 isn’t supported by PyTorch yet, so I work on the WSL2 Ubuntu 22.04 side.
Checking inside WSL2:
$ python3 --version
Python 3.10.12
$ nvidia-smi
NVIDIA-SMI 575.51.02 Driver Version: 576.02 CUDA Version: 12.9
GPU visible from WSL2. Python 3.10 works with PyTorch. pip3 and venv weren’t installed, so add them.
WSL2 Setup
# Install pip3, venv
sudo apt update && sudo apt install -y python3-pip python3-venv
# Create virtual environment
python3 -m venv ~/luke-ocr
source ~/luke-ocr/bin/activate
# PyTorch (CUDA 12.4 compatible)
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu124
# transformers 4.x (5.x has a bug with MLukeTokenizer)
pip install 'transformers>=4.40,<5' sentencepiece protobuf tiktoken
I initially installed transformers 5.2.0, but MLukeTokenizer initialization threw TypeError: argument 'vocab': 'dict' object cannot be converted to 'Sequence'. Downgrading to 4.57.6 resolved it.
Post-install verification:
import torch
print(torch.__version__) # 2.6.0+cu124
print(torch.cuda.is_available()) # True
print(torch.cuda.get_device_name(0)) # NVIDIA GeForce RTX 4060 Laptop GPU
import transformers
print(transformers.__version__) # 4.57.6
LUKE Fill-mask Baseline Test
First, check fill-mask accuracy with plain LUKE before any fine-tuning. Using studio-ousia/luke-japanese-base-lite. VRAM usage is just 512MB.
from transformers import MLukeTokenizer, LukeForMaskedLM
import torch

model_name = "studio-ousia/luke-japanese-base-lite"
tokenizer = MLukeTokenizer.from_pretrained(model_name)
model = LukeForMaskedLM.from_pretrained(model_name).to("cuda")

# Score candidates for the <mask> position
inputs = tokenizer("東京は日本の<mask>である。", return_tensors="pt").to("cuda")
with torch.no_grad():
    logits = model(**inputs).logits
mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero()[0]
print(torch.softmax(logits[0, mask_pos], dim=-1).topk(5))
Basic Tests
| Input | Top1 prediction | Confidence | Verdict |
|---|---|---|---|
| 織田信長は京都の本能寺で<mask>した。 | 死去 (died) | 32.0% | Close (“自害/suicide” would be ideal) |
| 東京は日本の<mask>である。 | 首都 (capital) | 60.1% | Correct |
| 太郎はパンを<mask>べた。 | た | 64.7% | Correct (“食”べた / ate) |
| 彼女は大学で<mask>を学んでいる。 | 数学 (mathematics) | 7.6% | Plausible (low confidence) |
OCR Defect Simulation
Testing patterns common in OCR errors — missing particles, inflectional suffixes, and proper nouns:
| Pattern | Input | Answer | Top1 | Confidence | Hit |
|---|---|---|---|---|---|
| Particle | 日本語<mask>文法は複雑である。 | の | : | 16.0% | x (の at rank 2: 10.2%) |
| Particle | 彼<mask>東京に住んでいる。 | は | もまた | 11.4% | x (は at rank 2: 10.0%) |
| Okurigana | 彼は会議に出<mask>した。 | 席 | 発 | 21.6% | x (席 at rank 2: 14.8%) |
| Okurigana | この問題を解<mask>するのは難しい。 | 決 | こうと | 20.0% | x |
| Proper noun | <mask>島県は四国にある。 | 徳 | - | 5.4% | x |
| Proper noun | ノーベル<mask>学賞を受賞した。 | 物理/etc. | 生理 | 34.1% | △ (生理学賞 is also valid) |
| Context | 彼は毎朝6時に<mask>きる。 | 起 | 起きて | 5.6% | △ (token boundary difference) |
| Context | この薬は食後に<mask>んでください。 | 飲 | 飲 | 84.6% | o |
Baseline Observations
- Strong context (medicine → 飲む/drink) achieves high accuracy without fine-tuning
- Particle inference gets the direction right but often fails to rank top
- Proper nouns (place names) are weak — single-character defects at sentence start are especially poor
- Mid-compound defects like “解決する” have poor compatibility with subword tokenization
The interesting question is how fine-tuning on the Kyoto University corpus changes accuracy for particles and case relations.
Preparing the Kyoto University Corpus
Clone WikipediaAnnotatedCorpus and examine the data:
git clone https://github.com/ku-nlp/WikipediaAnnotatedCorpus.git
Directory structure:
- knp/: KNP-format annotated data (3,979 files)
- org/: Raw text
- id/: Train/dev/test splits (train 3,679, dev 100, test 200)
Inside KNP Format
Opening the Ashikaga Takauji Wikipedia article (wiki00010002.knp):
# S-ID:wiki00010002-00-01
* 3D
+ 1D
足利 あしかが 足利 名詞 6 人名 5 * 0 * 0 NIL <NE:PERSON:head>
+ 8D <NE:PERSON:足利 尊氏>
尊氏 たかうじ 尊氏 名詞 6 人名 5 * 0 * 0 NIL <NE:PERSON:tail>
は は は 助詞 9 副助詞 2 * 0 * 0 NIL
...
+ -1D <rel type="ガ" target="尊氏" .../>
武将 ぶしょう 武将 名詞 6 普通名詞 1 * 0 * 0 NIL
Each line contains a morpheme (surface form, reading, base form, part of speech). + lines contain case analysis (nominative, accusative, dative…). <NE:...> marks named entity tags.
Parsing with rhoknp
Using the rhoknp library recommended in the README (pyknp is outdated):
pip install rhoknp
from rhoknp import Document

with open("knp/wiki0001/wiki00010002.knp") as f:
    doc = Document.from_knp(f.read())

for sent in doc.sentences:
    text = "".join(m.text for m in sent.morphemes)
    print(text)
    # => 足利 尊氏は、鎌倉時代末期から室町時代前期の武将。
Case analysis is also accessible:
Sentence: 足利 尊氏は、鎌倉時代末期から室町時代前期の武将。
[末期から] --ノ--> 時代
[武将。] --ガ--> 尊氏
[武将。] --カラ--> 末期
Text Extraction
Extracted plain text from all KNP files, filtering out short sentences under 10 characters (like parenthetical readings):
| Split | Sentence count |
|---|---|
| train | 10,243 |
| dev | 312 |
| test | 545 |
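For reference, the extraction-and-filter step can be sketched like this, reusing the rhoknp access pattern from the parsing section (the directory walk and the 10-character threshold mirror the description above; treat it as a sketch, not the exact script used):

```python
from pathlib import Path

def extract_sentences(knp_dir):
    """Walk a WikipediaAnnotatedCorpus knp/ directory and yield plain-text
    sentences reassembled from morphemes."""
    from rhoknp import Document  # lazy import: only needed for extraction
    for path in sorted(Path(knp_dir).rglob("*.knp")):
        doc = Document.from_knp(path.read_text())
        for sent in doc.sentences:
            yield "".join(m.text for m in sent.morphemes)

def filter_sentences(sentences, min_len=10):
    """Drop fragments under min_len characters (parenthetical readings etc.)."""
    return [s for s in sentences if len(s) >= min_len]
```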
Fine-tuning
Training Configuration
BATCH_SIZE = 8
EPOCHS = 3
LR = 5e-5
MAX_LEN = 128
MASK_PROB = 0.15 # Randomly mask 15% of tokens
Standard MLM (Masked Language Modeling): randomly replace 15% of input tokens with <mask> and predict the originals. Of those, 80% are masked, 10% are random tokens, 10% are kept as-is.
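The 80/10/10 rule can be sketched with the stdlib alone (a minimal version of what transformers’ DataCollatorForLanguageModeling does internally; the token IDs and mask ID below are arbitrary illustration values):

```python
import random

def mlm_mask(token_ids, mask_id, vocab_size, mask_prob=0.15, seed=None):
    """BERT-style MLM corruption: pick ~mask_prob of positions, then
    80% -> mask token, 10% -> random token, 10% -> left unchanged.
    labels is -100 at unselected positions (the transformers convention
    for positions the loss ignores)."""
    rng = random.Random(seed)
    corrupted = list(token_ids)
    labels = [-100] * len(token_ids)
    for i, tid in enumerate(token_ids):
        if rng.random() >= mask_prob:
            continue                      # not selected for prediction
        labels[i] = tid                   # model must recover this token
        r = rng.random()
        if r < 0.8:
            corrupted[i] = mask_id        # 80%: replace with <mask>
        elif r < 0.9:
            corrupted[i] = rng.randrange(vocab_size)  # 10%: random token
        # remaining 10%: keep the original token as-is
    return corrupted, labels

corrupted, labels = mlm_mask([5, 42, 7, 99, 13], mask_id=4, vocab_size=1000, seed=0)
```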
Training Results
| Epoch | Train Loss | Dev Loss | Time | VRAM Peak |
|---|---|---|---|---|
| 1 | 2.050 | 1.943 | 203s | 2,677 MB |
| 2 | 1.919 | 1.933 | 200s | 2,677 MB |
| 3 | 1.692 | 1.853 | 202s | 2,677 MB |
3 epochs total in ~10 minutes. Only 2.7GB of the 8GB VRAM used — room to increase batch size to 16 or 32.
Baseline vs. Fine-tuned Comparison
Same test sentences before and after fine-tuning. “o” if the correct answer is in the Top 3:
| Test | Answer | Baseline Top1 | FT Top1 | Improved |
|---|---|---|---|---|
| 本能寺で<mask>した | 死去 | 死去(32.0%) | 修行(20.7%) | x Worse |
| 日本の<mask>である | 首都 | 首都(60.1%) | 首都(85.6%) | o Large gain |
| パンを<mask>べた | た | た(64.7%) | た(69.9%) | o Slight gain |
| 大学で<mask>を学ぶ | 数学/etc. | 数学(7.6%) | 数学(16.5%) | o Improved |
| 日本語<mask>文法は | の | :(16.0%) | では(13.7%) | x No change |
| 彼<mask>東京に | は | もまた(11.4%) | 自身は(38.2%) | x Worse |
| 出<mask>した | 席 | 発(21.6%) | 席(82.6%) | o Large gain |
| 解<mask>するのは | 決 | こうと(20.0%) | こうと(25.2%) | x No change |
| <mask>島県は四国 | 徳 | -(5.4%) | (24.3%) | x No change |
| ノーベル<mask>学賞 | 物理/etc. | 生理(34.1%) | 生理(75.5%) | △ Confidence up |
| 6時に<mask>きる | 起 | 起きて(5.6%) | 起(27.8%) | o Large gain |
| 食後に<mask>んで | 飲 | 飲(84.6%) | 飲(79.0%) | o Maintained |
Observations
Improved:
- “出席した”: 14.8% → 82.6% (moved to Top1)
- “首都”: 60.1% → 85.6% (confidence up)
- “起きる”: Outside Top3 → 27.8% at Top1 (large gain)
- “数学”: 7.6% → 16.5% (confidence doubled)
Degraded:
- “死去した”: 32.0% → 5.5%, “修行” moved to Top1
- Single-particle inference still weak
No change:
- Single-character defects in proper nouns (“徳島県”) still fail
- Mid-compound defects (“解決する”) also no improvement
Since the Kyoto corpus is Wikipedia encyclopedic text, encyclopedic fill-ins (“首都”, “出席”) improved substantially. Standalone particles and partial proper noun defects need a different approach — character-level models or OCR-specific training data.
The Wikipedia-Feeding-Wikipedia Problem
What I realized at this point: LUKE’s pre-training data is Wikipedia + BookCorpus, and the Kyoto corpus is also extracted from ~4,000 Wikipedia articles. So I was feeding LUKE a domain it had already learned, without using the annotations.
This training only used raw text from the corpus — the case analysis (nominative, accusative, dative relations), coreference analysis (pronoun referents), and named entity tags were all unused. The real value of the Kyoto corpus is those annotations, so getting proper value requires multi-task learning for case label prediction or named entity classification.
Directions for better results:
- Design tasks that actually use the annotations
- Train on non-Wikipedia domains (historical documents, government documents, etc.)
- Switch to a base model trained on non-Wikipedia data
Comparison with Tohoku University BERT v3
Also testing cl-tohoku/bert-base-japanese-v3 — same Wikipedia-based pre-training as LUKE, but different architecture and tokenizer:
- LUKE: SentencePiece tokenizer (subword splitting)
- Tohoku BERT: MeCab + WordPiece (morpheme-based splitting)
BERT v3 has the same Wikipedia overlap problem since it was pre-trained on Wikipedia + CC-100. But the tokenizer difference may affect results.
Additionally need fugashi (MeCab Python binding) and unidic-lite:
pip install fugashi unidic-lite
Training Results
| Epoch | Train Loss | Dev Loss | Time | VRAM Peak |
|---|---|---|---|---|
| 1 | 1.503 | 1.295 | 201s | 2,417 MB |
| 2 | 1.370 | 1.395 | 198s | 2,417 MB |
| 3 | 1.228 | 1.419 | 198s | 2,417 MB |
Overall lower loss than LUKE. However dev loss slightly increased from Epoch 2→3, showing a mild overfitting tendency.
Four-Model Comparison
Comparing all four patterns: LUKE (baseline / FT) and Tohoku BERT (baseline / FT):
| Test | Answer | LUKE raw | LUKE FT | BERT raw | BERT FT |
|---|---|---|---|---|---|
| 本能寺で○した | 死去 | 死去(32%) | 修行(21%) | 自害(46%) | 自害(34%) |
| 日本の○である | 首都 | 首都(60%) | 首都(86%) | 地名(51%) | 首都(67%) |
| パンを○べた | た | た(65%) | た(70%) | お(2%) | 焼く(23%) |
| 大学で○を学ぶ | 数学/etc. | 数学(8%) | 数学(17%) | 哲学(8%) | 哲学(14%) |
| 日本語○文法 | の | :(16%) | では(14%) | の(99%) | の(99%) |
| 彼○東京に | は | もまた(11%) | 自身は(38%) | は(81%) | は(86%) |
| 出○した | 席 | 発(22%) | 席(83%) | ##頭(62%) | ##頭(34%) |
| 解○するのは | 決 | こうと(20%) | こうと(25%) | と(34%) | ##法(48%) |
| ○島県は四国 | 徳 | -(5%) | (24%) | 淡路(4%) | 十(17%) |
| ノーベル○学賞 | 物理/etc. | 生理(34%) | 生理(76%) | 物理(80%) | 物理(51%) |
| 6時に○きる | 起 | 起きて(6%) | 起(28%) | しゃべり(7%) | 始まり(6%) |
| 食後に○んで | 飲 | 飲(85%) | 飲(79%) | 飲ま(44%) | お(34%) |
Model Characteristic Differences
Tohoku BERT uses a MeCab morpheme analysis tokenizer, so it’s overwhelmingly better at particle inference (“の”, “は”). Morpheme-level tokenization means particles exist as independent single tokens.
LUKE uses SentencePiece subword splitting, which makes it more flexible at character-level fill-in (“出○した” → “席”), the one area where LUKE has an edge.
| Strength | LUKE | Tohoku BERT |
|---|---|---|
| Particle inference | Weak | Very strong |
| Character-level fill-in | Flexible | Subword effect (## prefix) |
| Proper nouns | Weak | Weak (both) |
| FT improvement range | Large gain in some cases | Limited (Wikipedia overlap) |
Neither model showed substantial improvement from Kyoto corpus fine-tuning. The Wikipedia overlap problem exists equally for Tohoku BERT.
Testing on Actual OCR Output
The fill-mask tests so far assumed we know where the <mask> goes. Real OCR doesn’t tell you “this is a defect” — a completely different approach is needed.
Setting Up NDLOCR-Lite
Installing the National Diet Library’s lightweight OCR NDLOCR-Lite in WSL2. For setup details see Running NDLOCR-Lite on Windows.
git clone https://github.com/ndl-lab/ndlocr-lite.git
cd ndlocr-lite
pip install -r requirements.txt
Running OCR on a sample image (1963 National Diet Library staff manual). CPU-only, 1.5 seconds:
cd src
python ocr.py --sourceimg ../resource/digidepo_2531162_0024.jpg --output /root/ocr-output
OCR Output and Misreads
Part of the output text:
(z)気送子送付管
気送子送付には、上記気送管にて送付するものと、空
気の圧縮を使用せず,直接落下させる装置の二通りがあ
る。後者の送付雪は出納台左側に設置されており.5
3.1の各層ステーションに直接落下するよう3本の管
が通じ投入ロのフタに層表示が記されている。取扱いに
当っては気送子投入優すみやかにフタを閉め速度を調整
Visually confirmed misreads:
- “送付雪” → correct: “送付管”
- “投入優” → correct: “投入後”
- “投入ロ” → correct: “投入口” (katakana “ロ” → kanji “口”)
- “待成する” → correct: “待機する”
- “(z)” → correct: “(ヱ)”
Perplexity-Based Correction
Fill-mask assumes we know the <mask> position, but OCR correction requires finding “what’s wrong” first.
The approach: mask each token one at a time and calculate “the probability that this character belongs at this position.” Tokens with extremely low probability (threshold: under 1%) are contextually inconsistent — likely misreads.
def check_line(text, tokenizer, model, threshold=0.01):
    encoding = tokenizer(text, return_tensors="pt")
    input_ids = encoding["input_ids"][0]
    suspects = []
    # Skip positions 0 and -1: the special tokens added by the tokenizer
    for i in range(1, len(input_ids) - 1):
        # Replace token i with <mask>
        masked_ids = input_ids.clone().unsqueeze(0).to("cuda")
        original_id = masked_ids[0, i].item()
        masked_ids[0, i] = tokenizer.mask_token_id
        with torch.no_grad():
            outputs = model(input_ids=masked_ids)
        probs = torch.softmax(outputs.logits[0, i], dim=-1)
        original_prob = probs[original_id].item()
        if original_prob < threshold:
            # Suspicious token → record top correction candidates
            top_probs, top_ids = probs.topk(3)
            suspects.append({
                "pos": i,
                "token": tokenizer.decode([original_id]),
                "prob": original_prob,
                "candidates": [(tokenizer.decode([t]), p.item())
                               for t, p in zip(top_ids, top_probs)],
            })
    return suspects
Misreads Detected by All Four Models
Ran the same OCR text through all four models (LUKE raw / LUKE FT / BERT raw / BERT FT) and compared detection on key misread locations:
| OCR text | Correct | LUKE raw | LUKE FT | BERT raw | BERT FT |
|---|---|---|---|---|---|
| 送付雪 | 管 | 箱(38%) | 機(29%) | 機(17%) | ポスト(9%) |
| 投入優 | 後 | 後(38%) | 後(44%) | 後(73%) | 後(84%) |
| 投入ロ | 口 | 口(61%) | 口(67%) | 口(84%) | 口(56%) |
All four models flagged all three locations as “suspicious” (probability under 1%). High reproducibility.
“優→後” was most confident with BERT FT at 84%. “ロ→口” was BERT raw at 84%. Both had the correct candidate as Top1.
“雪→管” didn’t produce “管” as a top candidate — “箱” or “機” instead. Not correct, but detection itself (“雪 is wrong here”) worked. Given “気送子送付管” is specialized terminology, a general language model preferring “機” or “箱” over “管” is reasonable.
Correction Observations
- The technique itself — perplexity-based detection — is more important than fine-tuning
- All four models flag the same locations as suspicious, minimizing model-to-model variation
- Tohoku BERT generally has higher correction candidate accuracy (“後” 73–84%, “口” 84%)
- Specialized term correction candidates go off-target (“送付管” isn’t in general vocabulary)
- False positives (flagging correct text as suspicious) are frequent: specialized terms like “気送子”, “ステーション”, and proper nouns all get flagged
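One way to tame these false positives is to consult a domain-term whitelist before flagging. A minimal sketch; the term list and the suspect tuple format `(char_pos, token, prob, candidates)` are assumptions for illustration, not the pipeline’s actual data structures:

```python
# Hypothetical domain-term list; in practice this would come from a
# dictionary or be harvested from already-proofread pages.
DOMAIN_TERMS = {"気送子", "気送管", "出納台", "ステーション"}

def suppress_known_terms(text, suspects, terms=DOMAIN_TERMS, window=3):
    """Drop suspects whose surrounding characters contain a known term.
    Each suspect is assumed to be (char_pos, token, prob, candidates)."""
    kept = []
    for suspect in suspects:
        pos = suspect[0]
        context = text[max(0, pos - window): pos + window + 1]
        if any(term in context for term in terms):
            continue  # part of a known specialized term: not a misread
        kept.append(suspect)
    return kept
```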
Three-Stage Pipeline: BERT Detection → Small LLM Correction
BERT detection accuracy is sufficient, but having humans manually check every flagged location isn’t realistic. Can we combine a small local LLM to automate the detection→correction pipeline?
Setting Up ollama
Installing ollama in WSL2 and pulling small models:
curl -fsSL https://ollama.com/install.sh | sh
ollama pull qwen2.5:1.5b # 986MB
ollama pull qwen2.5:3b # 1.9GB
BERT itself uses ~2.4GB of VRAM, so the two stages run sequentially: BERT is unloaded after detection, and ollama then takes the GPU for LLM inference.
v1: Request Corrections Per Line
First approach: detect suspicious locations with BERT, then send the entire line to LLM with that info and ask it to correct:
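Roughly, the v1 setup looks like this (the prompt wording, the suspect format, and the `ask_ollama` helper are illustrative, not the exact code used; the HTTP call targets ollama’s standard /api/generate endpoint):

```python
import json
import urllib.request

def build_correction_prompt(line, suspects):
    """Assemble a v1-style per-line prompt: the OCR line plus BERT's
    suspicious positions and candidates."""
    hints = "\n".join(
        f"- position {pos}: '{token}' looks wrong, candidates: {candidates}"
        for pos, token, candidates in suspects
    )
    return (
        "The following line is OCR output and may contain misread characters.\n"
        f"Line: {line}\n"
        f"Suspicious locations:\n{hints}\n"
        "Return the corrected line only. Change nothing else."
    )

def ask_ollama(prompt, model="qwen2.5:3b", host="http://localhost:11434"):
    """One non-streaming call to ollama's /api/generate endpoint."""
    req = urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps({"model": model, "prompt": prompt,
                         "stream": False}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```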
Result: Complete failure.
- 1.5B: Ignores original text and rewrites completely. “気送管” → “真空管”, “気送管による” → “気圧管による” — generates a different document
- 3B: Better than 1.5B, but still rewrites arbitrarily. “目的層の” → “目的地の”, “消える” → “点灯する” (opposite meaning)
Small LLMs asked to “correct the whole line” generate freely within the context window and can’t preserve the original. Same problem experienced with Qwen/Swallow before.
v2: Yes/No Judgment Per Character
For each token BERT detected, ask 3B individually: “Is this correction correct? YES or NO?”
Result: All NO.
Line[5] "ロ"→"口" (84%): LLM=NO -> Rejected
Line[6] "優"→"後" (73%): LLM=NO -> Rejected
All 35 cases NO — including correct corrections like “優→後” and “ロ→口”. Small models have a conservative bias: “when unsure, don’t change.” Binary YES/NO classification is a “high-stakes” task for small models, and they default to NO.
v3: A/B Selection + Enhanced Filtering
Instead of YES/NO, show the original and corrected lines side by side and ask “which is correct, A or B?” Also filter BERT detection results:
- Exclude cases where BERT’s Top1 candidate is punctuation, particle, or common verb (many cross-line false positives)
- Exclude Top1 probability under 30%
- Exclude ##-prefixed subword tokens
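The pre-filter amounts to a simple predicate over each detection. A sketch, where the stop lists are illustrative (the real lists would be tuned against the corpus) and each suspect is assumed to be a `(token, top1_candidate, top1_prob)` tuple:

```python
# Illustrative stop lists; the real ones would be tuned per corpus.
PUNCTUATION = set("、。・「」（）")
PARTICLES = {"は", "が", "の", "を", "に", "で", "と", "も"}
COMMON_VERBS = {"する", "ある", "いる", "なる"}

def filter_suspects(suspects, min_prob=0.30):
    """Pre-filter BERT detections before asking the LLM.
    Each suspect is assumed to be (token, top1_candidate, top1_prob)."""
    kept = []
    for token, candidate, prob in suspects:
        if candidate in PUNCTUATION or candidate in PARTICLES \
                or candidate in COMMON_VERBS:
            continue  # frequent cross-line false positives
        if prob < min_prob:
            continue  # low-confidence candidate
        if token.startswith("##"):
            continue  # subword fragment, not a clean character slot
        kept.append((token, candidate, prob))
    return kept
```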
After filtering: 14 cases → LLM approved 5 corrections:
| Line | Original | Corrected | BERT prob | LLM verdict | Correct? |
|---|---|---|---|---|---|
| 6 | 優 | 後 | 73% | B (accepted) | Correct |
| 17 | 交通 | 通行 | 87% | B (accepted) | Correct |
| 4 | 落下 | 接続 | 32% | B (accepted) | Wrong correction |
| 3 | ##納 | 雪 | 53% | B (accepted) | Meaningless |
| 16 | ##子 | 管 | 47% | B (accepted) | Meaningless |
2 correct corrections, 1 wrong correction, 2 meaningless corrections from subword issues.
What Was Missed
- “投入ロ→口”: BERT detected at 84%, but LLM chose A (keep original). Katakana “ロ” (U+30ED) and kanji “口” (U+53E3) are visually similar — 3B can’t distinguish them
- “二っ→つ”: BERT detected at 100%, but LLM rejected
- “送付雪→管”: BERT’s Top1 wasn’t “管” so the correction candidate was off
Pipeline Assessment
The three-stage pipeline direction is correct, but 3B-class local models have accuracy limits.
Detection (BERT):
- Perplexity-based detection is stable. Four models flag the same locations — high reproducibility
- Threshold and filter tuning can reduce false positives to some extent
- But specialized terms (“気送子”, “ステーション”) will keep getting flagged structurally
Correction (small LLM):
- 1.5B: Destroys original. Unusable for correction
- 3B (whole line): Tends to rewrite original
- 3B (YES/NO): Conservative bias of all-NO
- 3B (A/B selection): Some correct judgments, but ~50% accuracy
Practically, presenting BERT’s detection results (suspicious locations + top candidates) to humans for final judgment is the optimal solution at this point. Full automation requires 7B+ models or OCR-specific fine-tuning.
7B Model as a Comeback
If 3B isn’t enough, what about 7B? Qwen2.5:7b is 4.7GB at 4-bit quantization — fits in 8GB VRAM if BERT is unloaded.
Does It Know “気送子”?
First, a vocabulary test:
Q: What is "気送子"?
3B: "気送子" is Chinese slang for someone who is gentle and considerate of others.
7B: "気送子" refers to a device or system that uses air to transport powders and fine particles.
3B is pure hallucination; 7B is accurate. A textbook case of parameter count correlating directly with vocabulary coverage.
3B vs 7B: Same Test Accuracy
Compared 3B and 7B judgments on the 14 correction candidates BERT detected, with human-assigned ground truth:
| Line | Original | Candidate | BERT prob | Correct | 3B | 7B |
|---|---|---|---|---|---|---|
| 0 | ##子 | 管 | 63% | KEEP | x | o |
| 0 | 送付 | 真空 | 35% | KEEP | x | o |
| 2 | 気 | 空気 | 78% | KEEP | x | o |
| 3 | ##納 | 雪 | 53% | KEEP | x | o |
| 4 | 落下 | 接続 | 32% | KEEP | x | o |
| 5 | ロ | 口 | 84% | FIX | o | x |
| 6 | 優 | 後 | 73% | FIX | o | o |
| 10 | 呼ぶ | なる | 31% | KEEP | x | o |
| 12 | 層 | 地 | 84% | KEEP | x | o |
| 16 | ##子 | 管 | 47% | KEEP | x | o |
| 17 | 落下 | バス | 32% | KEEP | x | o |
| 17 | 交通 | 通行 | 87% | FIX | o | o |
| 21 | 請求 | 館 | 59% | KEEP | x | x |
| 21 | っ | つ | 100% | FIX | o | o |
| **Accuracy** | | | | | 29% | 86% |
7B got 12/14 correct. A jump from 29% to 86%.
What 7B Got Right
For KEEP judgments (where the original should be preserved) that 3B got completely wrong, 7B got 9 out of 10 correct. Knowing “気送子”, “出納台”, “落下” as vocabulary means it can say “no, this is actually correct” even when BERT flags it as suspicious.
For FIX judgments, “優→後”, “交通→通行”, “っ→つ” were all correctly accepted.
Walls 7B Couldn’t Break Through
Katakana “ロ” vs. kanji “口”:
7B also judged KEEP (keep original). For text-based LLMs, distinguishing katakana “ロ” (U+30ED) and kanji “口” (U+53E3) is fundamentally difficult. Even if the tokenizer treats them as separate tokens, semantically both “投入ロ” and “投入口” seem plausible. This is less an LLM limitation and more a structural problem with trying to recover information lost in the OCR “visual → text” conversion using only text. A multimodal model handling both image and text could potentially solve this.
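A practical text-side mitigation is a script-aware lookalike table: scan flagged spans for katakana/kanji confusable pairs and prefer the variant consistent with the surrounding script. A minimal sketch (the pair list is illustrative, not exhaustive):

```python
# Visually confusable katakana/kanji pairs (illustrative, not exhaustive):
# katakana ロ/ニ/カ/エ vs. kanji 口/二/力/工.
CONFUSABLES = {
    "\u30ed": "\u53e3",  # katakana ro -> kanji "mouth"
    "\u30cb": "\u4e8c",  # katakana ni -> kanji "two"
    "\u30ab": "\u529b",  # katakana ka -> kanji "power"
    "\u30a8": "\u5de5",  # katakana e  -> kanji "craft"
}

def flag_confusables(text, table=CONFUSABLES):
    """Return (index, char, lookalike) for every character that has a
    near-identical counterpart in the other script."""
    return [(i, c, table[c]) for i, c in enumerate(text) if c in table]
```

A follow-up pass could then pick whichever variant matches the script of the neighboring characters (kanji context → 口).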
“図書請求票” → “図書館票”:
7B made an incorrect judgment, pulled toward “図書館” (library). The specialized library term “図書請求票” (library request slip) apparently wasn’t sufficiently represented in 7B’s training data.
Model Size and Judgment Capability
| Model | Size | Vocabulary | Conservative bias | Accuracy |
|---|---|---|---|---|
| Qwen2.5 1.5B | 986MB | Low (destroys original) | None (changes anything) | Unmeasurable |
| Qwen2.5 3B | 1.9GB | Low (doesn’t know 気送子) | Prompt-dependent, unstable | 29% |
| Qwen2.5 7B | 4.7GB | High (knows 気送子) | Appropriate (correctly conservative) | 86% |
The large jump from 3B to 7B is because 7B knows Showa-era library terminology. If the model can’t understand the meaning of a correction candidate, there’s no way to judge it correctly.
Pipeline with Escalation
7B accuracy is 86%, but there are cases like “ロ→口” that 7B misses. BERT detected this at 84% confidence.
So: when 7B and BERT disagree, escalate to a human.
Three-Level Decision Rules
- AUTO-FIX: 7B says FIX → apply automatically
- ESCALATE: 7B says KEEP but BERT confidence ≥ 50% → pass to human
- AUTO-KEEP: 7B says KEEP and BERT confidence < 50% → keep automatically
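The three rules above reduce to a few lines of routing logic; a minimal sketch:

```python
def decide(llm_verdict, bert_prob, threshold=0.50):
    """Route a correction candidate: trust the 7B FIX verdicts, escalate
    KEEP verdicts that BERT strongly disputes, auto-keep the rest."""
    if llm_verdict == "FIX":
        return "AUTO-FIX"      # apply automatically
    if bert_prob >= threshold:
        return "ESCALATE"      # 7B says KEEP, but BERT strongly disagrees
    return "AUTO-KEEP"         # both sides agree nothing is wrong
```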
Results
| Line | Original | Candidate | BERT prob | 7B verdict | Action | Correct? |
|---|---|---|---|---|---|---|
| 6 | 優 | 後 | 73% | FIX | AUTO-FIX | Correct |
| 17 | 交通 | 通行 | 87% | FIX | AUTO-FIX | Correct |
| 21 | っ | つ | 100% | FIX | AUTO-FIX | Correct |
| 21 | 請求 | 館 | 59% | FIX | AUTO-FIX | Wrong correction |
| 5 | ロ | 口 | 84% | KEEP | ESCALATE | Human can catch it |
| 0 | ##子 | 管 | 63% | KEEP | ESCALATE | Human → KEEP |
| 2 | 気 | 空気 | 78% | KEEP | ESCALATE | Human → KEEP |
| 3 | ##納 | 雪 | 53% | KEEP | ESCALATE | Human → KEEP |
| 12 | 層 | 地 | 84% | KEEP | ESCALATE | Human → KEEP |
| 0 | 送付 | 真空 | 35% | KEEP | AUTO-KEEP | Correct |
| 4 | 落下 | 接続 | 32% | KEEP | AUTO-KEEP | Correct |
| 10 | 呼ぶ | なる | 31% | KEEP | AUTO-KEEP | Correct |
| 16 | ##子 | 管 | 47% | KEEP | AUTO-KEEP | Correct |
| 17 | 落下 | バス | 32% | KEEP | AUTO-KEEP | Correct |
Summary
- Automated: 9 of 14 cases (64%) processed automatically. Accuracy 89% (8/9 correct)
- Escalated to human: 5 cases. Only 1 actually needed correction (“ロ→口”)
- Missed: 0. The “ロ→口” that 7B missed gets routed to human via ESCALATE
The only automated wrong correction is “図書請求票→図書館票.” At BERT confidence 59%, raising the threshold to 60% would route this to ESCALATE too.
What Gets Escalated
The 5 human-review cases:
- Line[5] “ロ→口”: Needs correction. A human recognizes it immediately
- Line[0] “##子→管”: Cross-line noise → KEEP
- Line[2] “気→空気”: Cross-line break (空\n気 newline) → KEEP
- Line[3] “##納→雪”: “送付雪” is actually an OCR error (correctly: 送付管) but BERT’s candidate is off → human corrects separately
- Line[12] “層→地”: “目的層” is correct in this document (層 = floor) → KEEP
From 131 initial BERT detections, filtered to 14, then 9 processed automatically — humans only need to check 5. Compared to eyeballing all 131 locations, this is a substantial reduction.
The BERT perplexity scan → 7B LLM judgment → human escalation on disagreement pipeline is a realistic configuration that runs on a single RTX 4060 Laptop with 8GB VRAM. It’s not fully automated, but it dramatically reduces human workload.
Two remaining walls. “Visually similar characters” (ロ/口) are structurally limited with a text-only approach — multimodal models or character-level OCR confidence scores are needed. “Domain-specific specialized terms” (図書請求票) are beyond 7B’s coverage — dictionary or terminology list lookups are the practical solution.
For closed-network deployment, NVIDIA Jetson Orin Nano (8GB unified memory, CUDA support, 7–15W power draw) could host this pipeline as-is. BERT + 7B LLM swap inference, NDLOCR-Lite on ONNX Runtime, and ollama’s ARM64+CUDA support all work in Jetson’s Ubuntu-based environment. The Developer Kit is $249, around 50,000–60,000 yen at current exchange rates. A fully equipped Raspberry Pi 5 (8GB) also approaches 40,000 yen, so the price gap isn’t that large; meanwhile the Pi 5 can’t use CUDA, and eGPU + ROCm on ARM is likewise unsupported, ruling it out for GPU compute.