
Automated OCR Error Detection and Correction with Encoder Models + Local LLM

Background

I’d been using local LLMs like Qwen and Swallow to correct OCR text. They work to a degree, but each has problems: Qwen fills in gaps from context without really understanding it, and Swallow rewrites the text freely.

So I tried an encoder-based approach using small models (LUKE/BERT) instead of generative AI (decoder-based). Encoder models look at context from both sides of a passage when filling in blanks — well-suited for “inferring missing characters without breaking the original text.”

Kyoto University’s WikipediaAnnotatedCorpus is a ~9,000-article corpus annotated with morphological analysis, syntactic parsing, case analysis, and coreference analysis. Fine-tuning LUKE on this could improve accuracy for Japanese particle consistency and implicit subject completion.

Experiment Environment

Item | Spec
GPU | NVIDIA GeForce RTX 4060 Laptop (VRAM 8GB)
Main memory | 32GB
OS | Windows 11 (working in WSL2 Ubuntu 22.04)
Python | 3.10.12 (WSL2 side)
CUDA | 12.9

Checking the Environment

First, check the Windows side:

$ nvidia-smi
NVIDIA-SMI 576.02    Driver Version: 576.02    CUDA Version: 12.9
RTX 4060 Laptop GPU  |  0MiB / 8188MiB  |  0%

$ python --version
Python 3.14.0

$ wsl --list
Ubuntu-22.04 (Default)

Windows Python 3.14 isn’t supported by PyTorch yet, so I work on the WSL2 Ubuntu 22.04 side.

Checking inside WSL2:

$ python3 --version
Python 3.10.12

$ nvidia-smi
NVIDIA-SMI 575.51.02    Driver Version: 576.02    CUDA Version: 12.9

GPU visible from WSL2. Python 3.10 works with PyTorch. pip3 and venv weren’t installed, so add them.

WSL2 Setup

# Install pip3, venv
sudo apt update && sudo apt install -y python3-pip python3-venv

# Create virtual environment
python3 -m venv ~/luke-ocr
source ~/luke-ocr/bin/activate

# PyTorch (CUDA 12.4 compatible)
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu124

# transformers 4.x (5.x has a bug with MLukeTokenizer)
pip install 'transformers>=4.40,<5' sentencepiece protobuf tiktoken

I initially installed transformers 5.2.0, but MLukeTokenizer initialization threw TypeError: argument 'vocab': 'dict' object cannot be converted to 'Sequence'. Downgrading to 4.57.6 resolved it.

Post-install verification:

import torch
print(torch.__version__)        # 2.6.0+cu124
print(torch.cuda.is_available()) # True
print(torch.cuda.get_device_name(0))  # NVIDIA GeForce RTX 4060 Laptop GPU

import transformers
print(transformers.__version__)  # 4.57.6

LUKE Fill-mask Baseline Test

First, check fill-mask accuracy with plain LUKE before any fine-tuning. Using studio-ousia/luke-japanese-base-lite. VRAM usage is just 512MB.

from transformers import MLukeTokenizer, LukeForMaskedLM
import torch

model_name = "studio-ousia/luke-japanese-base-lite"
tokenizer = MLukeTokenizer.from_pretrained(model_name)
model = LukeForMaskedLM.from_pretrained(model_name).to("cuda")
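
To produce the tables below, each test sentence containing <mask> is run through the model and the top candidates at the mask position are read off the softmax. A minimal sketch (predict_mask is just an illustrative helper, not part of transformers):

def predict_mask(text, top_k=3):
    # Tokenize and locate the <mask> position
    inputs = tokenizer(text, return_tensors="pt").to("cuda")
    mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0][0]

    with torch.no_grad():
        logits = model(**inputs).logits

    # Softmax over the vocabulary at the mask position -> top-k tokens with confidence
    probs = torch.softmax(logits[0, mask_pos], dim=-1)
    top = probs.topk(top_k)
    return [(tokenizer.convert_ids_to_tokens(i.item()), round(p.item(), 3))
            for i, p in zip(top.indices, top.values)]

print(predict_mask("東京は日本の<mask>である。"))
# e.g. [('首都', 0.601), ...] per the table below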

Basic Tests

Input | Top1 prediction | Confidence | Verdict
織田信長は京都の本能寺で<mask>した。 | 死去 (died) | 32.0% | Close (“自害”/suicide would be ideal)
東京は日本の<mask>である。 | 首都 (capital) | 60.1% | Correct
太郎はパンを<mask>べた。 | た | 64.7% | Correct (“食”べた / ate)
彼女は大学で<mask>を学んでいる。 | 数学 (mathematics) | 7.6% | Plausible (low confidence)

OCR Defect Simulation

Testing patterns common in OCR errors — missing particles, inflectional suffixes, and proper nouns:

Pattern | Input | Answer | Top1 | Confidence | Hit
Particle | 日本語<mask>文法は複雑である。 | の | : | 16.0% | x (の at rank 2: 10.2%)
Particle | <mask>東京に住んでいる。 | は | もまた | 11.4% | x (は at rank 2: 10.0%)
Okurigana | 彼は会議に出<mask>した。 | 席 | 発 | 21.6% | x (席 at rank 2: 14.8%)
Okurigana | この問題を解<mask>するのは難しい。 | 決 | こうと | 20.0% | x
Proper noun | <mask>島県は四国にある。 | 徳 | - | 5.4% | x
Proper noun | ノーベル<mask>学賞を受賞した。 | 物理/etc. | 生理 | 34.1% | △ (生理学賞 is also valid)
Context | 彼は毎朝6時に<mask>きる。 | 起 | 起きて | 5.6% | △ (token boundary difference)
Context | この薬は食後に<mask>んでください。 | 飲 | 飲 | 84.6% | o

Baseline Observations

  • Strong context (medicine → 飲む/drink) achieves high accuracy without fine-tuning
  • Particle inference gets the direction right but often fails to rank top
  • Proper nouns (place names) are weak — single-character defects at sentence start are especially poor
  • Mid-compound defects like “解決する” have poor compatibility with subword tokenization

The interesting question is how fine-tuning on the Kyoto University corpus changes accuracy for particles and case relations.

Preparing the Kyoto University Corpus

Clone WikipediaAnnotatedCorpus and examine the data:

git clone https://github.com/ku-nlp/WikipediaAnnotatedCorpus.git

Directory structure:

  • knp/: KNP-format annotated data (3,979 files)
  • org/: Raw text
  • id/: Train/dev/test splits (train 3,679, dev 100, test 200)

Inside KNP Format

Opening the Ashikaga Takauji Wikipedia article (wiki00010002.knp):

# S-ID:wiki00010002-00-01
* 3D
+ 1D
足利 あしかが 足利 名詞 6 人名 5 * 0 * 0 NIL <NE:PERSON:head>
+ 8D <NE:PERSON:足利 尊氏>
尊氏 たかうじ 尊氏 名詞 6 人名 5 * 0 * 0 NIL <NE:PERSON:tail>
は は は 助詞 9 副助詞 2 * 0 * 0 NIL
...
+ -1D <rel type="ガ" target="尊氏" .../>
武将 ぶしょう 武将 名詞 6 普通名詞 1 * 0 * 0 NIL

Each plain line is one morpheme: surface form, reading, base form, and part of speech. Lines starting with * mark phrase (bunsetsu) units and + marks basic-phrase units, each with its dependency target; case analysis (nominative, accusative, dative, ...) appears as <rel> tags on the + lines, and <NE:...> marks named entity tags.

Parsing with rhoknp

Using the rhoknp library recommended in the README (pyknp is outdated):

pip install rhoknp
from rhoknp import Document

with open("knp/wiki0001/wiki00010002.knp") as f:
    doc = Document.from_knp(f.read())

for sent in doc.sentences:
    text = "".join(m.text for m in sent.morphemes)
    print(text)
    # => 足利 尊氏は、鎌倉時代末期から室町時代前期の武将。

Case analysis is also accessible:

Sentence: 足利 尊氏は、鎌倉時代末期から室町時代前期の武将。
  [末期から] --ノ--> 時代
  [武将。]   --ガ--> 尊氏
  [武将。]   --カラ--> 末期
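
That dump comes from walking each sentence’s basic phrases and the <rel> annotations attached to them. A sketch, assuming rhoknp exposes them as BasePhrase.rel_tags with type/target attributes (check the rhoknp docs for the exact names):

for sent in doc.sentences:
    for bp in sent.base_phrases:
        # <rel type="ガ" target="尊氏"/> style annotations on this basic phrase
        for rel in bp.rel_tags:
            print(f"  [{bp.text}] --{rel.type}--> {rel.target}")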

Text Extraction

Extracted plain text from all KNP files, filtering out short sentences under 10 characters (like parenthetical readings):

Split | Sentence count
train | 10,243
dev | 312
test | 545
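
The extraction itself is a short loop over the KNP files. A sketch (the train/dev/test split using the id/ lists is omitted; paths are illustrative):

from pathlib import Path
from rhoknp import Document

sentences = []
for knp_path in Path("WikipediaAnnotatedCorpus/knp").glob("*/*.knp"):
    doc = Document.from_knp(knp_path.read_text())
    for sent in doc.sentences:
        text = "".join(m.text for m in sent.morphemes)
        if len(text) >= 10:  # drop short fragments such as parenthetical readings
            sentences.append(text)

Path("train.txt").write_text("\n".join(sentences))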

Fine-tuning

Training Configuration

BATCH_SIZE = 8
EPOCHS = 3
LR = 5e-5
MAX_LEN = 128
MASK_PROB = 0.15  # Randomly mask 15% of tokens

Standard MLM (Masked Language Modeling): randomly select 15% of input tokens as prediction targets and have the model predict the originals. Of those targets, 80% are replaced with <mask>, 10% with a random token, and 10% are kept as-is.
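
In Transformers, this 80/10/10 masking scheme is exactly what DataCollatorForLanguageModeling implements, so the run can be wired up with the Trainer API. A sketch of one way to do it under the config above (assumes the datasets and accelerate packages are also installed):

from datasets import Dataset
from transformers import DataCollatorForLanguageModeling, Trainer, TrainingArguments

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

train_ds = Dataset.from_dict({"text": open("train.txt").read().splitlines()}).map(tokenize, batched=True)
dev_ds = Dataset.from_dict({"text": open("dev.txt").read().splitlines()}).map(tokenize, batched=True)

# 15% masking with the standard 80/10/10 mask / random / keep split
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="luke-ocr-ft",
    per_device_train_batch_size=8,
    num_train_epochs=3,
    learning_rate=5e-5,
    eval_strategy="epoch",
    fp16=True,
)
Trainer(model=model, args=args, train_dataset=train_ds,
        eval_dataset=dev_ds, data_collator=collator).train()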

Training Results

Epoch | Train Loss | Dev Loss | Time | VRAM Peak
1 | 2.050 | 1.943 | 203s | 2,677 MB
2 | 1.919 | 1.933 | 200s | 2,677 MB
3 | 1.692 | 1.853 | 202s | 2,677 MB

3 epochs total in ~10 minutes. Only 2.7GB of the 8GB VRAM used — room to increase batch size to 16 or 32.

Baseline vs. Fine-tuned Comparison

Same test sentences before and after fine-tuning. “o” if the correct answer is in the Top 3:

Test | Answer | Baseline Top1 | FT Top1 | Improved
本能寺で<mask>した | 死去 | 死去(32.0%) | 修行(20.7%) | x Worse
日本の<mask>である | 首都 | 首都(60.1%) | 首都(85.6%) | o Large gain
パンを<mask>べた | 食 | た(64.7%) | た(69.9%) | o Slight gain
大学で<mask>を学ぶ | 数学/etc. | 数学(7.6%) | 数学(16.5%) | o Improved
日本語<mask>文法は | の | :(16.0%) | では(13.7%) | x No change
<mask>東京に | は | もまた(11.4%) | 自身は(38.2%) | x Worse
出<mask>した | 席 | 発(21.6%) | 席(82.6%) | o Large gain
解<mask>するのは | 決 | こうと(20.0%) | こうと(25.2%) | x No change
<mask>島県は四国 | 徳 | -(5.4%) | (24.3%) | x No change
ノーベル<mask>学賞 | 物理/etc. | 生理(34.1%) | 生理(75.5%) | △ Confidence up
6時に<mask>きる | 起 | 起きて(5.6%) | 起(27.8%) | o Large gain
食後に<mask>んで | 飲 | 飲(84.6%) | 飲(79.0%) | o Maintained

Observations

Improved:

  • “出席した”: 14.8% → 82.6% (moved to Top1)
  • “首都”: 60.1% → 85.6% (confidence up)
  • “起きる”: Outside Top3 → 27.8% at Top1 (large gain)
  • “数学”: 7.6% → 16.5% (confidence doubled)

Degraded:

  • “死去した”: 32.0% → 5.5%, “修行” moved to Top1
  • Single-particle inference still weak

No change:

  • Single-character defects in proper nouns (“徳島県”) still fail
  • Mid-compound defects (“解決する”) also no improvement

Since the Kyoto corpus is Wikipedia encyclopedic text, encyclopedic fill-ins (“首都”, “出席”) improved substantially. Standalone particles and partial proper noun defects need a different approach — character-level models or OCR-specific training data.

The Wikipedia-Feeding-Wikipedia Problem

At this point I realized: LUKE’s pre-training data is Wikipedia + BookCorpus, and the Kyoto corpus is itself extracted from ~4,000 Wikipedia articles. In other words, I was feeding LUKE a domain it had already learned, without using any of the annotations.

This training only used raw text from the corpus — the case analysis (nominative, accusative, dative relations), coreference analysis (pronoun referents), and named entity tags were all unused. The real value of the Kyoto corpus is those annotations, so getting proper value requires multi-task learning for case label prediction or named entity classification.

Directions for better results:

  1. Design tasks that actually use the annotations
  2. Train on non-Wikipedia domains (historical documents, government documents, etc.)
  3. Switch to a base model trained on non-Wikipedia data

Comparison with Tohoku University BERT v3

Also testing cl-tohoku/bert-base-japanese-v3 — same Wikipedia-based pre-training as LUKE, but different architecture and tokenizer:

  • LUKE: SentencePiece tokenizer (subword splitting)
  • Tohoku BERT: MeCab + WordPiece (morpheme-based splitting)

BERT v3 has the same Wikipedia overlap problem since it was pre-trained on Wikipedia + CC-100. But the tokenizer difference may affect results.

Additionally need fugashi (MeCab Python binding) and unidic-lite:

pip install fugashi unidic-lite
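
With fugashi and unidic-lite in place, the tokenizer difference is easy to see by splitting one of the test sentences with both tokenizers:

from transformers import AutoTokenizer, MLukeTokenizer

luke_tok = MLukeTokenizer.from_pretrained("studio-ousia/luke-japanese-base-lite")
bert_tok = AutoTokenizer.from_pretrained("cl-tohoku/bert-base-japanese-v3")

text = "日本語の文法は複雑である。"
print(luke_tok.tokenize(text))  # SentencePiece subwords (▁-prefixed pieces, particles often fused)
print(bert_tok.tokenize(text))  # MeCab morphemes + WordPiece (particles like の/は as standalone tokens)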

Training Results

Epoch | Train Loss | Dev Loss | Time | VRAM Peak
1 | 1.503 | 1.295 | 201s | 2,417 MB
2 | 1.370 | 1.395 | 198s | 2,417 MB
3 | 1.228 | 1.419 | 198s | 2,417 MB

Overall lower loss than LUKE. However dev loss slightly increased from Epoch 2→3, showing a mild overfitting tendency.

Four-Model Comparison

Comparing all four patterns: LUKE (baseline / FT) and Tohoku BERT (baseline / FT):

Test | Answer | LUKE raw | LUKE FT | BERT raw | BERT FT
本能寺で○した | 死去 | 死去(32%) | 修行(21%) | 自害(46%) | 自害(34%)
日本の○である | 首都 | 首都(60%) | 首都(86%) | 地名(51%) | 首都(67%)
パンを○べた | 食 | た(65%) | た(70%) | お(2%) | 焼く(23%)
大学で○を学ぶ | 数学/etc. | 数学(8%) | 数学(17%) | 哲学(8%) | 哲学(14%)
日本語○文法 | の | :(16%) | では(14%) | の(99%) | の(99%)
彼○東京に | は | もまた(11%) | 自身は(38%) | は(81%) | は(86%)
出○した | 席 | 発(22%) | 席(83%) | ##頭(62%) | ##頭(34%)
解○するのは | 決 | こうと(20%) | こうと(25%) | と(34%) | ##法(48%)
○島県は四国 | 徳 | -(5%) | (24%) | 淡路(4%) | 十(17%)
ノーベル○学賞 | 物理/etc. | 生理(34%) | 生理(76%) | 物理(80%) | 物理(51%)
6時に○きる | 起 | 起きて(6%) | 起(28%) | しゃべり(7%) | 始まり(6%)
食後に○んで | 飲 | 飲(85%) | 飲(79%) | 飲ま(44%) | お(34%)

Model Characteristic Differences

Tohoku BERT uses a MeCab morpheme analysis tokenizer, so it’s overwhelmingly better at particle inference (“の”, “は”). Morpheme-level tokenization means particles exist as independent single tokens.

LUKE uses SentencePiece subword splitting, which makes it more flexible at character-level fill-in (“出○した” → “席”); that is where LUKE has the edge.

Aspect | LUKE | Tohoku BERT
Particle inference | Weak | Very strong
Character-level fill-in | Flexible | Affected by ## subword prefixes
Proper nouns | Weak | Weak (both)
FT improvement range | Large gain in some cases | Limited (Wikipedia overlap)

Neither model showed substantial improvement from Kyoto corpus fine-tuning. The Wikipedia overlap problem exists equally for Tohoku BERT.

Testing on Actual OCR Output

The fill-mask tests so far assumed we know where the <mask> goes. Real OCR doesn’t tell you “this is a defect” — a completely different approach is needed.

Setting Up NDLOCR-Lite

Installing the National Diet Library’s lightweight OCR NDLOCR-Lite in WSL2. For setup details see Running NDLOCR-Lite on Windows.

git clone https://github.com/ndl-lab/ndlocr-lite.git
cd ndlocr-lite
pip install -r requirements.txt

Running OCR on a sample image (1963 National Diet Library staff manual). CPU-only, 1.5 seconds:

cd src
python ocr.py --sourceimg ../resource/digidepo_2531162_0024.jpg --output /root/ocr-output

OCR Output and Misreads

Part of the output text:

(z)気送子送付管
気送子送付には、上記気送管にて送付するものと、空
気の圧縮を使用せず,直接落下させる装置の二通りがあ
る。後者の送付雪は出納台左側に設置されており.5
3.1の各層ステーションに直接落下するよう3本の管
が通じ投入ロのフタに層表示が記されている。取扱いに
当っては気送子投入優すみやかにフタを閉め速度を調整

Visually confirmed misreads:

  • “送付雪” → correct: “送付管”
  • “投入優” → correct: “投入後”
  • “投入ロ” → correct: “投入口” (katakana “ロ” → kanji “口”)
  • “待する” → correct: “待する”
  • “(z)” → correct: “(ヱ)”

Perplexity-Based Correction

Fill-mask assumes we know the <mask> position, but OCR correction requires finding “what’s wrong” first.

The approach: mask each token one at a time and calculate “the probability that this character belongs at this position.” Tokens with extremely low probability (threshold: under 1%) are contextually inconsistent — likely misreads.

import torch

def check_line(text, tokenizer, model, threshold=0.01):
    """Mask each token in turn and flag tokens the model finds implausible."""
    encoding = tokenizer(text, return_tensors="pt")
    input_ids = encoding["input_ids"][0]
    attention_mask = encoding["attention_mask"].to("cuda")
    suspects = []

    # Skip the special tokens at the start and end ([CLS]/<s>, [SEP]/</s>)
    for i in range(1, len(input_ids) - 1):
        # Replace token i with the mask token
        masked_ids = input_ids.clone().unsqueeze(0).to("cuda")
        original_id = masked_ids[0, i].item()
        masked_ids[0, i] = tokenizer.mask_token_id

        with torch.no_grad():
            outputs = model(input_ids=masked_ids, attention_mask=attention_mask)

        # Probability the model assigns to the original token at this position
        probs = torch.softmax(outputs.logits[0, i], dim=-1)
        original_prob = probs[original_id].item()

        if original_prob < threshold:
            # Suspicious token -> keep the top-3 candidates as correction suggestions
            top = probs.topk(3)
            suspects.append({
                "position": i,
                "token": tokenizer.convert_ids_to_tokens(original_id),
                "prob": original_prob,
                "candidates": tokenizer.convert_ids_to_tokens(top.indices.tolist()),
            })
    return suspects
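
Running this over each OCR line with a given tokenizer/model pair produces the per-model results compared below. For example (the sample line is taken from the OCR output above):

suspects = check_line("後者の送付雪は出納台左側に設置されており.", tokenizer, model)
for s in suspects:
    print(s["position"], s["token"], f"{s['prob']:.4f}", s["candidates"])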

Misreads Detected by All Four Models

Ran the same OCR text through all four models (LUKE raw / LUKE FT / BERT raw / BERT FT) and compared detection on key misread locations:

OCR text | Correct | LUKE raw | LUKE FT | BERT raw | BERT FT
送付雪 | 管 | 箱(38%) | 機(29%) | 機(17%) | ポスト(9%)
投入優 | 後 | 後(38%) | 後(44%) | 後(73%) | 後(84%)
投入ロ | 口 | 口(61%) | 口(67%) | 口(84%) | 口(56%)

All four models flagged all three locations as “suspicious” (probability under 1%). High reproducibility.

“優→後” was most confident with BERT FT at 84%. “ロ→口” was BERT raw at 84%. Both had the correct candidate as Top1.

“雪→管” didn’t produce “管” as a top candidate, only “箱” or “機”. Not a correct fix, but the detection itself (“雪 is wrong here”) worked. Given that “気送子送付管” is specialized terminology, a general language model preferring “機” or “箱” over “管” is reasonable.

Correction Observations

  • The technique itself — perplexity-based detection — is more important than fine-tuning
  • All four models flag the same locations as suspicious, minimizing model-to-model variation
  • Tohoku BERT generally has higher correction candidate accuracy (“後” 73–84%, “口” 84%)
  • Specialized term correction candidates go off-target (“送付管” isn’t in general vocabulary)
  • False positives (flagging correct text as suspicious) are frequent: specialized terms like “気送子”, “ステーション”, and proper nouns all get flagged

Three-Stage Pipeline: BERT Detection → Small LLM Correction

BERT detection accuracy is sufficient, but having humans manually check every flagged location isn’t realistic. Can we combine a small local LLM to automate the detection→correction pipeline?

Setting Up ollama

Installing ollama in WSL2 and pulling small models:

curl -fsSL https://ollama.com/install.sh | sh
ollama pull qwen2.5:1.5b  # 986MB
ollama pull qwen2.5:3b    # 1.9GB

BERT uses ~2.4GB of VRAM, so the two stages run sequentially: BERT is unloaded after detection, and ollama then takes the GPU for LLM inference.
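
The LLM side is driven through ollama’s local HTTP API (the standard /api/generate endpoint); a minimal sketch of how it is called from Python (the helper name ask_llm is mine):

import requests

def ask_llm(prompt, model="qwen2.5:3b"):
    # ollama serves a REST API on localhost:11434; stream=False returns one JSON object
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["response"].strip()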

v1: Request Corrections Per Line

First approach: detect suspicious locations with BERT, then send the entire line to LLM with that info and ask it to correct:

Result: Complete failure.

  • 1.5B: Ignores original text and rewrites completely. “気送管” → “真空管”, “気送管による” → “気圧管による” — generates a different document
  • 3B: Better than 1.5B, but still rewrites arbitrarily. “目的層の” → “目的地の”, “消える” → “点灯する” (opposite meaning)

Small LLMs asked to “correct the whole line” generate freely within the context window and can’t preserve the original. Same problem experienced with Qwen/Swallow before.

v2: Yes/No Judgment Per Character

For each token BERT detected, ask 3B individually: “Is this correction correct? YES or NO?”

Result: All NO.

Line[5] "ロ"→"口" (84%): LLM=NO -> Rejected
Line[6] "優"→"後" (73%): LLM=NO -> Rejected

All 35 cases NO — including correct corrections like “優→後” and “ロ→口”. Small models have a conservative bias: “when unsure, don’t change.” Binary YES/NO classification is a “high-stakes” task for small models, and they default to NO.

v3: A/B Selection + Enhanced Filtering

Instead of YES/NO, show the original and corrected lines side by side and ask “which is correct, A or B?” Also filter BERT detection results:

  • Exclude cases where BERT’s Top1 candidate is punctuation, particle, or common verb (many cross-line false positives)
  • Exclude Top1 probability under 30%
  • Exclude ##-prefixed subword tokens
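
The A/B prompt itself is simple; a sketch of its shape (wording illustrative, not the exact prompt; ask_llm is the helper sketched earlier, line and corrected_line are hypothetical variables):

def ab_prompt(original_line, corrected_line):
    # Show both versions and force a one-letter answer
    return (
        "以下はOCRで読み取った日本語の文です。AとBのどちらが正しいですか。\n"
        f"A: {original_line}\n"
        f"B: {corrected_line}\n"
        "AかBの一文字だけで答えてください。"
    )

answer = ask_llm(ab_prompt(line, corrected_line), model="qwen2.5:3b")
verdict = "FIX" if answer.startswith("B") else "KEEP"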

After filtering: 14 cases → LLM approved 5 corrections:

Line | Original | Corrected | BERT prob | LLM verdict | Correct?
6 | 優 | 後 | 73% | B (accepted) | Correct
17 | 交通 | 通行 | 87% | B (accepted) | Correct
4 | 落下 | 接続 | 32% | B (accepted) | Wrong correction
3 | ##納 | 雪 | 53% | B (accepted) | Meaningless
16 | ##子 | | 47% | B (accepted) | Meaningless

2 correct corrections, 1 wrong correction, 2 meaningless corrections from subword issues.

What Was Missed

  • “投入ロ→口”: BERT detected it at 84%, but the LLM chose A (keep original). Katakana “ロ” (U+30ED) and kanji “口” (U+53E3) are visually similar, and 3B can’t distinguish them
  • “二”: BERT detected it at 100%, but the LLM rejected the fix
  • “送付雪”: BERT’s Top1 wasn’t “管”, so the correction candidate itself was off

Pipeline Assessment

The three-stage pipeline direction is correct, but 3B-class local models have accuracy limits.

Detection (BERT):

  • Perplexity-based detection is stable. Four models flag the same locations — high reproducibility
  • Threshold and filter tuning can reduce false positives to some extent
  • But specialized terms (“気送子”, “ステーション”) will keep getting flagged structurally

Correction (small LLM):

  • 1.5B: Destroys original. Unusable for correction
  • 3B (whole line): Tends to rewrite original
  • 3B (YES/NO): Conservative bias of all-NO
  • 3B (A/B selection): Some correct judgments, but ~50% accuracy

Practically, presenting BERT’s detection results (suspicious locations + top candidates) to humans for final judgment is the optimal solution at this point. Full automation requires 7B+ models or OCR-specific fine-tuning.

7B Model as a Comeback

If 3B isn’t enough, what about 7B? Qwen2.5:7b is 4.7GB at 4-bit quantization — fits in 8GB VRAM if BERT is unloaded.

Does It Know “気送子”?

First, a vocabulary test:

Q: What is "気送子"?

3B: "気送子" is Chinese slang for someone who is gentle and considerate of others.
7B: "気送子" refers to a device or system that uses air to transport powders and fine particles.

3B’s answer is pure hallucination; 7B’s is accurate. It’s a textbook case of parameter count correlating directly with vocabulary knowledge.

3B vs 7B: Same Test Accuracy

Compared 3B and 7B judgments on the 14 correction candidates BERT detected, with human-assigned ground truth:

Line | Original | Candidate | BERT prob | Correct | 3B | 7B
0 | ##子 | 管 | 63% | KEEP | x | o
0 | 送付 | 真空 | 35% | KEEP | x | o
2 | 気 | 空気 | 78% | KEEP | x | o
3 | ##納 | 雪 | 53% | KEEP | x | o
4 | 落下 | 接続 | 32% | KEEP | x | o
5 | ロ | 口 | 84% | FIX | o | x
6 | 優 | 後 | 73% | FIX | o | o
10 | 呼ぶ | なる | 31% | KEEP | x | o
12 | 層 | 地 | 84% | KEEP | x | o
16 | ##子 | | 47% | KEEP | x | o
17 | 落下 | バス | 32% | KEEP | x | o
17 | 交通 | 通行 | 87% | FIX | o | o
21 | 請求 | 館 | 59% | KEEP | x | x
21 | | | 100% | FIX | o | o
Accuracy | | | | | 29% | 86%

7B got 12/14 correct. A jump from 29% to 86%.

What 7B Got Right

For KEEP judgments (where the original should be preserved) that 3B got completely wrong, 7B got 9 out of 10 correct. Knowing “気送子”, “出納台”, “落下” as vocabulary means it can say “no, this is actually correct” even when BERT flags it as suspicious.

For FIX judgments, “優→後”, “交通→通行”, “っ→つ” were all correctly accepted.

Walls 7B Couldn’t Break Through

Katakana “ロ” vs. kanji “口”:

7B also judged KEEP (keep original). For text-based LLMs, distinguishing katakana “ロ” (U+30ED) and kanji “口” (U+53E3) is fundamentally difficult. Even if the tokenizer treats them as separate tokens, semantically both “投入ロ” and “投入口” seem plausible. This is less an LLM limitation and more a structural problem with trying to recover information lost in the OCR “visual → text” conversion using only text. A multimodal model handling both image and text could potentially solve this.

“図書請求票” → “図書館票”:

7B made an incorrect judgment, pulled toward “図書館” (library). The specialized library term “図書請求票” (library request slip) apparently wasn’t sufficiently represented in 7B’s training data.

Model Size and Judgment Capability

Model | Size | Vocabulary | Conservative bias | Accuracy
Qwen2.5 1.5B | 986MB | Low (destroys original) | None (changes anything) | Unmeasurable
Qwen2.5 3B | 1.9GB | Low (doesn’t know 気送子) | Prompt-dependent, unstable | 29%
Qwen2.5 7B | 4.7GB | High (knows 気送子) | Appropriate (correctly conservative) | 86%

The large jump from 3B to 7B is because 7B knows Showa-era library terminology. If the model can’t understand the meaning of a correction candidate, there’s no way to judge it correctly.

Pipeline with Escalation

7B accuracy is 86%, but there are cases like “ロ→口” that 7B misses. BERT detected this at 84% confidence.

So: when 7B and BERT disagree, escalate to a human.

Three-Level Decision Rules

  1. AUTO-FIX: 7B says FIX → apply automatically
  2. ESCALATE: 7B says KEEP but BERT confidence ≥ 50% → pass to human
  3. AUTO-KEEP: 7B says KEEP and BERT confidence < 50% → keep automatically
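
As code, the routing is a few lines (a sketch; the 50% threshold is the one from rule 2):

def decide(bert_prob, llm_verdict, threshold=0.50):
    # Rule 1: trust the 7B model when it wants to fix something
    if llm_verdict == "FIX":
        return "AUTO-FIX"
    # Rule 2: 7B says KEEP but BERT is confident something is wrong -> human review
    if bert_prob >= threshold:
        return "ESCALATE"
    # Rule 3: both are relaxed about this token -> keep as-is
    return "AUTO-KEEP"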

Results

Line | Original | Candidate | BERT prob | 7B verdict | Action | Correct?
6 | 優 | 後 | 73% | FIX | AUTO-FIX | Correct
17 | 交通 | 通行 | 87% | FIX | AUTO-FIX | Correct
21 | | | 100% | FIX | AUTO-FIX | Correct
21 | 請求 | 館 | 59% | FIX | AUTO-FIX | Wrong correction
5 | ロ | 口 | 84% | KEEP | ESCALATE | Human can catch it
0 | ##子 | 管 | 63% | KEEP | ESCALATE | Human → KEEP
2 | 気 | 空気 | 78% | KEEP | ESCALATE | Human → KEEP
3 | ##納 | 雪 | 53% | KEEP | ESCALATE | Human → KEEP
12 | 層 | 地 | 84% | KEEP | ESCALATE | Human → KEEP
0 | 送付 | 真空 | 35% | KEEP | AUTO-KEEP | Correct
4 | 落下 | 接続 | 32% | KEEP | AUTO-KEEP | Correct
10 | 呼ぶ | なる | 31% | KEEP | AUTO-KEEP | Correct
16 | ##子 | | 47% | KEEP | AUTO-KEEP | Correct
17 | 落下 | バス | 32% | KEEP | AUTO-KEEP | Correct

Summary

  • Automated: 9 of 14 cases (64%) processed automatically. Accuracy 89% (8/9 correct)
  • Escalated to human: 5 cases. Only 1 actually needed correction (“ロ→口”)
  • Missed: 0. The “ロ→口” that 7B missed gets routed to human via ESCALATE

The only automated wrong correction is “図書請求票→図書館票.” At BERT confidence 59%, raising the threshold to 60% would route this to ESCALATE too.

What Gets Escalated

The 5 human-review cases:

  • Line[5] “ロ→口”: Needs correction. A human recognizes it immediately
  • Line[0] “##子→管”: Cross-line noise → KEEP
  • Line[2] “気→空気”: Cross-line break (空\n気 newline) → KEEP
  • Line[3] “##納→雪”: “送付雪” is actually an OCR error (correctly: 送付管) but BERT’s candidate is off → human corrects separately
  • Line[12] “層→地”: “目的層” is correct in this document (層 = floor) → KEEP

From 131 initial BERT detections, filtered to 14, then 9 processed automatically — humans only need to check 5. Compared to eyeballing all 131 locations, this is a substantial reduction.


The BERT perplexity scan → 7B LLM judgment → human escalation on disagreement pipeline is a realistic configuration that runs on a single RTX 4060 Laptop with 8GB VRAM. It’s not fully automated, but it dramatically reduces human workload.

Two remaining walls. “Visually similar characters” (ロ/口) are structurally limited with a text-only approach — multimodal models or character-level OCR confidence scores are needed. “Domain-specific specialized terms” (図書請求票) are beyond 7B’s coverage — dictionary or terminology list lookups are the practical solution.

For closed-network deployment, an NVIDIA Jetson Orin Nano (8GB unified memory, CUDA support, 7–15W power draw) could host this pipeline as-is: BERT and the 7B LLM swapping in and out, NDLOCR-Lite on ONNX Runtime, and ollama’s ARM64+CUDA support all work in Jetson’s Ubuntu-based environment. The Developer Kit is $249, around 50,000–60,000 yen at current exchange rates. A fully equipped Raspberry Pi 5 (8GB) also approaches 40,000 yen, so the price gap isn’t that large, and the Pi 5 can’t use CUDA (eGPU with ROCm on ARM is unsupported as well), which rules it out for general-purpose GPU compute.
